Computer Science, 2026, Vol. 53, Issue (6): 128-136. doi: 10.11896/jsjkx.250600137
• High Performance Computing •
WANG Yipin1, CAI Chenghuan1, XU Jiabin2, ZHOU Xuegong3, ZHANG Fengzhe3, CAO Wei3, ZHANG Fan3, YU Xinsheng4
| [1] | HAN Lin, DING Yongqiang, CUI Pingfei, LIU Haohao, LI Haoran, CHEN Mengyao. SLP Vectorization Across Basic Blocks Based on Region Partitioning [J]. Computer Science, 2025, 52(9): 186-194. |
| [2] | JIANG Jun, ZHAI Yanhe, ZENG Zhiheng, GU Yichao, HUANG Liangming. Loop-invariant Code Motion Algorithm Based on Loop Cost Analysis [J]. Computer Science, 2025, 52(6): 44-51. |
| [3] | GAO Wei, WANG Lei, LI Jianan, LI Shuailong, HAN Lin. Operator Fusion Optimization for Deep Learning Compiler TVM [J]. Computer Science, 2025, 52(5): 58-66. |
| [4] | WEI Xiaohui, GUAN Zeyu, WANG Chenyang, YUE Hengshan, WU Qi. Hardware-Software Co-design Fault-tolerant Strategies for Systolic Array Accelerators [J]. Computer Science, 2025, 52(5): 91-100. |
| [5] | XU Jinlong, GUI Zhonghua, LI Jia'nan, LI Yingying, HAN Lin. FP8 Quantization and Inference Memory Optimization Based on MLIR [J]. Computer Science, 2024, 51(9): 112-120. |
| [6] | LIU Lei, ZHOU Zhide, LIU Xingxiang, CHE Haoyang, YAO Lei, JIANG He. Automatic Tensorization for TPU Coarse-grained Instructions [J]. Computer Science, 2024, 51(6): 52-60. |
| [7] | PEI Xue, WEI Shuai, SHAO Yangxue, YU Hong, GE Chenyang. Compilation Optimization and Implementation of High-order Cryptographic Operators on FPGA [J]. Computer Science, 2024, 51(11A): 231200184-11. |
| [8] | FAN Lilin, QIAO Yihang, LI Junfei, CHAI Xuqing, CUI Rongpei, HAN Bingyu. CP2K Software Porting and Optimization Based on Domestic c86 Processor [J]. Computer Science, 2023, 50(6): 58-65. |
| [9] | WANG Xiaofeng, LI Chaoran, LU Kunfeng, LUAN Tianjiao, YAO Na, ZHOU Hui, XIE Yujia. Acceleration Design and FPGA Implementation of CNN Scene Matching Algorithm [J]. Computer Science, 2023, 50(11): 8-14. |
| [10] | FU Si-qing, LI Tie-jun, ZHANG Jian-min. Architecture Design for Particle Transport Code Acceleration [J]. Computer Science, 2022, 49(6): 81-88. |
| [11] | CAO Hao, GUO Shao-zhong, LIU Dan, XU Jin-chen. Automatic Porting of Basic Mathematics Library for 64-bit RISC-V [J]. Computer Science, 2021, 48(6): 41-47. |
| [12] | XIE Jing-ming, HU Wei-fang, HAN Lin, ZHAO Rong-cai, JING Li-na. Quantum Fourier Transform Simulation Based on “Songshan” Supercomputer System [J]. Computer Science, 2021, 48(12): 36-42. |
| [13] | CHI Hao-yu, CHEN Chang-bo. Prediction of Loop Tiling Size Based on Neural Network [J]. Computer Science, 2020, 47(8): 62-70. |
| [14] | ZHANG Yu, FENG Dan. Study and Design of Reconfigurable Embedded Computing System Based on Xilinx SoPC [J]. Computer Science, 2010, 37(5): 274-277. |
| [15] | . [J]. Computer Science, 2009, 36(3): 45-47. |