Optimizing SPMM on ARM Architectures with JIT Instruction Generation

doi:10.11896/jsjkx.251000116

Abstract

In recent years, ARM processors have been rapidly adopted in edge computing devices and cloud servers, and sparse matrix-matrix multiplication (SPMM) has become increasingly critical in compute-intensive applications such as deep learning. As a result, sparse computing optimization for ARM platforms has become an active research topic. However, current SPMM solutions for ARM platforms primarily employ the ahead-of-time (AOT) compilation model, in which all compilation is completed before program execution. AOT solutions for SPMM face three major limitations: unnecessary memory accesses, additional branch overhead, and redundant instructions. This paper proposes ASPJIT, a just-in-time (JIT) assembly code generation framework designed to accelerate SPMM computation on ARM multi-core CPUs. ASPJIT dynamically optimizes the computation sequence using a sparse judgment-based column-to-row algorithm, leveraging runtime analysis of sparsity features to significantly improve instruction-level parallelism (ILP). Additionally, ASPJIT reduces memory access latency by employing a register allocation strategy that caches frequently accessed data, and it maximizes arithmetic throughput by carefully selecting SIMD instructions. ASPJIT is evaluated against two AOT baselines. The first is an existing SPMM implementation compiled with automatic vectorization using the ARM GCC compiler. The second uses the optimized SPMM routines provided by ARM Eigen. The results show that ASPJIT delivers average speedups of 3.8x and 5.6x over these baselines, respectively.

Key words: Sparse matrix-matrix multiplication (SPMM), Ahead-of-time (AOT), Just-in-time (JIT), Single instruction multiple data (SIMD), ARM multi-core CPUs

CLC Number: TP391

SHI Jun, WANG Qinglin, TIAN Feiyang, WANG Zhicheng, LI Runhua, LIU Jie. Optimizing SPMM on ARM Architectures with JIT Instruction Generation [J]. Computer Science, 2026, 53(6): 163-170.
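
To make the AOT limitations named in the abstract concrete, the following is a toy sketch in Python, not ASPJIT itself and not ARM assembly. It contrasts a generic CSR-based SPMM loop, which reads loop bounds and column indices from memory for every row, with a kernel whose source is generated at run time for one specific sparse matrix, so that the indices and values become constants and the inner loop branches disappear. The function names `spmm_csr` and `generate_spmm`, the CSR input layout, and the example matrix are all assumptions made for illustration.

```python
# Toy illustration only: pure Python with hypothetical names, not ASPJIT.

def spmm_csr(indptr, indices, data, B, n_cols):
    """Generic (AOT-style) CSR SPMM: C = A @ B, with A sparse and B dense."""
    n_rows = len(indptr) - 1
    C = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        # Loop bounds and column indices are loaded from memory for every row.
        for k in range(indptr[i], indptr[i + 1]):
            a, j = data[k], indices[k]
            for c in range(n_cols):
                C[i][c] += a * B[j][c]
    return C


def generate_spmm(indptr, indices, data, n_cols):
    """Emit straight-line source specialized to one sparse matrix and compile it."""
    n_rows = len(indptr) - 1
    lines = ["def kernel(B):",
             f"    C = [[0.0] * {n_cols} for _ in range({n_rows})]"]
    for i in range(n_rows):
        for k in range(indptr[i], indptr[i + 1]):
            for c in range(n_cols):
                # Row index, column index, and value are baked in as constants.
                lines.append(
                    f"    C[{i}][{c}] += {data[k]!r} * B[{indices[k]}][{c}]")
    lines.append("    return C")
    namespace = {}
    exec("\n".join(lines), namespace)  # compile the generated source at run time
    return namespace["kernel"]


# A = [[1, 0, 2], [0, 0, 3]] in CSR form, multiplied by a dense 3x2 matrix B.
indptr, indices, data = [0, 2, 3], [0, 2, 2], [1.0, 2.0, 3.0]
B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
kernel = generate_spmm(indptr, indices, data, n_cols=2)
print(kernel(B) == spmm_csr(indptr, indices, data, B, n_cols=2))
```

The generated kernel trades code size for fewer index loads and branches, which is the general trade-off a JIT approach exploits. How ASPJIT itself orders computation, allocates registers, and selects SIMD instructions is described only at the level of the abstract above and is not reproduced here.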

