Computer Science ›› 2023, Vol. 50 ›› Issue (11A): 220900277-6. doi: 10.11896/jsjkx.220900277

• Computer Software & Architecture •

Transplantation and Optimization of Row-vector-matrix Multiplication in Complex Domain Based on FT-M7002

MO Shangfeng, ZHOU Zhenfen, HU Yonghua, XU Minmin, MAO Chunxian, YUAN Yudi   

  1. School of Computer Science and Engineering, Hunan University of Science and Technology, Xiangtan, Hunan 411201, China
    China Hunan Key Laboratory for Service Computing and Novel Software Technology, Xiangtan, Hunan 411201, China
  • Published: 2023-11-09
  • About author: MO Shangfeng, born in 1977, Ph.D., is a member of the China Computer Federation. His main research interests include DSP compilation and embedded systems.
  • Supported by:
    Research Projects of Hunan Provincial Department of Education (20B242) and Natural Science Foundation of Hunan Province, China (2017JJ3087).

Abstract: FT-M7002 is a high-performance DSP developed independently in China, with powerful vector processing capability. To exploit its performance advantages, an efficient VSIP function library needs to be ported to and optimized for FT-M7002. Row-vector-matrix multiplication in the complex domain is a frequently used algorithm in the VSIP library and is widely applied in digital communication, image processing, and other fields. This paper studies the optimization of complex-domain row-vector-matrix multiplication on the FT-M7002 DSP. Performance is improved by restructuring the computation so that it traverses the matrix by rows rather than by columns, and by applying vectorization, loop unrolling, and software pipelining. Test results show that the optimized vector C implementation achieves a speedup of 6.2 to 20.6 over the VSIP library function, and that the assembly-optimized implementation achieves a further speedup of 3.4 to 14.3 over the vector C implementation.
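
The column-to-row restructuring and loop unrolling named in the abstract can be illustrated with a minimal portable C sketch. This is not the authors' code and uses no FT-M7002 vector intrinsics or VSIP calls. The type `cplx` and the functions `cvmprod_col` and `cvmprod_row` are hypothetical names introduced here, and the sketch assumes a row-major matrix with interleaved real and imaginary parts. The first function walks each matrix column with stride n. The second scales one matrix row at a time and accumulates into the output with unit stride, which is the access pattern that SIMD loads and software pipelining generally favor.

```c
#include <stddef.h>

/* Hypothetical complex type (interleaved real/imaginary floats). */
typedef struct { float re, im; } cplx;

/* Baseline: y[j] = sum_i x[i] * A[i][j], walking down column j.
 * A is m x n, row-major, so consecutive reads are n elements apart. */
void cvmprod_col(const cplx *x, const cplx *A, cplx *y, size_t m, size_t n)
{
    for (size_t j = 0; j < n; ++j) {
        float re = 0.0f, im = 0.0f;
        for (size_t i = 0; i < m; ++i) {
            const cplx a = A[i * n + j];
            re += x[i].re * a.re - x[i].im * a.im;
            im += x[i].re * a.im + x[i].im * a.re;
        }
        y[j].re = re;
        y[j].im = im;
    }
}

/* Row-oriented variant: for each x[i], add x[i] * (row i of A) to y.
 * Reads and writes are unit-stride; the inner loop is unrolled by 2. */
void cvmprod_row(const cplx *x, const cplx *A, cplx *y, size_t m, size_t n)
{
    for (size_t j = 0; j < n; ++j) {
        y[j].re = 0.0f;
        y[j].im = 0.0f;
    }
    for (size_t i = 0; i < m; ++i) {
        const float xr = x[i].re, xi = x[i].im;
        const cplx *row = A + i * n;
        size_t j = 0;
        for (; j + 2 <= n; j += 2) {
            y[j].re     += xr * row[j].re     - xi * row[j].im;
            y[j].im     += xr * row[j].im     + xi * row[j].re;
            y[j + 1].re += xr * row[j + 1].re - xi * row[j + 1].im;
            y[j + 1].im += xr * row[j + 1].im + xi * row[j + 1].re;
        }
        for (; j < n; ++j) {  /* remainder when n is odd */
            y[j].re += xr * row[j].re - xi * row[j].im;
            y[j].im += xr * row[j].im + xi * row[j].re;
        }
    }
}
```

Both functions compute the same product; only the traversal order differs. On a real vector DSP the unrolled inner loop would typically be replaced by vector loads and multiply-accumulate instructions, which this sketch does not attempt to model.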

Key words: Matrix multiplication, Digital signal processor, SIMD, VSIPL

CLC Number: 

  • TP313