Computer Science ›› 2023, Vol. 50 ›› Issue (11A): 220900277-6. doi: 10.11896/jsjkx.220900277

• Computer Software & Architecture •

Transplantation and Optimization of Row-vector-matrix Multiplication in Complex Domain Based on FT-M7002

MO Shangfeng, ZHOU Zhenfen, HU Yonghua, XU Minmin, MAO Chunxian, YUAN Yudi

  1. School of Computer Science and Engineering, Hunan University of Science and Technology, Xiangtan, Hunan 411201, China
  2. Hunan Key Laboratory for Service Computing and Novel Software Technology, Xiangtan, Hunan 411201, China
  • Published: 2023-11-09
  • Corresponding author: MO Shangfeng (mosfxy@foxmail.com)
  • About author: MO Shangfeng, born in 1977, Ph.D, is a member of China Computer Federation. His main research interests include DSP compilation and embedded systems.
  • Supported by:
    Research Projects of Hunan Provincial Department of Education (20B242) and Natural Science Foundation of Hunan Province, China (2017JJ3087).

Abstract: FT-M7002 is a high-performance DSP independently developed in China, with powerful vector processing capability. To exploit its performance advantages, an efficient VSIP function library needs to be ported to and optimized for FT-M7002. Row-vector-matrix multiplication in the complex domain is a frequently used algorithm in the VSIP library and is widely applied in digital communication, image processing and other fields. This paper studies the optimization of complex-domain row-vector-matrix multiplication on the FT-M7002 DSP and improves its performance by computing along the rows of the matrix instead of along its columns, combined with vectorization, loop unrolling and software pipelining. Test results show that the optimized vector C implementation achieves a speedup of 6.2 to 20.6 over the VSIP library function, and that the assembly-optimized implementation achieves a further speedup of 3.4 to 14.3 over the vector C implementation.

Key words: Matrix multiplication, Digital signal processor, SIMD, VSIPL
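
The change from column-wise to row-wise computation mentioned in the abstract can be illustrated with a minimal scalar C sketch. This is not the authors' vector C or assembly code and uses no FT-M7002 intrinsics. The type `cf32`, the function names `cvmprod_by_columns` and `cvmprod_by_rows`, and the row-major interleaved real/imaginary layout are assumptions made only for illustration. The second function walks each matrix row contiguously and unrolls the inner loop by two, which is the access pattern that vectorization typically favors.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical single-precision complex type (interleaved re/im layout assumed). */
typedef struct { float re, im; } cf32;

/* Baseline: y[j] is the dot product of x with column j of the m-by-n matrix a.
 * With row-major storage, a[i * n + j] is read with a stride of n elements. */
void cvmprod_by_columns(const cf32 *x, const cf32 *a, cf32 *y, size_t m, size_t n)
{
    assert(x && a && y);
    for (size_t j = 0; j < n; ++j) {
        cf32 acc = {0.0f, 0.0f};
        for (size_t i = 0; i < m; ++i) {
            const cf32 v = a[i * n + j];
            acc.re += x[i].re * v.re - x[i].im * v.im;
            acc.im += x[i].re * v.im + x[i].im * v.re;
        }
        y[j] = acc;
    }
}

/* Row-oriented variant: y += x[i] * (row i of a), for each i.
 * The inner loop reads and writes contiguous memory and is unrolled by 2. */
void cvmprod_by_rows(const cf32 *x, const cf32 *a, cf32 *y, size_t m, size_t n)
{
    assert(x && a && y);
    for (size_t j = 0; j < n; ++j) {
        y[j].re = 0.0f;
        y[j].im = 0.0f;
    }
    for (size_t i = 0; i < m; ++i) {
        const cf32 xi = x[i];
        const cf32 *row = a + i * n;
        size_t j = 0;
        for (; j + 1 < n; j += 2) {
            y[j].re     += xi.re * row[j].re     - xi.im * row[j].im;
            y[j].im     += xi.re * row[j].im     + xi.im * row[j].re;
            y[j + 1].re += xi.re * row[j + 1].re - xi.im * row[j + 1].im;
            y[j + 1].im += xi.re * row[j + 1].im + xi.im * row[j + 1].re;
        }
        for (; j < n; ++j) {  /* remainder when n is odd */
            y[j].re += xi.re * row[j].re - xi.im * row[j].im;
            y[j].im += xi.re * row[j].im + xi.im * row[j].re;
        }
    }
}
```

Both functions compute the same product y = x·A. Only the traversal order differs, so any speedup from the second form on a given target depends on its memory system and vector units.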

CLC Number:

  • TP313