Transplantation and Optimization of Row-vector-matrix Multiplication in Complex Domain Based on FT-M7002

doi:10.11896/jsjkx.220900277

Computer Science, 2023, Vol. 50, Issue (11A): 220900277-6. doi: 10.11896/jsjkx.220900277

• Computer Software & Architecture •

MO Shangfeng, ZHOU Zhenfen, HU Yonghua, XU Minmin, MAO Chunxian, YUAN Yudi

1 School of Computer Science and Engineering, Hunan University of Science and Technology, Xiangtan, Hunan 411201, China
2 Hunan Key Laboratory for Service Computing and Novel Software Technology, Xiangtan, Hunan 411201, China

Published: 2023-11-09
Corresponding author: MO Shangfeng (mosfxy@foxmail.com)
About author: MO Shangfeng, born in 1977, Ph.D., is a member of China Computer Federation. His main research interests include DSP compilation and embedded systems.
Supported by:
Research Projects of Hunan Provincial Department of Education (20B242) and Natural Science Foundation of Hunan Province, China (2017JJ3087).

Abstract

Abstract: FT-M7002 is a high-performance DSP independently developed in China, with powerful vector processing capability. To exploit its performance advantages, an efficient VSIP function library needs to be ported to and optimized for FT-M7002. Row-vector-matrix multiplication in the complex domain is a frequently used algorithm in the VSIP library and is widely used in digital communication, image processing and other application fields. This paper studies the optimization of complex-domain row-vector-matrix multiplication on the FT-M7002 DSP. The algorithm's performance is improved by computing over the rows of the matrix instead of its columns, and by vectorization, loop unrolling and software pipelining. Test results show that the optimized vector C algorithm achieves a speedup of 6.2 to 20.6 over the VSIP library function, and that the assembly-optimized algorithm achieves a further speedup of 3.4 to 14.3 over the vector C algorithm.

Key words: Matrix multiplication, Digital signal processor, SIMD, VSIPL
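
The change from column-wise to row-wise computation described in the abstract can be illustrated with a minimal scalar C sketch. This is not the authors' vector C or assembly code and uses no FT-M7002 intrinsics. The `cfloat` type, the function names and the row-major storage of the matrix are assumptions made only for illustration.

```c
#include <stddef.h>

/* Hypothetical complex type for illustration; a VSIP library defines its own. */
typedef struct { float re, im; } cfloat;

/* Column-oriented form: c[j] is the dot product of a with column j of b.
   With b stored row-major (assumed), the inner loop reads b with stride n. */
void cvmprod_by_columns(const cfloat *a, const cfloat *b, cfloat *c,
                        size_t m, size_t n)
{
    for (size_t j = 0; j < n; ++j) {
        cfloat acc = {0.0f, 0.0f};
        for (size_t i = 0; i < m; ++i) {
            const cfloat x = a[i];
            const cfloat y = b[i * n + j];
            acc.re += x.re * y.re - x.im * y.im;
            acc.im += x.re * y.im + x.im * y.re;
        }
        c[j] = acc;
    }
}

/* Row-oriented form: c accumulates a[i] times row i of b.
   The inner loop is unit-stride and its iterations are independent,
   which is the shape that SIMD lanes and loop unrolling can exploit. */
void cvmprod_by_rows(const cfloat *a, const cfloat *b, cfloat *c,
                     size_t m, size_t n)
{
    for (size_t j = 0; j < n; ++j) {
        c[j].re = 0.0f;
        c[j].im = 0.0f;
    }
    for (size_t i = 0; i < m; ++i) {
        const cfloat x = a[i];
        const cfloat *row = b + i * n;
        for (size_t j = 0; j < n; ++j) {
            c[j].re += x.re * row[j].re - x.im * row[j].im;
            c[j].im += x.re * row[j].im + x.im * row[j].re;
        }
    }
}
```

Both functions compute the same 1×n result from a 1×m vector and an m×n matrix. Only the loop order differs, so the sketch shows the data-access pattern rather than the reported speedups.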

CLC number: TP313

Cite this article

MO Shangfeng, ZHOU Zhenfen, HU Yonghua, XU Minmin, MAO Chunxian, YUAN Yudi. Transplantation and Optimization of Row-vector-matrix Multiplication in Complex Domain Based on FT-M7002[J]. Computer Science, 2023, 50(11A): 220900277-6. https://doi.org/10.11896/jsjkx.220900277


Link to this article: https://www.jsjkx.com/CN/10.11896/jsjkx.220900277

https://www.jsjkx.com/CN/Y2023/V50/I11A/220900277

