Computer Science ›› 2010, Vol. 37 ›› Issue (8): 168-171.

• Software Engineering •

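Optimizing Sparse Matrix-vector Multiplication Based on GPU

BAI Hong-tao (白洪涛), OUYANG Dan-tong (欧阳丹彤), LI Xi-ming (李熙铭), LI Ting (李亭), HE Li-li (何丽莉)

  1. (College of Computer Science and Technology, Jilin University, Changchun 130012); (Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012)
  • Online: 2018-12-01 Published: 2018-12-01
  • Supported by:
    This work was supported by the Major Program of the National Natural Science Foundation of China (60496320, 60496321), the National Natural Science Foundation of China (60973089, 60773097, 60873148), the Science and Technology Development Program of Jilin Province (20060532, 20080107), and an EU cooperation project (155776-EM-1-2009-1-IT-ERAMUNDUS-ECW-L72).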

Abstract: Sparse matrix computations make it difficult to exploit the computing power of modern graphics processing units (GPUs). Based on the compute unified device architecture (CUDA), we investigated a series of parallel optimizations in thread mapping, data reuse, and related areas, and used them to build a parallel sparse matrix-vector multiplication (SpMV) algorithm for the compressed sparse row (CSR) format. The optimizations are: (1) assigning a half-warp to compute each element of the result vector, exploiting the implicit synchronization of threads within a warp; (2) rounding read addresses so that memory accesses coalesce; (3) placing the input vector in texture memory for data reuse; (4) allocating page-locked memory to speed up data transfer; and (5) using shared memory to speed up data access. Experimental analysis shows that these measures improve performance. Compared with the CSR-vector algorithms in NVIDIA's CUDPP library and NVIDIA's SpMV library, our approach achieves higher memory bandwidth and floating-point throughput (GFLOPS). Its overall performance is more than three times faster than a serial CPU version.

Key words: Sparse matrix, Compressed sparse row, Graphics processing unit, Compute unified device architecture, Optimizations
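
To make the CSR layout and optimization (1) more concrete, the following is a minimal CPU-side sketch in C++, not the authors' CUDA kernel. It shows a serial CSR product of the kind the CPU baseline presumably computes, and an emulation of the half-warp mapping in which 16 "lanes" stride across one row and then combine their partial sums with a tree reduction. The names `CsrMatrix`, `spmv_serial`, `spmv_half_warp_emulated`, and `kHalfWarp` are hypothetical, and the `partial` array only stands in for the shared memory a real kernel might use.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// CSR (compressed sparse row) storage: the nonzeros of row r occupy
// positions [row_ptr[r], row_ptr[r + 1]) of col_idx and values.
struct CsrMatrix {
    std::size_t rows;
    std::vector<int> row_ptr;     // rows + 1 entries
    std::vector<int> col_idx;     // column index of each nonzero
    std::vector<double> values;   // value of each nonzero
};

// Serial baseline: y = A * x, one row at a time.
std::vector<double> spmv_serial(const CsrMatrix& a, const std::vector<double>& x) {
    assert(a.row_ptr.size() == a.rows + 1);
    std::vector<double> y(a.rows, 0.0);
    for (std::size_t r = 0; r < a.rows; ++r) {
        double sum = 0.0;
        for (int k = a.row_ptr[r]; k < a.row_ptr[r + 1]; ++k) {
            sum += a.values[k] * x[a.col_idx[k]];
        }
        y[r] = sum;
    }
    return y;
}

// Assumed half-warp width (16 threads), emulated here with plain loops.
constexpr int kHalfWarp = 16;

// CPU emulation of "one half-warp per result element".
std::vector<double> spmv_half_warp_emulated(const CsrMatrix& a,
                                            const std::vector<double>& x) {
    assert(a.row_ptr.size() == a.rows + 1);
    std::vector<double> y(a.rows, 0.0);
    for (std::size_t r = 0; r < a.rows; ++r) {
        std::array<double, kHalfWarp> partial{};  // stand-in for shared memory
        // Each lane starts at its own offset and advances by 16, so that on a
        // GPU neighbouring lanes would read neighbouring nonzeros.
        for (int lane = 0; lane < kHalfWarp; ++lane) {
            for (int k = a.row_ptr[r] + lane; k < a.row_ptr[r + 1]; k += kHalfWarp) {
                partial[lane] += a.values[k] * x[a.col_idx[k]];
            }
        }
        // Tree reduction of the 16 partial sums into partial[0].
        for (int stride = kHalfWarp / 2; stride > 0; stride /= 2) {
            for (int lane = 0; lane < stride; ++lane) {
                partial[lane] += partial[lane + stride];
            }
        }
        y[r] = partial[0];
    }
    return y;
}
```

Both functions should return the same vector for the same input. On a GPU, the two inner lane loops would run concurrently across the threads of a half-warp rather than sequentially.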
