Computer Science ›› 2022, Vol. 49 ›› Issue (5): 355-362. doi: 10.11896/jsjkx.210500226

• Interdiscipline & Frontier •


Deep Neural Network Operator Acceleration Library Optimization Based on Domestic Many-core Processor

GAO Jie1, LIU Sha2, HUANG Ze-qiang2, ZHENG Tian-yu3, LIU Xin2, QI Feng-bin2   

  1. 1 School of Cyberspace Security,Information Engineering University,Zhengzhou 450000,China
    2 Jiangnan Institute of Computing Technology,Wuxi,Jiangsu 214083,China
    3 School of Software,Shandong University,Jinan 250101,China
  • Received:2021-05-31 Revised:2021-06-27 Online:2022-05-15 Published:2022-05-06
  • Corresponding author: LIU Xin(yyylx@263.net)
  • About author:GAO Jie,born in 1997,postgraduate(george_jie_work@163.com).His main research interests include high performance computing and deep learning.
    LIU Xin,born in 1979,Ph.D,Ph.D supervisor,is a member of China Computer Federation.Her main research interests include parallel algorithms and parallel application software.
  • Supported by:
    National Natural Science Foundation of China(U1806205).

Abstract: Operator acceleration libraries tuned for specific hardware have become an indispensable part of deep learning frameworks and can deliver severalfold speedups for large-scale training and inference tasks.Current mainstream operator libraries are developed for GPU architectures and are incompatible with other heterogeneous designs.The SWDNN operator library,developed for the SW26010 processor,can neither fully exploit the performance of the upgraded SW26010 pro processor nor satisfy the demands of current large neural network models such as GPT-3 for large memory capacity and high memory access bandwidth.Targeting the architectural characteristics of the SW26010 pro processor and the training requirements of large neural network models,this paper proposes a three-level parallelization and operator task scheduling scheme based on multiple core groups,which satisfies the memory requirements of large model training while improving parallel efficiency and overall computing performance.A memory access optimization method featuring a three-level asynchronous pipeline and overlapped computation and memory access is also proposed,which significantly alleviates the memory access bottleneck of neural network operators.Based on these methods,this paper builds the SWTensor multi-core-group operator acceleration library on the SW26010 pro processor.Experiments on the natural language processing model GPT-2 show that typical computation-intensive and memory-access-intensive operators in SWTensor reach up to 90.4% and 88.7% of the theoretical peaks in single-precision floating-point performance and memory access bandwidth,respectively.
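
To make the "overlap of computation and memory access" concrete, below is a minimal double-buffering sketch in portable C. It illustrates the general technique only, not the SWTensor implementation: `dma_get`, `compute`, `TILE` and `NTILES` are hypothetical names and sizes invented for this sketch, and the synchronous `memcpy` merely stands in for the asynchronous DMA that a real SW26010 pro compute processing element would issue into its local device memory (LDM).

```c
/* Double-buffering sketch: prefetch tile t+1 while computing tile t.
 * dma_get() models an asynchronous DMA into local memory; here it is a
 * plain synchronous memcpy, so the "overlap" is structural only. */
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define TILE   256        /* illustrative tile size, not a SWTensor value */
#define NTILES 16

static float src[NTILES][TILE];   /* stands in for main-memory input  */
static float dst[NTILES][TILE];   /* stands in for main-memory output */

static void dma_get(float *ldm, const float *mem, size_t n) {
    /* on real hardware this would start an async DMA and return at once */
    memcpy(ldm, mem, n * sizeof(float));
}

static void compute(float *out, const float *in, size_t n) {
    for (size_t i = 0; i < n; i++)        /* placeholder operator kernel */
        out[i] = in[i] * 2.0f;
}

int main(void) {
    float ldm[2][TILE];                   /* the two local buffers */
    dma_get(ldm[0], src[0], TILE);        /* prefetch tile 0 */
    for (int t = 0; t < NTILES; t++) {
        int cur = t & 1;
        if (t + 1 < NTILES)               /* issue the fetch of tile t+1 ... */
            dma_get(ldm[cur ^ 1], src[t + 1], TILE);
        compute(dst[t], ldm[cur], TILE);  /* ... while tile t is computed */
        /* real hardware: wait on the DMA completion flag here, before the
         * next iteration reads ldm[cur ^ 1] */
    }
    printf("dst[5][7] = %.1f\n", dst[5][7]);
    return 0;
}
```

With a truly asynchronous copy, the transfer of tile t+1 proceeds in parallel with the computation on tile t, so transfer latency is hidden whenever the per-tile compute time exceeds the per-tile transfer time; the three-level asynchronous pipeline described above can be read as a deeper, multi-stage variant of this two-buffer pattern.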

Key words: Asynchronous flow, Deep neural network, Double-buffering, Load balancing, Operator acceleration library

CLC number: 

  • TP311
[1] HUANG G,LIU Z,VAN DER MAATEN L,et al.Densely Connected Convolutional Networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition(CVPR).2017.
[2] BROWN T B,MANN B,RYDER N,et al.Language Models are Few-Shot Learners[C]//Conference and Workshop on Neural Information Processing Systems.2020.
[3] NAUMOV M,MUDIGERE D,SHI H,et al.Deep Learning Recommendation Model for Personalization and Recommendation Systems[J].arXiv:1906.00091,2019.
[4] KOSSMANN J,SCHLOSSER R.Self-Driving Database Systems:A Conceptual Approach[J].Distributed and Parallel Databases,2020,38:1-2.
[5] YE D,CHEN G,ZHANG W,et al.Towards Playing Full MOBA Games with Deep Reinforcement Learning[C]//Conference and Workshop on Neural Information Processing Systems.2020.
[6] HEO L,FEIG M.High-Accuracy Protein Structures by Combining Machine-Learning with Physics-Based Refinement[J].Proteins:Structure,Function,and Bioinformatics,2020,5:637-642.
[7] CHETLUR S,WOOLLEY C,VANDERMERSCH P,et al.cuDNN:Efficient Primitives for Deep Learning[C]//Deep Learning and Representation Learning Workshop(NIPS 2014).2014.
[8] JIA Y,SHELHAMER E,DONAHUE J,et al.Caffe:Convolutional Architecture for Fast Feature Embedding[C]//Proceedings of the 22nd ACM International Conference on Multimedia(MM'14).2014.
[9] ABADI M,AGARWAL A,BARHAM P,et al.TensorFlow:Large-Scale Machine Learning on Heterogeneous Distributed Systems[J].arXiv:1603.04467,2016.
[10] PASZKE A,GROSS S,MASSA F,et al.PyTorch:An Imperative Style,High-Performance Deep Learning Library[J].arXiv:1912.01703,2019.
[11] LAVIN A.maxDNN:An Efficient Convolution Kernel for Deep Learning with Maxwell GPUs[J].arXiv:1501.06633,2015.
[12] HADJIS S,ABUZAID F,ZHANG C,et al.Caffe con Troll:Shallow Ideas to Speed up Deep Learning[C]//Proceedings of the Fourth Workshop on Data Analytics in the Cloud.2015.
[13] VASILACHE N,JOHNSON J,MATHIEU M,et al.Fast Convolutional Nets with fbfft:A GPU Performance Evaluation[C]//International Conference on Learning Representations.2015.
[14] FU H H,LIAO J,YANG J,et al.The Sunway TaihuLight supercomputer:system and applications[J].Science China(Information Sciences),2016,59(7):113-128.
[15] FANG J,FU H,ZHAO W,et al.swDNN:A Library for Accelerating Deep Learning Applications on Sunway TaihuLight[C]//2017 IEEE International Parallel and Distributed Processing Symposium(IPDPS).IEEE,2017.
[16] VASWANI A,SHAZEER N,PARMAR N,et al.Attention is All You Need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems.2017,30:6000-6010.
[17] RADFORD A,NARASIMHAN K,SALIMANS T,et al.Improving language understanding by generative pre-training[EB/OL].[2018-06-11].https://openai.com/blog/language-unsupervised/.
[18] DEVLIN J,CHANG M,LEE K,et al.BERT:Pre-training of Deep Bidirectional Transformers for Language Understanding[C]//North American Chapter of the Association for Computational Linguistics.2018.
[19] RADFORD A,WU J,CHILD R,et al.Language models are unsupervised multitask learners[EB/OL].[2019-02-14].https://openai.com/blog/better-language-models/.
[20] SHOEYBI M,PATWARY M,PURI R,et al.Megatron-LM:Training Multi-Billion Parameter Language Models Using GPU Model Parallelism[J].arXiv:1909.08053,2019.
[21] RAJBHANDARI S,RASLEY J,RUWASE O,et al.ZeRO:Memory Optimizations Toward Training Trillion Parameter Models[J].arXiv:1910.02054v2,2020.
[22] BROWN T B,MANN B,RYDER N,et al.Language Models are Few-Shot Learners[J].arXiv:2005.14165,2020.
[23] FEDUS W,ZOPH B,SHAZEER N,et al.Switch Transformers:Scaling to Trillion Parameter Models with Simple and Efficient Sparsity[J].arXiv:2101.03961,2021.