Computer Science (计算机科学), 2022, Vol. 49, Issue 5: 355-362. doi: 10.11896/jsjkx.210500226
GAO Jie1, LIU Sha2, HUANG Ze-qiang2, ZHENG Tian-yu3, LIU Xin2, QI Feng-bin2
Abstract: Operator acceleration libraries tailored to different hardware devices have become an indispensable part of deep learning frameworks, delivering severalfold performance speedups for large-scale training and inference tasks. Current mainstream operator libraries are developed for GPU architectures and are incompatible with other heterogeneous designs. The SWDNN operator library, developed for the Sunway 26010 processor, can neither fully exploit the performance of the upgraded Sunway 26010 pro processor nor satisfy the large memory capacity and high memory-access bandwidth demanded by today's large neural network models such as GPT-3. Targeting the architectural characteristics of the Sunway 26010 pro processor and the training requirements of large neural network models, this paper proposes a three-level parallelism scheme based on multiple core groups together with a task scheduling scheme for neural network operators, which satisfies the memory requirements of large-model training while improving parallel efficiency and overall computational performance. It further proposes a three-level asynchronous pipelining mechanism and a memory-access optimization method that overlaps computation with memory access, significantly alleviating the memory-access bottleneck of neural network operators. Based on these methods, the SWTensor multi-core-group operator acceleration library is built for the Sunway 26010 pro processor. Experiments on the natural language processing model GPT-2 show that typical compute-intensive operators and memory-access-intensive operators reach 90.4% of the theoretical peak in single-precision floating-point performance and 88.7% of the theoretical peak in memory-access bandwidth, respectively.
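The compute-memory overlap mentioned in the abstract is, in general, realized with double buffering: while one local buffer is being computed on, the next tile is fetched into the other buffer. The sketch below is purely illustrative (it is not the SWTensor implementation, and `tiled_sum_double_buffered` is a hypothetical name); a plain Python copy stands in for the asynchronous DMA transfer that a Sunway core group would issue.

```python
def tiled_sum_double_buffered(data, tile=4):
    """Sum `data` tile by tile using two alternating local buffers.

    The 'fetch' of the next tile (a list copy here, an async DMA in a
    real accelerator kernel) is issued before the current tile is
    consumed, which is the point where fetch and compute would overlap.
    """
    n = len(data)
    buffers = [None, None]
    # Prefetch the first tile into buffer 0 before the loop starts.
    buffers[0] = list(data[0:min(tile, n)])
    total = 0
    i, cur = 0, 0
    while i < n:
        nxt = i + tile
        # Issue the 'fetch' of the next tile into the other buffer
        # while the current buffer is still unread (overlap point).
        if nxt < n:
            buffers[1 - cur] = list(data[nxt:nxt + tile])
        total += sum(buffers[cur])  # compute on the current tile
        i = nxt
        cur = 1 - cur               # swap the roles of the two buffers
    return total
```

In an actual three-level pipeline, the fetch, compute, and write-back stages would each run asynchronously, with synchronization only at buffer-swap points.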