Computer Science ›› 2022, Vol. 49 ›› Issue (5): 355-362.doi: 10.11896/jsjkx.210500226

• Interdiscipline & Frontier •

Deep Neural Network Operator Acceleration Library Optimization Based on Domestic Many-core Processor

GAO Jie1, LIU Sha2, HUANG Ze-qiang2, ZHENG Tian-yu3, LIU Xin2, QI Feng-bin2   

  1 Cyberspace Security Academy,Information Engineering University,Zhengzhou 450000,China
    2 Jiangnan Institute of Computing Technology,Wuxi,Jiangsu 214083,China
    3 School of Software,Shandong University,Jinan 250101,China
  • Received:2021-05-31 Revised:2021-06-27 Online:2022-05-15 Published:2022-05-06
  • About author:GAO Jie,born in 1997,postgraduate.His main research interests include high performance computing and deep learning.
    LIU Xin,born in 1979,Ph.D,Ph.D supervisor,is a member of China Computer Federation.Her main research interests include parallel algorithms and parallel application software.
  • Supported by:
    National Natural Science Foundation of China(U1806205).

Abstract: Operator acceleration libraries tailored to different hardware devices have become an indispensable part of deep learning frameworks, since they can dramatically improve the performance of large-scale training and inference tasks. The current mainstream operator libraries are all developed for the GPU architecture and are incompatible with other heterogeneous designs. The SWDNN operator library, developed for the SW26010 processor, can neither exploit the full performance of the upgraded SW26010 pro processor nor satisfy the demands of current large neural network models, such as GPT-3, for large memory capacity and high memory access bandwidth. According to the architectural characteristics of the SW26010 pro processor and the training requirements of large neural network models, a three-level parallelization and neural network operator task scheduling scheme based on multiple core groups is proposed, which satisfies the memory requirements of large-model training and improves overall computing performance and parallel efficiency. A memory access optimization method with triple asynchronous flows and overlapping of computation and memory access is also proposed, which significantly alleviates the memory access bottleneck of neural network operators. Based on these methods, this paper constructs the SWTensor many-core-group operator acceleration library for the SW26010 pro processor. Experimental results on the natural language processing model GPT-2 show that computation-intensive operators and memory-access-intensive operators in the SWTensor library reach up to 90.4% and 88.7% of the theoretical peaks of single-precision floating-point computing performance and memory access bandwidth, respectively.
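The overlap of computation and memory access described in the abstract is, at its core, the classic double-buffering pattern: while the current tile of data is processed out of fast local memory, the next tile is fetched asynchronously from main memory. The plain-C sketch below is a minimal illustration of that pattern only; dma_get_async(), dma_wait() and compute_tile() are hypothetical placeholders (backed here by synchronous stubs so the sketch compiles), not the SW26010 pro's actual DMA or kernel interfaces, which the abstract does not describe.

    /* Minimal double-buffering sketch in plain C.  dma_get_async(), dma_wait() and
     * compute_tile() are hypothetical placeholders; on real hardware they would map
     * to the processor's asynchronous DMA engine and operator kernels. */
    #include <stddef.h>
    #include <string.h>

    #define TILE 1024

    typedef struct { int pending; } dma_handle_t;

    /* Stub: a real implementation would launch an asynchronous DMA transfer. */
    static void dma_get_async(float *dst, const float *src, size_t n, dma_handle_t *h) {
        memcpy(dst, src, n * sizeof(float));   /* synchronous stand-in */
        h->pending = 1;
    }

    /* Stub: a real implementation would block until the transfer completes. */
    static void dma_wait(dma_handle_t *h) { h->pending = 0; }

    /* Stub: stands in for a neural network operator kernel applied to one tile. */
    static void compute_tile(float *tile, size_t n) {
        for (size_t i = 0; i < n; i++) tile[i] *= 2.0f;
    }

    /* Double-buffered loop: while tile i is computed from one local buffer,
     * tile i+1 is fetched into the other, hiding memory latency behind computation. */
    void process_stream(const float *global_mem, size_t num_tiles) {
        static float buf[2][TILE];              /* two local (scratchpad) buffers */
        dma_handle_t h[2] = {{0}, {0}};

        dma_get_async(buf[0], global_mem, TILE, &h[0]);   /* prefetch first tile */
        for (size_t i = 0; i < num_tiles; i++) {
            size_t cur = i & 1, nxt = cur ^ 1;
            if (i + 1 < num_tiles)                        /* start next transfer early */
                dma_get_async(buf[nxt], global_mem + (i + 1) * TILE, TILE, &h[nxt]);
            dma_wait(&h[cur]);                            /* ensure current tile has arrived */
            compute_tile(buf[cur], TILE);                 /* overlaps with the next transfer */
        }
    }

With a genuinely asynchronous transfer, the per-tile cost of this loop approaches max(compute time, transfer time) rather than their sum, which is the effect the paper's triple-asynchronous-flow scheme aims to achieve at larger scale.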

Key words: Asynchronous flow, Deep neural network, Double-buffering, Load balancing, Operator acceleration library

CLC Number: TP311
[1]HUANG G,LIU Z,LAURENS V,et al.Densely Connected Convolutional Networks[C]//IEEE Computer Society.2016.
[2]BROWN T B,MANN B,RYDER N,et al.Language Models are Few-Shot Learners[C]//Conference and Workshop on Neural Information Processing Systems.2020.
[3]NAUMOV M,MUDIGERE D,SHI H,et al.Deep Learning Recommendation Model for Personalization and Recommendation Systems[J].arXiv:1906.00091,2019.
[4]KOSSMANN J,SCHLOSSER R.Self-Driving Database Systems:A Conceptual Approach[J].Distributed and Parallel Databases,2020,38:1-2.
[5]YE D,CHEN G,ZHANG W,et al.Towards Playing Full MOBA Games with Deep Reinforcement Learning[C]//Conference and Workshop on Neural Information Processing Systems.2020.
[6]HEO L,FEIG M.High-Accuracy Protein Structures By Combining Machine-Learning With Physics-Based Refinement[J].Proteins-Structure Function and Bioinformatics,2020,5:637-642.
[7]CHETLUR S,WOOLLEY C,VANDERMERSCH P,et al.cuDNN:Efficient Primitives for Deep Learning[C]//Deep Learning and Representation Learning Workshop (NIPS 2014).2014.
[8]JIA Y,SHELHAMER E,DONAHUE J,et al.Caffe:Convolutional Architecture for Fast Feature Embedding[C]//Proceedings of the 22nd ACM International Conference on Multimedia(MM'14).2014.
[9]ABADI M,AGARWAL A,BARHAM P,et al.TensorFlow:Large-Scale Machine Learning on Heterogeneous Distributed Systems[J].arXiv:1603.04467,2016.
[10]PASZKE A,GROSS S,MASSA F,et al.PyTorch:An Imperative Style,High-Performance Deep Learning Library[J].arXiv:1912.01703,2019.
[11]LAVIN A.maxDNN:An Efficient Convolution Kernel for Deep Learning with Maxwell GPUs[J].arXiv:1501.06633,2015.
[12]STEFAN H,FIRAS A,CE Z,et al.Caffe con Troll:Shallow ideas to speed up deep learning[C]//Proceedings of the Fourth Workshop on Data Analytics in the Cloud.2015.
[13]VASILACHE N,JOHNSON J,MATHIEU M,et al.Fast Convolutional Nets With fbfft:A GPU Performance Evaluation[C]//International Conference on Learning Representations.2015.
[14]FU H H,LIAO J,YANG J,et al.The Sunway TaihuLight supercomputer:system and applications[J].Science China(Information Sciences),2016,59(7):113-128.
[15]FANG J,FU H,ZHAO W,et al.swDNN:A Library for Accelerating Deep Learning Applications on Sunway TaihuLight[C]//2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS).IEEE,2017.
[16]VASWANI A,SHAZEER N,PARMAR N,et al.Attention is All You Need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems.2017,30:6000-6010.
[17]RADFORD A,NARASIMHAN K,SALIMANS T,et al.Improving language understanding by generative pre-training[EB/OL].[2018-06-11].https://openai.com/blog/language-unsupervised/.
[18]DEVLIN J,CHANG M,LEE K,et al.BERT:Pre-training of Deep Bidirectional Transformers for Language Understanding[C]//North American Chapter of the Association for Computational Linguistics.2018.
[19]RADFORD A,WU J,CHILD R,et al.Language models are unsupervised multitask learners[EB/OL].[2019-02-14].https://openai.com/blog/better-language-models/.
[20]MOHAMMAD S,MOSTOFA P,RAUL P,et al.Megatron-LM:Training Multi-Billion Parameter Language Models Using GPU Model Parallelism[J].arXiv:1909.08053,2019.
[21]RAJBHANDARI S,RASLEY J,RUWASE O,et al.ZeRO:Memory Optimizations Toward Training Trillion Parameter Models[J].arXiv:1910.02054v2,2020.
[22]BROWN T B,MANN B,RYDER N,et al.Language Models are Few-Shot Learners[J].arXiv:2005.14165,2020.
[23]FEDUS W,ZOPH B,SHAZEER N,et al.Switch Transformers:Scaling to Trillion Parameter Models with Simple and Efficient Sparsity[J].arXiv:2101.03961,2021.