Computer Science ›› 2025, Vol. 52 ›› Issue (5): 91-100. doi: 10.11896/jsjkx.240800055

• High Performance Computing •

  • Corresponding author: WU Qi (wuqi@jlu.edu.cn)
  • First author: WEI Xiaohui (weixh@jlu.edu.cn)

Hardware-Software Co-design Fault-tolerant Strategies for Systolic Array Accelerators

WEI Xiaohui1, GUAN Zeyu1, WANG Chenyang1, YUE Hengshan1, WU Qi1,2   

  1 School of Computer Science and Technology,Jilin University,Changchun 130012,China
    2 High Performance Computing Center,Jilin University,Changchun 130012,China
  • Received:2024-08-09 Revised:2025-03-03 Online:2025-05-15 Published:2025-05-12
  • About author:WEI Xiaohui,born in 1972,Ph.D,professor,Ph.D supervisor,is a member of CCF(No.13276D).His main research interests include distributed computing and systems,high-efficiency deep learning systems,cloud computing and big data processing,high-performance computing,approximate computing,advanced computing and system reliability.
    WU Qi,born in 1976,is a member of CCF(No.O2732M).His main research interests include high-performance computing.
  • Supported by:
    National Key Research and Development Program of China(2023YFB4502304) and National Natural Science Foundation of China(62302190,62272190).


Abstract: In recent years,with the continuous improvement in model inference accuracy,convolutional neural networks(CNNs) have been widely applied in safety-critical fields.To meet the demands of CNNs for real-time,high-performance,and low-power computing,domain-specific CNN accelerators have been proposed.Among these,systolic array architectures have been extensively used due to their simple structure and high parallelism.However,factors such as process variations and device aging make systolic arrays prone to Stuck-At faults(SAFs),which can lead to catastrophic accidents.Therefore,fault-tolerant strategies for systolic arrays are critically important.Existing fault-tolerant strategies,however,suffer from high time and resource costs,as well as excessive modifications to network parameters.To achieve an efficient and low-overhead lightweight fault-tolerant strategy,this paper exploits the inherent fault tolerance of CNNs by relaxing the handling of SAFs with minor impact,thereby reducing overall fault-tolerance overhead.Additionally,by fully considering the computational characteristics of systolic arrays,this paper proposes two hardware-software co-design fault-tolerant strategies:row(column) swapping and weight splitting.These strategies effectively mitigate the impact of SAFs on model inference accuracy.Experimental results show that,compared with traditional row(column) bypass and selective protection strategies,the proposed hardware-software co-design fault-tolerant strategies offer superior execution efficiency and model accuracy recovery.
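As a rough illustration of the row (column) swapping idea described in the abstract, the sketch below models a weight-stationary systolic array in which one physical row of PEs holds weight registers stuck at 0, and remaps the weight rows so that the logical row with the smallest L1 norm (a crude sensitivity proxy) lands on the faulty hardware row. The function names, the whole-row stuck-at-0 fault model, and the L1-norm sensitivity heuristic are all illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def faulty_matmul(W, X, faulty_row, stuck_value=0.0):
    """Model a weight-stationary systolic array where every PE in one
    physical row holds a weight stuck at `stuck_value` (a Stuck-At fault).
    Hypothetical fault model: the entire row's weights are corrupted."""
    Wf = W.copy()
    Wf[faulty_row, :] = stuck_value  # the SAF overrides the stored weights
    return Wf @ X

def row_swap_matmul(W, X, faulty_row, stuck_value=0.0):
    """Row-swapping sketch: map the least influential logical weight row
    (smallest L1 norm) onto the faulty physical row, run the faulty array,
    then undo the permutation on the outputs."""
    n = W.shape[0]
    least = int(np.argmin(np.abs(W).sum(axis=1)))  # least influential row
    perm = np.arange(n)  # perm[physical row] = logical row mapped onto it
    perm[faulty_row], perm[least] = perm[least], perm[faulty_row]
    y_phys = faulty_matmul(W[perm], X, faulty_row, stuck_value)
    y = np.empty_like(y_phys)
    y[perm] = y_phys  # physical row i produced logical output row perm[i]
    return y

# Toy demo: one high-magnitude weight row, one near-zero row.
W = np.array([[10.0, 10.0], [0.1, 0.1]])
X = np.eye(2)
exact = W @ X
err_faulty = np.abs(exact - faulty_matmul(W, X, faulty_row=0)).sum()   # 20.0
err_swap = np.abs(exact - row_swap_matmul(W, X, faulty_row=0)).sum()   # 0.2
```

In the demo, the fault hits physical row 0; without mitigation it wipes out the high-magnitude logical row, while the swap diverts the damage to the near-zero row, shrinking the output error by two orders of magnitude. The paper's actual strategies also account for scheduling and weight splitting, which this sketch omits.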

Key words: Convolutional neural networks, Fault-tolerant design, Stuck-At faults, Systolic arrays, CNN accelerators

CLC Number: TP183