Computer Science, 2025, Vol. 52, Issue (5): 91-100. doi: 10.11896/jsjkx.240800055

• High Performance Computing •

Hardware-Software Co-design Fault-tolerant Strategies for Systolic Array Accelerators

WEI Xiaohui1, GUAN Zeyu1, WANG Chenyang1, YUE Hengshan1, WU Qi1,2   

  1. School of Computer Science and Technology, Jilin University, Changchun 130012, China
  2. High Performance Computing Center, Jilin University, Changchun 130012, China
  • Received: 2024-08-09 Revised: 2025-03-03 Online: 2025-05-15 Published: 2025-05-12
  • About author: WEI Xiaohui, born in 1972, Ph.D., professor, Ph.D. supervisor, is a member of CCF (No.13276D). His main research interests include distributed computing and systems, high-efficiency deep learning systems, cloud computing and big data processing, high-performance computing, approximate computing, advanced computing, and system reliability.
    WU Qi, born in 1976, is a member of CCF (No.O2732M). His main research interests include high-performance computing and related areas.
  • Supported by:
    National Key Research and Development Program of China (2023YFB4502304) and National Natural Science Foundation of China (62302190, 62272190).

Abstract: In recent years, as model inference accuracy has continued to improve, convolutional neural networks (CNNs) have been widely applied in safety-critical fields. To meet the demands of CNNs for real-time, high-performance, and low-power computing, domain-specific CNN accelerators have been proposed. Among these, systolic array architectures are used extensively because of their simple structure and high parallelism. However, factors such as process variation and device aging make systolic arrays prone to stuck-at faults (SAFs), which can lead to catastrophic accidents. Fault-tolerant strategies for systolic arrays are therefore critically important. Existing fault-tolerant strategies, however, suffer from high time and resource costs and from excessive modification of network parameters. To obtain an efficient, lightweight fault-tolerant strategy with low overhead, this paper exploits the inherent fault tolerance of CNNs by relaxing the handling of minor SAFs, thereby reducing the overall fault-tolerance overhead. In addition, by fully considering the computational characteristics of systolic arrays, this paper proposes two hardware-software co-design fault-tolerant strategies: row (column) swapping and weight splitting. These strategies effectively mitigate the impact of SAFs on model inference accuracy. Experimental results show that, compared with the traditional row (column) bypass and selective protection strategies, the proposed hardware-software co-design strategies offer better execution efficiency and better recovery of model accuracy.

Key words: Convolutional neural networks, Fault-tolerant design, Stuck-At faults, Systolic arrays, CNN accelerators

CLC Number: 

  • TP183
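
The abstract distinguishes minor SAFs, whose handling can be relaxed, from those that need mitigation. The sketch below is an illustration of that distinction only, not the method proposed in the paper: it assumes an 8-bit two's-complement weight register (an assumption; the abstract does not state the data format), and the helper names `apply_stuck_at` and `deviation` are hypothetical.

```python
def apply_stuck_at(value: int, bit: int, stuck_to: int, width: int = 8) -> int:
    """Return the signed value read back when one bit of the register is stuck.

    Assumes a `width`-bit two's-complement register (illustrative assumption).
    """
    mask = (1 << width) - 1
    raw = value & mask                      # encode as two's complement
    if stuck_to:
        raw |= 1 << bit                     # stuck-at-1 forces the bit high
    else:
        raw &= ~(1 << bit) & mask           # stuck-at-0 forces the bit low
    if raw >= 1 << (width - 1):             # decode back to a signed integer
        raw -= 1 << width
    return raw


def deviation(value: int, bit: int, stuck_to: int, width: int = 8) -> int:
    """Absolute error that the stuck bit introduces into the stored value."""
    return abs(apply_stuck_at(value, bit, stuck_to, width) - value)


if __name__ == "__main__":
    weight = 23                             # 0b00010111
    for bit in (0, 3, 7):
        faulty = apply_stuck_at(weight, bit, 1)
        print(f"bit {bit} stuck at 1: {weight} -> {faulty} "
              f"(error {deviation(weight, bit, 1)})")
```

Under these assumptions, a fault in a low-order bit changes the stored value slightly or not at all, while a fault in the sign bit changes it by 128, which suggests why only some SAFs would need explicit handling.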