Computer Science ›› 2026, Vol. 53 ›› Issue (6): 193-202. doi: 10.11896/jsjkx.251000093

• High Performance Computing •

High-performance Image Preprocessing Operators for Cambricon MLU Accelerator Card

LI Fei1, LIU Song1, GUO Songjian1, LIU Jiazheng1, ZHANG Ying1, HONG Longwei2, ZHANG Boxuan2   

  1 School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an 710049, China
  2 School of Software Engineering, Xi'an Jiaotong University, Xi'an 710049, China
  • Received: 2025-10-23 Revised: 2025-12-22 Online: 2026-06-15 Published: 2026-06-09
  • About author: LI Fei, born in 1999, Ph.D. candidate, is a member of CCF (No. L1866G). His main research interests include high performance computing and parallel algorithms.
    LIU Song, born in 1987, Ph.D., associate professor, Ph.D. supervisor, is a member of CCF (No. A0055M). His main research interests include parallel computing and code optimization.
  • Supported by:
    National Key R&D Program of China (2022YFB4501604).

Abstract: Image preprocessing is a critical component of machine learning tasks, and its computational efficiency directly affects the performance of model training and inference. CPU-based computation struggles to meet the real-time processing demands of high-resolution images and large-scale datasets, whereas high-performance Neural Processing Units (NPUs) are a natural platform for accelerating image preprocessing. However, image preprocessing operations are diverse and memory-access intensive, which does not fully align with the matrix-operation-oriented optimization of NPUs and makes adapting them challenging. Targeting a heterogeneous computing platform built on a Phytium CPU and a Cambricon MLU, this paper proposes an efficient acceleration method for image preprocessing operators. Based on a thorough analysis of the MLU's multi-core parallel architecture and storage system, the computational logic and task partitioning of the operators are redesigned. Combined with optimization strategies such as multi-core parallel scheduling, vectorized computation, a three-level storage structure, and a double buffering mechanism, this significantly improves operator performance and demonstrates the potential of NPUs for accelerating image preprocessing. Experimental results show that the ten optimized common image preprocessing operators achieve performance improvements of 33.22% to 234.46% over the native operators in PaddlePaddle. In deep learning and traditional machine learning tasks, end-to-end performance improvements of 12.47% and 48.63% are achieved, respectively.
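
As an illustration only, the following sketch shows how two of the strategies named in the abstract, multi-core task partitioning and double buffering, could be combined for a simple normalization operator. It is not the authors' implementation and does not use the Cambricon BANG C API; it is plain Python, and the function names (`split_evenly`, `normalize_core`, `normalize`), the tile size, the core count, and the mean/std values are all hypothetical. On real hardware the "load next tile" step would be an asynchronous copy that overlaps with computation; here it is sequential and only models the ping-pong buffer bookkeeping.

```python
from typing import List, Tuple


def split_evenly(total: int, parts: int) -> List[Tuple[int, int]]:
    """Return (start, end) ranges spreading `total` items over `parts` workers."""
    base, extra = divmod(total, parts)
    ranges, start = [], 0
    for p in range(parts):
        # The first `extra` workers take one additional item each.
        size = base + (1 if p < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def normalize_core(pixels, start, end, tile, mean, std, out):
    """Process one (hypothetical) core's slice tile by tile with two buffers."""
    if start >= end:
        return
    buffers = [None, None]
    # Prologue: load the first tile into buffer 0.
    buffers[0] = pixels[start:min(start + tile, end)]
    pos, i = start, 0
    while pos < end:
        nxt = pos + tile
        if nxt < end:
            # Load the next tile into the *other* buffer. On an accelerator
            # this copy would run concurrently with the compute step below.
            buffers[(i + 1) % 2] = pixels[nxt:min(nxt + tile, end)]
        cur = buffers[i % 2]
        # Compute on the current tile and store the result.
        out[pos:pos + len(cur)] = [(v - mean) / std for v in cur]
        pos, i = nxt, i + 1


def normalize(pixels, num_cores=4, tile=8, mean=127.5, std=127.5):
    """Normalize a flat pixel list, partitioned across `num_cores` slices."""
    out = [0.0] * len(pixels)
    for start, end in split_evenly(len(pixels), num_cores):
        normalize_core(pixels, start, end, tile, mean, std, out)
    return out
```

For example, `normalize(list(range(37)), num_cores=4, tile=5)` gives the same values as applying `(v - 127.5) / 127.5` to each element directly; the partitioning and buffering change only the order and granularity of the work, not the result.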

Key words: Image preprocessing, Cambricon MLU, High performance computing, Parallel computing, Performance optimization

CLC Number: 

  • TP338.6