Computer Science ›› 2024, Vol. 51 ›› Issue (9): 112-120. doi: 10.11896/jsjkx.230900143

• High Performance Computing •

FP8 Quantization and Inference Memory Optimization Based on MLIR

XU Jinlong1,3, GUI Zhonghua2, LI Jia'nan2, LI Yingying3, HAN Lin1   

  1. National Supercomputing Center in Zhengzhou, Zhengzhou University, Zhengzhou 450001, China
    2. School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China
    3. Fourth School, Information Engineering University, Zhengzhou 450001, China
  • Received: 2023-09-25  Revised: 2024-01-27  Online: 2024-09-15  Published: 2024-09-10
  • About author: XU Jinlong, born in 1985, Ph.D, master's supervisor. His main research interests include high-performance computing and parallel compilation.
    HAN Lin, born in 1978, Ph.D, associate professor, is a senior member of CCF (No.16416M). His main research interests include compiler optimization and high-performance computing.
  • Supported by:
    Major Science and Technology Projects in Henan Province for 2022 (221100210600).

Abstract: With the development of object detection models and language models, network models are becoming increasingly large. To deploy models on end-side hardware more conveniently, model quantization is usually applied to compress them. Existing quantization strategies are mainly built on FP16, BF16, INT8, and other data types. Among them, 8-bit types bring the largest reduction in inference memory usage and deployment cost, but the INT8 type depends on specific calibration algorithms and handles models with large dynamic ranges and many outliers poorly. The FP8 type fits the data distribution of neural networks better and offers multiple formats whose dynamic range and precision can be flexibly traded off. However, MLIR currently lacks support for FP8 quantization. To this end, an FP8 quantization simulation strategy based on MLIR is proposed, covering the two formats FP8E4M3 and FP8E5M2. By quantizing and simulating the operators in the network, the impact of the two formats on model inference accuracy is evaluated. A memory reuse strategy based on define-use chains is further proposed to address redundant memory allocation in inference engines, reducing the peak memory usage during model inference. The typical YOLOv5s and ResNet50 models are selected for testing and verification. The results show that, compared with the existing INT8 quantization strategy, the FP8 quantization strategy maintains better model accuracy and does not depend on specific calibration algorithms, making deployment more convenient. In terms of model accuracy, the two test cases achieve 55.5% and 77.8%, respectively. After memory reuse optimization, peak memory usage is reduced by about 15% to 20%.
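The two sketches below illustrate the ideas summarized in the abstract. They are minimal Python examples written for this page, not the paper's MLIR implementation; all function names, data structures, and policies in them are illustrative assumptions.

FP8 quantization simulation. The strategy evaluates FP8E4M3 and FP8E5M2 by quantizing and simulating operators. A minimal NumPy sketch of such fake quantization is shown here, assuming the common E4M3/E5M2 format parameters (3 or 2 mantissa bits, maximum magnitudes 448 and 57344): each FP32 value is rounded to the nearest FP8-representable value and returned as FP32 again, so accuracy can be measured without FP8 hardware.

import numpy as np

# Format parameters of the two FP8 variants (common E4M3/E5M2 convention):
#   E4M3: 3 mantissa bits, minimum normal exponent -6,  max normal 448
#   E5M2: 2 mantissa bits, minimum normal exponent -14, max normal 57344
FP8_FORMATS = {
    "E4M3": {"mant_bits": 3, "exp_min": -6,  "max_val": 448.0},
    "E5M2": {"mant_bits": 2, "exp_min": -14, "max_val": 57344.0},
}

def fp8_fake_quant(x, fmt="E4M3"):
    """Round an FP32 array to the nearest FP8-representable value,
    then return it as FP32 (simulated quantization)."""
    p = FP8_FORMATS[fmt]
    x = np.asarray(x, dtype=np.float32)
    sign = np.sign(x)
    mag = np.abs(x)
    # Per-element exponent, clamped so that small values fall onto the
    # subnormal grid of the format.
    exp = np.floor(np.log2(np.maximum(mag, np.finfo(np.float32).tiny)))
    exp = np.maximum(exp, p["exp_min"])
    # Spacing between adjacent representable values at this exponent.
    step = np.exp2(exp - p["mant_bits"])
    q = np.round(mag / step) * step
    # Saturate to the largest representable magnitude.
    q = np.minimum(q, p["max_val"])
    return (sign * q).astype(np.float32)

if __name__ == "__main__":
    w = np.array([0.013, -1.7, 3.14159, 500.0], dtype=np.float32)
    print(fp8_fake_quant(w, "E4M3"))   # 500.0 saturates to 448
    print(fp8_fake_quant(w, "E5M2"))   # coarser mantissa, wider range

Memory reuse based on define-use chains. The abstract describes releasing and reusing buffers according to define-use information to lower peak memory during inference. A minimal sketch of one such planner is given next, assuming a simplified single-output op representation and a greedy tightest-fit policy (both hypothetical): a buffer becomes reusable after the op that reads it last, and a later output is placed into a freed slot that is large enough instead of allocating a new one.

from collections import namedtuple

# Hypothetical, simplified IR: each op defines one output buffer of a
# known size and reads a list of previously defined buffers.
Op = namedtuple("Op", ["name", "out", "out_size", "ins"])

def assign_buffers(ops):
    """Greedy buffer reuse driven by define-use chains."""
    # Index of the last op that reads each buffer.
    last_use = {}
    for i, op in enumerate(ops):
        for b in op.ins:
            last_use[b] = i

    slots = []          # slots[k] = size of physical slot k
    free = []           # indices of slots currently unused
    placement = {}      # logical buffer name -> slot index

    for i, op in enumerate(ops):
        # Reuse the tightest-fitting freed slot, if any is large enough.
        candidates = [k for k in free if slots[k] >= op.out_size]
        if candidates:
            k = min(candidates, key=lambda k: slots[k])
            free.remove(k)
        else:
            slots.append(op.out_size)
            k = len(slots) - 1
        placement[op.out] = k
        # Buffers read for the last time by this op become reusable.
        for b in op.ins:
            if last_use[b] == i and b in placement:
                free.append(placement[b])
    return placement, sum(slots)

# Toy chain of ops: without reuse the intermediates need 208 units;
# with define-use-chain reuse, t1/t3 and t2/t4 share slots (128 units).
net = [
    Op("conv1", "t1", 64, ["input"]),
    Op("relu1", "t2", 64, ["t1"]),
    Op("conv2", "t3", 64, ["t2"]),
    Op("out",   "t4", 16, ["t3"]),
]
placement, total = assign_buffers(net)
print(placement, "total size:", total)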

Key words: Model compression, Deep learning compiler, FP8 quantization, MLIR, YOLOv5s model

CLC Number: TP311