Computer Science ›› 2024, Vol. 51 ›› Issue (9): 112-120. doi: 10.11896/jsjkx.230900143

• High Performance Computing •


FP8 Quantization and Inference Memory Optimization Based on MLIR

XU Jinlong1,3, GUI Zhonghua2, LI Jia'nan2, LI Yingying3, HAN Lin1   

  1. National Supercomputing Center in Zhengzhou, Zhengzhou University, Zhengzhou 450001, China
    2. School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China
    3. Fourth School, Information Engineering University, Zhengzhou 450001, China
  • Received: 2023-09-25  Revised: 2024-01-27  Online: 2024-09-15  Published: 2024-09-10
  • Corresponding author: HAN Lin (hanlin@zzu.edu.cn)
  • About author: XU Jinlong (longkaizh@163.com), born in 1985, Ph.D, master's supervisor. His main research interests include high-performance computing and parallel compilation.
    HAN Lin, born in 1978, Ph.D, associate professor, senior member of CCF (No.16416M). His main research interests include compiler optimization and high-performance computing.
  • Supported by:
    Major Science and Technology Projects in Henan Province for 2022 (221100210600).


Abstract: With the rapid development of object detection models and large language models, network models are becoming increasingly large. To deploy such models on edge hardware, model quantization is commonly used to compress them. Existing quantization strategies are mainly built on types such as FP16, BF16, and INT8. Among them, 8-bit data types are the most effective at reducing inference memory usage and deployment cost, but INT8 relies on specific calibration algorithms and handles models with large dynamic ranges and many outliers poorly. The FP8 type fits the data distribution of neural networks better and comes in several formats, allowing a flexible trade-off between representation range and precision. However, the current MLIR infrastructure lacks support for FP8 quantization. To this end, an FP8 quantization simulation strategy based on MLIR is proposed, covering the FP8E4M3 and FP8E5M2 formats. By simulating quantization of the operators in a network, the impact of the two formats on model inference accuracy is evaluated. In addition, a memory reuse strategy based on define-use chains is proposed to address memory allocation redundancy in the inference engine, further reducing peak memory usage during model inference. The typical Yolov5s and Resnet50 models are selected for testing, and the results show that, compared with the existing INT8 quantization strategy, the FP8 quantization strategy maintains better model accuracy while not relying on specific calibration algorithms, making deployment simpler. In terms of model accuracy, the test cases reach 55.5% and 77.8% accuracy, respectively, and after memory reuse optimization the peak memory usage is reduced by about 15%~20%.
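To make the quantization simulation concrete, the following NumPy sketch shows one way a fake-quantization routine for the two FP8 formats can look: each FP32 element is rounded to the nearest value representable in FP8E4M3 or FP8E5M2 and saturated at the format's largest finite value (448 and 57344 in the common FP8 definitions), while the tensor itself stays in FP32 so ordinary operators can still consume it. The function name, format table, and rounding policy are illustrative assumptions, not the paper's MLIR implementation.

```python
import numpy as np

# Format parameters for the two FP8 variants: mantissa bits, largest finite
# value, and smallest normal exponent (values per the common FP8 definitions).
FP8_FORMATS = {
    "E4M3": {"mant_bits": 3, "max_val": 448.0,   "min_exp": -6},
    "E5M2": {"mant_bits": 2, "max_val": 57344.0, "min_exp": -14},
}

def fake_quantize_fp8(x, fmt="E4M3"):
    """Round an FP32 tensor to the nearest value representable in the given FP8 format.

    The result is still stored as FP32 ("simulated" quantization), so ordinary
    FP32 operators can consume it while only ever seeing FP8-representable values.
    """
    p = FP8_FORMATS[fmt]
    x = np.asarray(x, dtype=np.float32)
    sign, mag = np.sign(x), np.abs(x)

    # Per-element exponent, clamped at the smallest normal exponent so that
    # smaller magnitudes fall into the fixed-step subnormal range.
    exp = np.floor(np.log2(np.maximum(mag, np.finfo(np.float32).tiny)))
    exp = np.maximum(exp, p["min_exp"])

    # Spacing between representable values in this binade, then round to that grid.
    step = np.exp2(exp - p["mant_bits"])
    q = np.round(mag / step) * step

    # Saturate to the largest finite FP8 value and restore the sign.
    return (sign * np.minimum(q, p["max_val"])).astype(np.float32)


# E4M3 keeps more precision over a narrower range; E5M2 trades precision for range.
w = np.array([0.3, 1.7, 300.0, 70000.0], dtype=np.float32)
print(fake_quantize_fp8(w, "E4M3"))   # [0.3125, 1.75, 288., 448.]
print(fake_quantize_fp8(w, "E5M2"))   # [0.3125, 1.75, 320., 57344.]
```

Passing weights and activations through such a routine before running the FP32 graph approximates the accuracy that real FP8 kernels would deliver, which is the kind of per-format accuracy comparison the abstract describes.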

Key words: Model compression, Deep learning compiler, FP8 quantization, MLIR, Yolov5s model
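The memory reuse pass is described only at a high level in the abstract above. The sketch below illustrates the general idea under our own assumptions: given a topologically ordered operator list with known output sizes, the end of each tensor's define-use chain (its last use) marks the point where its buffer can be recycled for a later tensor instead of allocating fresh memory. The data layout and the name plan_buffer_reuse are hypothetical, not the paper's implementation.

```python
def plan_buffer_reuse(ops):
    """Greedy buffer reuse driven by define-use chains.

    `ops` is a topologically ordered list of dicts:
        {"name": str, "output": str, "out_size": int, "inputs": [str, ...]}
    A tensor's define-use chain ends at the index of its last use; past that
    point its buffer goes to a free list and may serve a later, equal-or-smaller tensor.
    """
    # End of every tensor's define-use chain (index of its last consumer).
    last_use = {}
    for i, op in enumerate(ops):
        for t in op["inputs"]:
            last_use[t] = i

    buffers = {}    # buffer id -> byte size
    free = []       # buffer ids not currently owned by a live tensor
    assign = {}     # tensor name -> buffer id
    next_id = 0

    for i, op in enumerate(ops):
        # 1. Give the op's output a buffer: reuse the smallest free buffer
        #    that fits (best fit), otherwise allocate a new one.
        need = op["out_size"]
        fits = [b for b in free if buffers[b] >= need]
        if fits:
            buf = min(fits, key=lambda b: buffers[b])
            free.remove(buf)
        else:
            buf = next_id
            next_id += 1
            buffers[buf] = need
        assign[op["output"]] = buf

        # 2. Inputs whose define-use chain ends here die now; recycle their buffers.
        #    (The output is assigned first, so an op never aliases its own inputs.)
        for t in set(op["inputs"]):
            if last_use.get(t) == i and t in assign:
                free.append(assign[t])

    peak = sum(buffers.values())   # total footprint of the shared buffer pool
    return assign, peak


# A three-operator chain: without reuse it would need 3 * 1024 bytes of
# activation memory; with reuse, conv2's output takes over conv1's dead buffer.
ops = [
    {"name": "conv1", "output": "t1", "out_size": 1024, "inputs": ["input"]},
    {"name": "relu1", "output": "t2", "out_size": 1024, "inputs": ["t1"]},
    {"name": "conv2", "output": "t3", "out_size": 1024, "inputs": ["t2"]},
]
assign, peak = plan_buffer_reuse(ops)
print(assign, peak)   # t3 shares t1's buffer; peak is 2048 instead of 3072
```

Even this greedy best-fit variant shows why reuse lowers the peak: the three-operator example needs two buffers rather than three, the same mechanism behind the roughly 15%~20% peak-memory reduction reported above.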

CLC Number:

  • TP311