Computer Science, 2024, Vol. 51, Issue (9): 112-120. doi: 10.11896/jsjkx.230900143
XU Jinlong1,3, GUI Zhonghua2, LI Jia'nan2, LI Yingying3, HAN Lin1
Abstract: With the rapid development of object detection models and large language models, network models are growing ever larger. To ease model deployment on edge hardware, model quantization is commonly used to compress models. Existing quantization strategies are mainly built on types such as FP16, BF16, and INT8. Among these, 8-bit data types are the most effective at reducing inference memory footprint and deployment overhead, but INT8 depends on specific calibration algorithms and handles models with large dynamic ranges and many outliers poorly. The FP8 type fits the data distributions found in neural networks better and comes in several formats, allowing flexible trade-offs between representable range and precision. However, the current MLIR system lacks support for FP8 quantization. This paper therefore proposes an FP8 quantization simulation strategy for the MLIR system, covering the FP8E4M3 and FP8E5M2 formats; by simulating quantization on the operators in a network, it evaluates the impact of the two FP8 formats on model inference accuracy. In addition, to address redundant memory allocation in the inference engine, a memory reuse strategy based on def-use chains is proposed, further reducing the peak memory footprint during model inference. Experiments on the representative Yolov5s and Resnet50 models show that, compared with the existing INT8 quantization strategy, the FP8 strategy preserves model accuracy better while requiring no specific calibration algorithm, making deployment simpler. The test cases reach 55.5% and 77.8% accuracy, respectively, and with memory reuse optimization the peak memory footprint drops by roughly 15%~20%.
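The two FP8 formats trade exponent bits against mantissa bits: E4M3 offers finer precision over a narrower range, while E5M2 covers a wider range at coarser precision. Below is a minimal sketch of the kind of quantization simulation the abstract describes, assuming the widely used FP8 conventions (E4M3: 4 exponent / 3 mantissa bits, bias 7, largest finite value 448; E5M2: 5 exponent / 2 mantissa bits, bias 15, largest finite value 57344). The function names are illustrative, not the paper's actual MLIR implementation.

```python
import numpy as np

def fp8_simulate(x, man_bits, exp_bias, max_val):
    """Fake-quantize an FP32 array onto an FP8 grid: values stay in FP32
    but are rounded to the nearest representable FP8 value, then
    saturated to the largest finite FP8 number."""
    x = np.asarray(x, dtype=np.float32)
    sign, mag = np.sign(x), np.abs(x)
    min_exp = 1 - exp_bias                      # smallest normal exponent
    # Clamp into the subnormal range so log2 never sees zero; magnitudes
    # below 2**min_exp then quantize on the (flat) subnormal grid.
    exp = np.floor(np.log2(np.maximum(mag, 2.0 ** min_exp)))
    step = 2.0 ** (exp - man_bits)              # grid spacing in this binade
    q = np.minimum(np.round(mag / step) * step, max_val)
    return (sign * q).astype(np.float32)

# The two formats studied in the paper, under the common conventions.
def fp8_e4m3(x): return fp8_simulate(x, man_bits=3, exp_bias=7,  max_val=448.0)
def fp8_e5m2(x): return fp8_simulate(x, man_bits=2, exp_bias=15, max_val=57344.0)
```

Applying such fake quantization to each operator's weights and activations makes it possible to estimate FP8 accuracy loss on ordinary FP32 hardware, which is what lets a simulation strategy be evaluated without FP8-capable devices.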
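The def-use-chain memory reuse the abstract mentions can be pictured as a liveness-driven buffer allocator: a tensor's buffer becomes reusable once the last operator on its def-use chain has executed. A minimal sketch under simplifying assumptions (uniform buffer sizes, no in-place ops, graph outputs treated like any other tensor); the names are hypothetical and this is not the paper's code:

```python
def plan_buffers(ops):
    """ops: list of (result, operands) pairs in topological order.
    A tensor's lifetime ends at the last operator on its def-use chain;
    after that point its buffer goes on a free list and is handed to a
    later result instead of triggering a fresh allocation."""
    last_use = {}
    for step, (result, operands) in enumerate(ops):
        for t in operands:
            last_use[t] = step
    free_list, assignment, num_buffers = [], {}, 0
    for step, (result, operands) in enumerate(ops):
        if free_list:                    # reuse a dead tensor's buffer
            assignment[result] = free_list.pop()
        else:                            # no dead buffer: allocate anew
            assignment[result] = num_buffers
            num_buffers += 1
        for t in operands:               # release buffers whose def-use
            if last_use.get(t) == step and t in assignment:
                free_list.append(assignment[t])
    return assignment, num_buffers

# add1 reuses relu1's buffer, so four results fit in three buffers:
ops = [("conv1", ["in"]), ("relu1", ["conv1"]),
       ("conv2", ["relu1"]), ("add1", ["conv2", "conv1"])]
print(plan_buffers(ops))
```

Keeping the allocation count equal to the maximum number of simultaneously live tensors, rather than the total number of results, is what lowers the peak memory footprint during inference.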