Computer Science, 2025, Vol. 52, Issue (11A): 241000041-6. doi: 10.11896/jsjkx.241000041

• Artificial Intelligence •

Research on Structured Pruning Algorithm Based on Information Fusion

HUANG Haixin1, XU Chenglong1, FU Yao2

  1 School of Automation and Electrical Engineering, Shenyang Ligong University, Shenyang 110159, China
  2 School of Information Science and Engineering, Shenyang Ligong University, Shenyang 110159, China
  • Online: 2025-11-15  Published: 2025-11-10
  • Corresponding author: HUANG Haixin (huanghaixin@sylu.edu.cn)

Abstract: Existing large language models, after being processed by pruning algorithms, often suffer from high perplexity (PPL) in zero-shot evaluation, low text-generation accuracy, and slow inference. To address these problems, this paper proposes a pruning metric algorithm centered on the joint use of loss and magnitude information (Loss And Magnitude, LAM). During weight-importance estimation, LAM fuses loss-function information with weight and activation information, eliminating the limitation introduced when the second-order derivative is omitted, for computational efficiency, from the Taylor expansion of the gradient information. This improves the accuracy and robustness of the pruning process and enhances the generality of the pruning algorithm. For building coupling structures, a unidirectional coupling structure is proposed: neurons in the multi-layer perceptron (MLP) of a Transformer block are selected as initial triggers, and neurons are activated only in the direction of the attention layer and the query, key, and value projection layers. This reduces the number of parameters required to identify coupled structure groups and improves pruning speed and throughput. Zero-shot experiments on the WikiText2 and PTB datasets show that, at a pruning rate of 25%, the pruned LLaMA-7B achieves PPL scores of 20.24 and 36.05 respectively, significantly lower than those of other pruning algorithms, and the pruned Vicuna-7B achieves PPL scores of 21.24 and 85.81, also outperforming other pruning algorithms, demonstrating higher generality and accuracy.
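The exact LAM formula is given in the full paper, not on this page. Purely as an illustration of the kind of fused importance score the abstract describes (a first-order Taylor, i.e. loss-gradient, term combined with weight-magnitude and activation information), a minimal PyTorch-style sketch is shown below; the function name fused_importance and the mixing weights alpha and beta are hypothetical, not taken from the paper.

```python
import torch

def fused_importance(weight: torch.Tensor,
                     grad: torch.Tensor,
                     activation_norm: torch.Tensor,
                     alpha: float = 0.5,
                     beta: float = 0.5) -> torch.Tensor:
    """Hypothetical per-output-channel importance score.

    Fuses a first-order Taylor term |w * dL/dw| (loss information) with a
    weight-magnitude term scaled by an observed activation norm, in the
    spirit of the loss-and-magnitude fusion described in the abstract.
    This is an illustrative sketch, not the paper's exact LAM metric.
    """
    # First-order Taylor estimate of the loss change caused by removing each
    # output channel (row of the weight matrix).
    taylor_term = (weight * grad).abs().sum(dim=1)
    # Magnitude/activation term for the same channels.
    magnitude_term = weight.norm(p=2, dim=1) * activation_norm
    return alpha * taylor_term + beta * magnitude_term

# Toy usage: rank the output channels of a random "linear layer" for pruning.
w = torch.randn(8, 16, requires_grad=True)   # (out_features, in_features)
loss = (w @ torch.randn(16)).pow(2).mean()   # stand-in for a calibration loss
loss.backward()
act = torch.rand(8)                          # stand-in per-channel activation norms
scores = fused_importance(w.detach(), w.grad, act)
print(scores.argsort())                      # channel indices, least important first
```

Channels with the lowest fused score would be the first candidates for removal; how the score is aggregated across the coupled attention and MLP layers is specific to the paper's unidirectional coupling structure and is not reproduced here.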

Key words: Large language models, Model pruning, Taylor+ importance estimation, LoRA
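The PPL scores reported in the abstract come from zero-shot evaluation on WikiText2 and PTB. The paper's evaluation script is not reproduced on this page; the sketch below shows the standard way perplexity is usually computed for a causal language model (exponentiating the mean next-token cross-entropy), assuming the Hugging Face transformers API. The checkpoint path is a placeholder for a pruned LLaMA/Vicuna model.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model, tokenizer, text: str, max_len: int = 1024) -> float:
    """Perplexity = exp(mean next-token cross-entropy) over a text chunk."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_len)
    input_ids = enc.input_ids
    with torch.no_grad():
        # For causal LMs, passing labels=input_ids makes the model return the
        # mean next-token cross-entropy as .loss (labels are shifted internally).
        out = model(input_ids, labels=input_ids)
    return math.exp(out.loss.item())

# Placeholder checkpoint path; a pruned LLaMA-7B or Vicuna-7B would be loaded here.
name = "path/to/pruned-model"
model = AutoModelForCausalLM.from_pretrained(name).eval()
tokenizer = AutoTokenizer.from_pretrained(name)
print(perplexity(model, tokenizer, "The quick brown fox jumps over the lazy dog."))
```

Benchmark harnesses typically average this quantity over fixed-length windows of the whole test set rather than a single sentence, so absolute values depend on the windowing convention used.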

CLC Number: TP391