Computer Science ›› 2026, Vol. 53 ›› Issue (3): 136-142. DOI: 10.11896/jsjkx.250600087

• Database & Big Data & Data Science •

Data Compression of Instruction Fine-tuning for Large Models: Refinement Based on Inference Contribution

LI Hao, DING Lizhong, FU Jiarun, LINGHU Zhaohuan   

  1. Department of Computer Science, Beijing Institute of Technology, Beijing 100081, China
  • Received: 2025-06-12  Revised: 2026-01-09  Published: 2026-03-12
  • About author: LI Hao, born in 2003, postgraduate. His main research interests include emergence mechanisms and reasoning processes of large models.
    DING Lizhong, born in 1986, Ph.D, professor, Ph.D supervisor. His main research interests include deep statistical learning theory, deep kernel learning and deep generative models.
  • Supported by:
    National Key Research and Development Program of China (2022YFB2703100), National Natural Science Foundation of China (62376028), National Natural Science Foundation of China (U22A2099) and Excellent Young Scientists Fund (Overseas) of the National Natural Science Foundation of China.

Abstract: Instruction fine-tuning of large models on reasoning data significantly improves reasoning accuracy by explicitly modeling the multi-step logical dependencies of complex tasks. However, fine-tuning relies on massive amounts of high-quality data, causing a sharp increase in compute cost. Existing data compression techniques focus mainly on reducing raw scale and generally lack designs tailored to reasoning data; they ignore the multi-step logical associations and semantic dependencies within such data, which damages the integrity of key reasoning chains and thereby degrades reasoning performance. To address this, refinement based on inference contribution (RBIC) is proposed. A knowledge-domain graph is constructed by analyzing the semantic similarity of the data to accurately locate core information. RBIC combines the semantics of data samples with the reasoning accuracy of large models to divide samples into difficulty gradients that cover the reasoning requirements of all scenarios. Inference contribution is quantified through the logical complexity of multi-step reasoning data, and the samples that contribute most to the model's reasoning are refined into a compact subset. Experimental results show that after fine-tuning on the reasoning data refined by RBIC, the model's average reasoning performance decreases by only 1.13%, while training time is shortened to 16% of the original. This verifies that RBIC achieves a favorable balance between model performance and resource consumption, and is expected to promote the efficient deployment and fine-tuning of multi-domain large models in resource-constrained environments.
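The selection idea sketched in the abstract (rank samples by a reasoning-contribution score, then drop semantically redundant ones until a compression budget is met) can be illustrated roughly as follows. This is a minimal sketch, not the paper's actual RBIC algorithm: the function name, the greedy deduplication rule, the similarity threshold, and the use of a scalar `complexity` score as the contribution proxy are all illustrative assumptions.

```python
import numpy as np

def refine_by_contribution(embeddings, complexity, keep_ratio=0.16, sim_threshold=0.9):
    """Greedily keep high-contribution samples, skipping near-duplicates.

    embeddings : (n, d) array of sample embeddings (e.g. from a sentence encoder)
    complexity : (n,) array, an assumed scalar proxy for inference contribution
    keep_ratio : fraction of samples to retain (0.16 mirrors the 16% training-time figure)
    """
    # Cosine-normalize so dot products are cosine similarities.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    order = np.argsort(-complexity)                  # most contributive first
    budget = max(1, int(keep_ratio * len(complexity)))
    kept = []
    for i in order:
        if len(kept) == budget:
            break
        # Skip samples too similar to one already kept (redundant core information).
        if kept and np.max(normed[kept] @ normed[i]) > sim_threshold:
            continue
        kept.append(i)
    return sorted(kept)
```

In this sketch, redundancy is handled by a hard cosine-similarity threshold; the paper's knowledge-domain graph and difficulty-gradient partitioning would replace this with a structured analysis of semantic dependencies across reasoning chains.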

Key words: Large models, Instruction fine-tuning, Data compression, Inference contribution, Similarity analysis, Refinement

CLC Number: TP391.1