Computer Science ›› 2026, Vol. 53 ›› Issue (3): 136-142. doi: 10.11896/jsjkx.250600087

• Database & Big Data & Data Science •


Data Compression of Instruction Fine-tuning for Large Models: Refinement Based on Inference Contribution

LI Hao, DING Lizhong, FU Jiarun, LINGHU Zhaohuan   

  1. Department of Computer Science, Beijing Institute of Technology, Beijing 100081, China
  • Received: 2025-06-12  Revised: 2026-01-09  Online: 2026-03-12
  • Corresponding author: DING Lizhong (lizhong.ding@outlook.com)
  • About author: LI Hao, born in 2003, postgraduate (3220251322@bit.edu.cn). His main research interests include the emergence mechanisms and reasoning processes of large models.
    DING Lizhong, born in 1986, Ph.D, professor, Ph.D supervisor. His main research interests include deep statistical learning theory, deep kernel learning and deep generative models.
  • Supported by:
    National Key Research and Development Program of China (2022YFB2703100), National Natural Science Foundation of China (62376028), Joint Funds of the National Natural Science Foundation of China (U22A2099) and Excellent Young Scientists Fund (Overseas) of the National Natural Science Foundation of China.



Abstract: Instruction fine-tuning of large models on reasoning data significantly improves reasoning accuracy by explicitly modeling the multi-step logical correlations of complex tasks. However, fine-tuning relies on massive amounts of high-quality data, which sharply increases the computational cost. Existing data compression techniques focus mainly on reducing the raw data scale and generally lack designs tailored to reasoning data; they ignore the multi-step logical associations and semantic dependencies within it, damaging the integrity of key reasoning chains and thereby degrading reasoning performance. To address this, Refinement Based on Inference Contribution (RBIC) is proposed. RBIC constructs a knowledge-domain graph by analyzing the semantic similarity of reasoning data in order to accurately locate core information. It combines the semantics of data samples with the reasoning accuracy of the large model to divide the data into difficulty gradients that cover the reasoning requirements of all scenarios. The inference contribution is quantified through the logical complexity of multi-step reasoning data, and the samples that contribute most to the model's reasoning are refined. Experimental results show that after fine-tuning on the reasoning data refined by RBIC, the average reasoning performance of the model decreases by only 1.13%, while the training time is shortened to 16% of the original. This verifies that RBIC achieves an optimal balance between model effectiveness and resource consumption, and it is expected to facilitate the efficient deployment and fine-tuning of multi-domain large models in resource-constrained environments.
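The abstract describes RBIC only at a high level, so a minimal sketch may help make the pipeline concrete. The following Python sketch is an illustration, not the authors' implementation: the sentence-transformers encoder (all-MiniLM-L6-v2), the k-means clustering that stands in for the knowledge-domain graph, the step-count proxy for multi-step logical complexity, and the per-cluster selection budget are all assumptions introduced here.

# Illustrative RBIC-style data selection sketch (assumptions noted above).
from dataclasses import dataclass
from typing import List

import numpy as np
from sentence_transformers import SentenceTransformer  # assumed encoder choice
from sklearn.cluster import KMeans


@dataclass
class ReasoningSample:
    question: str
    solution: str          # multi-step chain-of-thought text
    model_accuracy: float  # base-model accuracy on this item, in [0, 1]


def logical_complexity(solution: str) -> float:
    # Crude proxy: number of non-empty reasoning steps (lines) in the solution.
    return float(sum(1 for line in solution.splitlines() if line.strip()))


def rbic_select(samples: List[ReasoningSample],
                n_domains: int = 8,
                keep_ratio: float = 0.16) -> List[ReasoningSample]:
    # Keep roughly keep_ratio of the data, balanced across semantic domains,
    # preferring samples with high estimated inference contribution.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    emb = encoder.encode([s.question for s in samples], normalize_embeddings=True)

    # Semantic-similarity "knowledge domains": k-means clusters over embeddings.
    domains = KMeans(n_clusters=n_domains, n_init=10, random_state=0).fit_predict(emb)

    # Difficulty gradient: harder samples are those the base model answers less accurately.
    difficulty = np.array([1.0 - s.model_accuracy for s in samples])

    # Inference contribution: logical complexity combined with difficulty (illustrative weighting).
    complexity = np.array([logical_complexity(s.solution) for s in samples])
    complexity = complexity / (complexity.max() + 1e-8)
    contribution = 0.5 * complexity + 0.5 * difficulty

    # Refine: keep the highest-contribution samples within each domain so that
    # every knowledge area stays represented in the compressed set.
    selected: List[ReasoningSample] = []
    for d in range(n_domains):
        idx = np.where(domains == d)[0]
        if idx.size == 0:
            continue
        budget = max(1, int(round(keep_ratio * idx.size)))
        top = idx[np.argsort(-contribution[idx])[:budget]]
        selected.extend(samples[i] for i in top)
    return selected

A faithful implementation would replace the step-count proxy and the fixed 0.5/0.5 weighting with the contribution measure defined in the paper, and would verify the selected subset by re-running the reasoning benchmarks after fine-tuning.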

Key words: Large models, Instruction fine-tuning, Data compression, Inference contribution, Similarity analysis, Refinement

CLC Number: TP391.1