Computer Science ›› 2023, Vol. 50 ›› Issue (6): 251-260. doi: 10.11896/jsjkx.220500100

• Artificial Intelligence •


Text Classification Method Based on Anti-noise and Double Distillation Technology

GUO Wei, HUANG Jiahui, HOU Chenyu, CAO Bin   

  1. College of Computer Science & Technology, Zhejiang University of Technology, Hangzhou 310023, China
  • Received:2022-05-12 Revised:2022-10-12 Online:2023-06-15 Published:2023-06-06
  • Corresponding author: HOU Chenyu (houcy@zjut.edu.cn)
  • About author: GUO Wei (weiguo1014@zjut.edu.cn), born in 2001, postgraduate, is a member of China Computer Federation. Her main research interest is natural language processing. HOU Chenyu, born in 1994, Ph.D, lecturer, Ph.D supervisor, is a member of China Computer Federation. His main research interests include data mining.
  • Supported by:
    National Natural Science Foundation of China(62276233) and Key R&D Program of Zhejiang Province(2022C01145).


Abstract: Text classification is an important and classic problem in natural language processing, often applied in scenarios such as news classification and sentiment analysis. Although deep learning-based classification methods have achieved considerable success, three problems remain in practice: 1) real-world text data contain a large number of noisy labels, and training a model directly on such data seriously degrades its performance; 2) with the introduction of pre-trained models, classification accuracy has improved, but model size and inference cost have also grown markedly, which makes it challenging to use pre-trained models on resource-limited devices; 3) pre-trained models perform a large amount of redundant computation, which leads to low prediction efficiency when the data volume is large. To address these issues, this paper proposes a text classification method that combines anti-noise processing with double distillation (knowledge distillation and self-distillation). A threshold-based anti-noise method built on confident learning, together with a new active learning sample selection algorithm, improves data quality at a small labeling cost. Meanwhile, combining knowledge distillation with self-distillation reduces model size and redundant computation, so that the inference speed can be adjusted flexibly according to demand. Extensive experiments on real datasets show that the proposed method improves accuracy by 1.18% after denoising and runs 4 to 8 times faster than BERT with only a small loss in accuracy.
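The data-cleaning step described above can be made concrete. Below is a minimal sketch, not the authors' code, of the two ideas: flagging likely label noise with class-wise confidence thresholds in the spirit of confident learning [6], then spending a small annotation budget on the most uncertain flagged samples, as in active learning. The function names, the margin-based selection rule and the default budget are illustrative assumptions.

```python
# Hedged sketch of threshold-based noise detection (confident learning style)
# plus uncertainty sampling for relabeling; not the paper's implementation.
import numpy as np

def flag_noisy_labels(pred_probs: np.ndarray, given_labels: np.ndarray) -> np.ndarray:
    """Boolean mask of samples whose given label looks noisy.

    pred_probs:   (n_samples, n_classes) out-of-sample predicted probabilities.
    given_labels: (n_samples,) observed (possibly noisy) integer labels;
                  assumes every class appears at least once.
    """
    n_classes = pred_probs.shape[1]
    # Per-class threshold t_j: mean predicted probability of class j over the
    # samples currently labeled j.
    thresholds = np.array([
        pred_probs[given_labels == j, j].mean() for j in range(n_classes)
    ])
    # Suspicious if confidence in the given label is below that label's
    # threshold while some (other) class clears its own threshold.
    own_conf = pred_probs[np.arange(len(given_labels)), given_labels]
    below_own = own_conf < thresholds[given_labels]
    above_any = (pred_probs >= thresholds).any(axis=1)
    return below_own & above_any

def pick_for_relabeling(pred_probs: np.ndarray, noisy_mask: np.ndarray, budget: int = 100) -> np.ndarray:
    """Uncertainty sampling: send the flagged samples with the smallest
    top-2 probability margin (least confident) to human annotators."""
    flagged = np.where(noisy_mask)[0]
    sorted_p = np.sort(pred_probs[flagged], axis=1)
    margin = sorted_p[:, -1] - sorted_p[:, -2]   # small margin = high uncertainty
    return flagged[np.argsort(margin)[:budget]]
```

Run on the out-of-sample probabilities of a classifier trained on the noisy set, the returned indices are the candidates that would be sent back for manual relabeling at a small labeling cost.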

Key words: Noisy labels, Confident learning, Active learning, Knowledge distillation, Self-distillation
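The compression side can be sketched in the same spirit. The PyTorch fragment below shows a standard soft-target knowledge-distillation loss from a BERT teacher to a smaller student, and a FastBERT-style [12] self-distillation inference loop in which per-layer classifiers exit early once their prediction entropy drops below a speed threshold; tuning that threshold is what trades accuracy for speed-ups of the kind reported in the abstract. Module names, the entropy criterion and the `speed` parameter are assumptions for illustration, not the paper's exact architecture.

```python
# Hedged sketch of (1) soft-target knowledge distillation and (2) entropy-based
# early exit over per-layer self-distilled classifiers; illustrative only.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T: float = 4.0, alpha: float = 0.5):
    """Tempered KL divergence to the teacher plus hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

@torch.no_grad()
def adaptive_inference(hidden_by_layer, layer_classifiers, speed: float = 0.5):
    """Exit at the first layer whose classifier is already confident.

    hidden_by_layer:   list of [1, hidden_size] sentence vectors, one per layer.
    layer_classifiers: one small classifier head per layer, trained by
                       self-distillation from the final layer's predictions.
    speed:             entropy threshold; larger values exit earlier (faster).
    """
    assert len(hidden_by_layer) == len(layer_classifiers) > 0
    for layer_idx, (h, clf) in enumerate(zip(hidden_by_layer, layer_classifiers)):
        probs = F.softmax(clf(h), dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
        if entropy.item() < speed:          # confident enough: stop here
            return probs, layer_idx
    return probs, layer_idx                 # fell through to the last layer
```

Because shallow exits skip the remaining Transformer layers, raising `speed` reduces the average depth used per sample and hence the inference time, which is consistent with the flexible speed/accuracy trade-off the method targets.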

CLC number: TP391

References
[1]HAN X,ZHAO W,DING N,et al.PTR:Prompt tuning with rules for text classification[J].AI Open,2022,3:182-192.
[2]MIKOLOV T,SUTSKEVER I,CHEN K,et al.Distributed representations of words and phrases and their compositionality[J].Advances in Neural Information Processing Systems,2013,26:3111-3119.
[3]VASWANI A,SHAZEER N,PARMAR N,et al.Attention is all you need[J].Advances in Neural Information Processing Systems,2017,30:6000-6010.
[4]DEVLIN J,CHANG M W,LEE K,et al.BERT:Pre-training of deep bidirectional transformers for language understanding[J].arXiv:1810.04805,2019.
[5]KOVALEVA O,ROMANOV A,ROGERS A,et al.Revealing the Dark Secrets of BERT[C]//Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing(EMNLP-IJCNLP).Hong Kong:Association for Computational Linguistics,2019:4365-4374.
[6]NORTHCUTT C,JIANG L,CHUANG I.Confident learning:Estimating uncertainty in dataset labels[J].Journal of Artificial Intelligence Research,2021,70:1373-1411.
[7]ZHANG H,CISSE M,DAUPHIN Y N,et al.mixup:Beyond empirical risk minimization[C]//International Conference on Learning Representations.Canada:OpenReview.net,2018:1-13.
[8]TIAN X X.An Improved Algorithm of Active Learning Based on Multiclass Classification[D].Baoding:Hebei University,2017.
[9]GORDON M,DUH K,ANDREWS N.Compressing BERT:Studying the Effects of Weight Pruning on Transfer Learning[C]//Proceedings of the 5th Workshop on Representation Learning for NLP.Online:Association for Computational Linguistics,2020:143-155.
[10]LAN Z,CHEN M,GOODMAN S,et al.ALBERT:A Lite BERT for Self-supervised Learning of Language Representations[C]//International Conference on Learning Representations.Addis Ababa:OpenReview.net,2020:1-17.
[11]GOU J,YU B,MAYBANK S J,et al.Knowledge distillation:A survey[J].International Journal of Computer Vision,2021,34:1-31.
[12]LIU W,ZHOU P,ZHAO Z,et al.FastBERT:A self-distilling BERT with adaptive inference time[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.Online:Association for Computational Linguistics,2020:6035-6044.
[13]TANAKA D,IKAMI D,YAMASAKI T,et al.Joint optimization framework for learning with noisy labels[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.Salt Lake City:IEEE,2018:5552-5560.
[14]LIN S,JI R,CHEN C,et al.ESPACE:Accelerating convolutional neural networks via eliminating spatial and channel redundancy[C]//Thirty-First AAAI Conference on Artificial Intelligence.San Francisco:AAAI Press,2017:1424-1430.
[15]ZAFRIR O,BOUDOUKH G,IZSAK P,et al.Q8BERT:Quantized 8bit BERT[C]//2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing-NeurIPS Edition(EMC2-NIPS).Vancouver:IEEE,2019:36-39.
[16]JIAO X Q,YIN Y C,SHANG L F,et al.TinyBERT:Distilling BERT for Natural Language Understanding[C]//Findings of the Association for Computational Linguistics(EMNLP 2020).Online:Association for Computational Linguistics,2020:4163-4174.
[17]SUN S,CHENG Y,GAN Z,et al.Patient Knowledge Distillation for BERT Model Compression[C]//Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing(EMNLP-IJCNLP).Hong Kong:Association for Computational Linguistics,2019:4323-4332.
[18]SANH V,DEBUT L,CHAUMOND J,et al.DistilBERT,a distilled version of BERT:smaller,faster,cheaper and lighter[J].arXiv:1910.01108,2019.
[19]QIU Y Y,LI H Z,LI S,et al.Revisiting correlations between intrinsic and extrinsic evaluations of word embeddings[C]//Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data.Cham:Springer,2018:209-221.
[20]SCHÜTZE H,MANNING C D,RAGHAVAN P.Introduction to information retrieval[M].Cambridge:Cambridge University Press,2008.
[21]CUI Y,CHE W,LIU T,et al.Pre-training with whole word masking for Chinese BERT[J].IEEE/ACM Transactions on Audio,Speech,and Language Processing,2021,29:3504-3514.
[22]LIU Y,OTT M,GOYAL N,et al.RoBERTa:A Robustly Optimized BERT Pretraining Approach[J].arXiv:1907.11692,2019.
[23]LI J,LIU X,ZHAO H,et al.BERT-EMD:Many-to-Many Layer Mapping for BERT Compression with Earth Mover's Distance[C]//Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing(EMNLP).Online:Association for Computational Linguistics,2020:3009-3018.