Computer Science ›› 2023, Vol. 50 ›› Issue (1): 185-193. doi: 10.11896/jsjkx.211100278

• Artificial Intelligence •

Study on Short Text Classification with Imperfect Labels

LIANG Haowei, WANG Shi, CAO Cungen   

  1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  • Received: 2021-11-29  Revised: 2022-09-01  Online: 2023-01-15  Published: 2023-01-09
  • About author: LIANG Haowei, born in 1996, postgraduate. His main research interests include natural language processing and deep learning.
    CAO Cungen, born in 1964, Ph.D., professor, Ph.D. supervisor, is a member of the China Computer Federation. His main research interests include large-scale knowledge processing and machine learning.
  • Supported by:
    Key Research and Development Project of the Ministry of Science and Technology: Development of an Open and Intelligent TCM Inheritance Information Management and Mining Platform (2017YFC1700302).

Abstract: Short text classification techniques have been widely studied. When these techniques are applied to domain-specific short texts in production, two problems commonly arise as textual data accumulates: an imperfect label set and a mistakenly labeled training dataset. First, the class label set is generally dynamic in nature. Second, when domain annotators label textual data, it is hard to distinguish some fine-grained class labels from others. To address these problems, this paper analyzes in depth the shortcomings of an actual, complex telecom-domain label set with numerous classes and proposes a conceptual model for the imperfect multi-classification label system. Based on this conceptual model, we introduce a semi-automatic method that iteratively detects and repairs the conflicts and omissions in a labeled dataset with the help of a seed dataset. After about six months of iteration to repair the conflicts and omissions caused by the dynamic label set and annotator mistakes, the F1-score of the BERT-based classification model exceeds 0.9 once the 10% of tickets with the lowest classification confidence are filtered out.
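As a concrete illustration, below is a minimal, hypothetical Python sketch of the two steps the abstract describes: flagging conflict candidates by comparing annotated labels against confident predictions from a model trained on a trusted seed dataset, and evaluating F1 after filtering out the lowest-confidence 10% of predictions. All function names and thresholds are illustrative, and a TF-IDF plus logistic-regression classifier stands in for the paper's BERT-based model so that the example stays self-contained; the actual procedure is described in the full paper.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

def detect_conflicts(seed_texts, seed_labels, texts, labels, threshold=0.9):
    """Flag examples whose annotated label disagrees with a confident
    prediction from a model trained on the trusted seed dataset."""
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(seed_texts, seed_labels)
    proba = model.predict_proba(texts)
    pred = model.classes_[proba.argmax(axis=1)]
    conf = proba.max(axis=1)
    # Conflict candidates: the model confidently predicts a label that
    # differs from the one the annotator assigned.
    return [i for i in range(len(texts))
            if conf[i] >= threshold and pred[i] != labels[i]], model

def f1_after_filtering(model, texts, labels, drop_fraction=0.10):
    """Evaluate macro F1 after discarding the lowest-confidence fraction
    of predictions, mirroring the 10% filtering step in the abstract."""
    proba = model.predict_proba(texts)
    conf = proba.max(axis=1)
    keep = conf >= np.quantile(conf, drop_fraction)
    pred = model.classes_[proba.argmax(axis=1)]
    return f1_score(np.asarray(labels)[keep], pred[keep], average="macro")

In the spirit of the semi-automatic method above, each round would hand the flagged conflict candidates back to domain annotators for review, fold the repaired examples into the seed dataset, and rerun the loop until few new conflicts surface.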

Key words: Imperfect multi-classification label system, Fine-grained short text classification, Class labeling, Data cleaning

CLC Number: TP391