Computer Science (计算机科学) ›› 2023, Vol. 50 ›› Issue (1): 185-193. doi: 10.11896/jsjkx.211100278
LIANG Haowei (梁浩玮), WANG Shi (王石), CAO Cungen (曹存根)
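Abstract: Short-text classification has been studied extensively in recent years. In practical applications, however, as text data accumulates, practitioners often run into problems with the classification scheme itself and with the data annotation problems it causes. The reasons are that a classification label system is usually dynamic, and that the labels within it are hard to tell apart. To address this, the paper carries out a concrete analysis of a telecom complaint work-order analysis service in a certain province, a task with a large number of classification labels, and proposes a conceptual model of an imperfect multi-class label system. On this basis, for the annotation conflicts and omissions in the dataset, it proposes a detection and semi-automatic repair method based on a high-quality seed training set. The method repairs the annotation conflicts and omissions caused by the dynamic nature of the classification scheme and by manual labeling errors. After 6 months of online operation, and after filtering out the 10% of complaint work orders whose classification confidence is too low, the BERT-based classification model reaches an F1 score of up to 0.9.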
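The abstract does not spell out the detection or filtering procedure, so the following is only a minimal sketch of one way the two ideas could look in code, not the authors' method. It assumes a classifier trained on the seed set has already produced per-class probabilities; the function names `flag_conflicts` and `drop_lowest_confidence`, the `threshold` value, and the use of the top-class probability as "confidence" are all hypothetical choices made for illustration.

```python
from typing import List, Sequence


def flag_conflicts(probs: Sequence[Sequence[float]],
                   labels: Sequence[int],
                   threshold: float = 0.9) -> List[int]:
    """Return indices of samples whose annotated label disagrees with a
    confident prediction of the (assumed) seed-trained classifier.

    Flagged samples are candidates for human review, which is what makes
    the repair semi-automatic in this sketch.
    """
    flagged = []
    for i, (p, y) in enumerate(zip(probs, labels)):
        pred = max(range(len(p)), key=p.__getitem__)  # argmax over classes
        if pred != y and p[pred] >= threshold:
            flagged.append(i)
    return flagged


def drop_lowest_confidence(probs: Sequence[Sequence[float]],
                           fraction: float = 0.1) -> List[int]:
    """Return indices kept after discarding the given fraction of samples
    with the lowest top-class probability (assumed confidence measure)."""
    conf = [max(p) for p in probs]
    n_drop = int(len(conf) * fraction)
    order = sorted(range(len(conf)), key=lambda i: conf[i])
    dropped = set(order[:n_drop])
    return [i for i in range(len(conf)) if i not in dropped]


# Toy probabilities for a two-class problem (illustrative values only).
toy_probs = [[0.95, 0.05], [0.60, 0.40], [0.10, 0.90]]
toy_labels = [1, 0, 1]
conflicts = flag_conflicts(toy_probs, toy_labels)
```

Here only the first sample is flagged: it is annotated as class 1 while the model assigns class 0 a probability of 0.95. In a real pipeline the probabilities would come from the BERT-based model mentioned in the abstract, and the flagged items would go to an annotator rather than being relabeled automatically.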