Computer Science ›› 2023, Vol. 50 ›› Issue (1): 185-193. doi: 10.11896/jsjkx.211100278

• Artificial Intelligence •

Research on Domain Short Text Classification Methods under an Imperfect Multi-classification Label System

LIANG Haowei, WANG Shi, CAO Cungen

  1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
  • Received: 2021-11-29  Revised: 2022-09-01  Online: 2023-01-15  Published: 2023-01-09
  • Corresponding author: CAO Cungen (cgcao@ict.ac.cn)
  • About author: liang199611@outlook.com
  • Supported by:
    Development of an Open and Intelligent TCM Inheritance Information Management and Mining Platform, a project of the Key Research and Development Program of the Ministry of Science and Technology (2017YFC1700302).

Study on Short Text Classification with Imperfect Labels

LIANG Haowei, WANG Shi, CAO Cungen   

  1. Institute of Computing Technology,Chinese Academy of Sciences,Beijing 100190,China
  • Received:2021-11-29 Revised:2022-09-01 Online:2023-01-15 Published:2023-01-09
  • About author: LIANG Haowei, born in 1996, postgraduate. His main research interests include natural language processing and deep learning.
    CAO Cungen, born in 1964, Ph.D, professor, Ph.D supervisor, is a member of China Computer Federation. His main research interests include large-scale knowledge processing and machine learning.
  • Supported by:
    Development of an Open and Intelligent TCM Inheritance Information Management and Mining Platform, a project of the Key Research and Development Program of the Ministry of Science and Technology (2017YFC1700302).

Abstract: Short text classification has been studied extensively in recent years. In practical applications, however, as textual data accumulates, problems with the classification label system, and the data labeling problems they cause, arise frequently, because the label system is usually dynamic and some class labels in it are difficult to distinguish from one another. This paper therefore carries out a concrete analysis of a provincial telecom complaint-ticket analysis business involving a large number of class labels and proposes a conceptual model of the imperfect multi-classification label system. On this basis, for the labeling conflicts and omissions in the dataset, a detection and semi-automatic repair method based on a high-quality seed training set is proposed to repair the conflicts and omissions caused by the dynamics of the label system and by manual labeling errors. After six months of online operation, and after filtering out the 10% of complaint tickets with the lowest classification confidence, the F1 score of the BERT-based classification model reaches 0.9.
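
The seed-set-based detection step summarized above could, under one plausible reading, be realized by training a classifier on the high-quality seed set and flagging every ticket whose existing label disagrees with a confident model prediction, so that annotators only review the flagged cases. The following Python sketch illustrates this idea; the function name flag_label_conflicts, the 0.9 confidence threshold, and the scikit-learn-style predict_proba interface are assumptions made for illustration, not details of the authors' implementation.

# Hypothetical sketch: flag labeling conflicts by comparing a classifier trained on
# the high-quality seed set against the existing human labels. All names, thresholds,
# and interfaces here are illustrative assumptions.
from typing import List, Tuple

def flag_label_conflicts(texts: List[str],
                         human_labels: List[str],
                         seed_model,                # assumed: classifier exposing a predict_proba() method
                         label_names: List[str],
                         min_confidence: float = 0.9) -> List[Tuple[int, str, str, float]]:
    """Return (index, human_label, predicted_label, confidence) for suspected conflicts."""
    conflicts = []
    for i, (text, gold) in enumerate(zip(texts, human_labels)):
        probs = seed_model.predict_proba([text])[0]    # class-probability vector for this ticket
        pred_idx = int(probs.argmax())
        pred, conf = label_names[pred_idx], float(probs[pred_idx])
        # Only disagreements the seed-set model is confident about are sent back to annotators,
        # which keeps the repair process semi-automatic rather than fully automatic.
        if pred != gold and conf >= min_confidence:
            conflicts.append((i, gold, pred, conf))
    return conflicts

In a workflow like the one described in the abstract, the flagged tickets would then be corrected manually before the model is retrained on the repaired dataset.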

Key words: Imperfect multi-classification label system, Fine-grained short text classification, Class labeling, Data cleaning

Abstract: Short text classification techniques have been widely studied. When these techniques are applied to domain short texts in production, two problems often arise as textual data accumulates: an imperfect label system and a mistakenly labeled training dataset. First, the class label set is generally dynamic in nature. Second, when domain annotators label textual data, some fine-grained class labels are hard to distinguish from others. To address these problems, this paper analyzes in depth the shortcomings of an actual, complex telecom-domain label set with numerous classes and proposes a conceptual model of the imperfect multi-classification label system. Based on this conceptual model, we introduce a semi-automatic method that, with the help of a seed dataset, iteratively detects and repairs the conflicts and omissions in a labeled dataset caused by the dynamic label set and by annotator mistakes. After about six months of iteration, the F1 score of the BERT-based classification model exceeds 0.9 once the 10% of tickets with the lowest classification confidence are filtered out.
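
The evaluation described above, in which the 10% of tickets with the lowest classification confidence are filtered out before scoring, can be sketched as follows. The percentile-based cut-off, the macro-averaged F1, and the array-shaped inputs are assumptions chosen for illustration rather than details reported in the paper.

# Hypothetical sketch of confidence-based filtering before evaluation: discard the
# drop_fraction of tickets the classifier is least sure about, then compute F1 on the rest.
import numpy as np
from sklearn.metrics import f1_score

def f1_after_confidence_filter(probabilities: np.ndarray,   # softmax outputs, shape (n_tickets, n_classes)
                               true_labels: np.ndarray,      # integer class ids, shape (n_tickets,)
                               drop_fraction: float = 0.10) -> float:
    confidences = probabilities.max(axis=1)                  # top-class probability for each ticket
    predictions = probabilities.argmax(axis=1)
    cutoff = np.quantile(confidences, drop_fraction)         # threshold below which tickets are dropped
    keep = confidences >= cutoff
    return f1_score(true_labels[keep], predictions[keep], average="macro")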

Key words: Imperfect multi-classification label system, Fine-grained short text classification, Class labeling, Data cleaning

CLC number: TP391

[1]MINAEE S,KALCHBRENNER N,CAMBRIA E,et al.Deep learning-based text classification:a comprehensive review[J].ACM Computing Surveys(CSUR),2021,54(3):1-40.
[2]DEVLIN J,CHANG M W,LEE K,et al.BERT:pre-training of deep bidirectional transformers for language understanding[J].arXiv:1810.04805,2018.
[3]ZHU Y,TING K M,ZHOU Z H.Multi-label learning with emerging new labels[J].IEEE Transactions on Knowledge and Data Engineering,2018,30(10):1901-1914.
[4]VASWANI A,SHAZEER N,PARMAR N,et al.Attention is all you need[C]//Advances in Neural Information Processing Systems.MIT Press,2017:5998-6008.
[5]ZHOU J,MA C,LONG D,et al.Hierarchy-aware global model for hierarchical text classification[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.ACL,2020:1106-1117.
[6]SONG H,KIM M,PARK D,et al.Learning from noisy labels with deep neural networks:a survey[J].arXiv:2007.08199,2020.
[7]NATARAJAN N,DHILLON I S,RAVIKUMAR P,et al.Learning with noisy labels[C]//Advances in Neural Information Processing Systems.MIT Press,2013:1196-1204.
[8]REED S,LEE H,ANGUELOV D,et al.Training deep neural networks on noisy labels with bootstrapping[J].arXiv:1412.6596,2014.
[9]REN M,ZENG W,YANG B,et al.Learning to reweight examples for robust deep learning[C]//Proceedings of the International Conference on Machine Learning.JMLR,2018:4334-4343.
[10]ZHANG Z,ZHANG H,ARIK S O,et al.Distilling effective supervision from severe label noise[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.IEEE,2020:9294-9303.
[11]LI Y,YANG J,SONG Y,et al.Learning from noisy labels with distillation[C]//Proceedings of the IEEE International Conference on Computer Vision.IEEE,2017:1928-1936.
[12]JIANG L,ZHOU Z,LEUNG T,et al.MentorNet:learning data-driven curriculum for very deep neural networks on corrupted labels[C]//Proceedings of the International Conference on Machine Learning.JMLR,2018:2304-2313.
[13]HAN B,YAO Q,YU X,et al.Co-teaching:robust training of deep neural networks with extremely noisy labels[J].arXiv:1804.06872,2018.
[14]WANG Y,LIU W,MA X,et al.Iterative learning with open-set noisy labels[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.IEEE,2018:8688-8696.
[15]KRISHNAN S,WANG J,WU E,et al.ActiveClean:interactive data cleaning for statistical modeling[J].Proceedings of the VLDB Endowment,2016,9(12):948-959.
[16]SETTLES B.Active learning literature survey:Computer Sciences Technical Report 1648[R].Madison:University of Wisconsin-Madison,2009.
[17]ABE N.Query learning strategies using boosting and bagging[C]//Proceedings of the Fifteenth International Conference on Machine Learning.JMLR,1998:1-9.
[18]YAKOUT M,BERTI-ÉQUILLE L,ELMAGARMID A K.Don't be scared:use scalable automatic repairing with maximal likelihood and bounded changes[C]//Proceedings of the 2013 International Conference on Management of Data.ACM Press,2013:553-564.
[19]NGUYEN H T,SMEULDERS A W M.Active learning using pre-clustering[C]//Proceedings of the International Conference on Machine Learning.JMLR,2004:623-630.
[20]CHEN P,LIAO B B,CHEN G,et al.Understanding and utilizing deep neural networks trained with noisy labels[C]//Proceedings of the International Conference on Machine Learning.JMLR,2019:1062-1070.
[21]LI J,SUN M,ZHANG X.A comparison and semi-quantitative analysis of words and character-bigrams as features in Chinese text categorization[C]//Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics.ACL,2006:545-552.