Computer Science ›› 2024, Vol. 51 ›› Issue (2): 87-99. doi: 10.11896/jsjkx.221100264

• Database & Big Data & Data Science •


Label Noise Filtering Framework Based on Outlier Detection

XU Maolong1, JIANG Gaoxia1, WANG Wenjian1,2   

  1 College of Computer and Information Technology, Shanxi University, Taiyuan 030006, China
    2 Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan 030006, China
  • Received: 2022-11-30  Revised: 2023-04-03  Online: 2024-02-15  Published: 2024-02-22
  • Corresponding author: WANG Wenjian (wjwang@sxu.edu.cn)
  • About author: XU Maolong, born in 1996, master (xumaolong4094@foxmail.com). His main research interest is machine learning. WANG Wenjian, born in 1968, Ph.D., professor, is an outstanding member of CCF (No. 16143D). Her main research interests include image processing, machine learning and computing intelligence.
  • Supported by:
    National Natural Science Foundation of China (U21A20513, 62076154, 61906113) and Scientific and Technological Innovation Programs of Higher Education Institutions in Shanxi Province (2020L0007).


Abstract: Noise is an important factor affecting the reliability of machine learning models, and label noise has a more decisive influence on model training than feature noise. Reducing label noise is therefore a key step in classification tasks. Noise filtering is an effective way to deal with label noise, since it neither requires estimating the noise rate nor relies on any loss function. However, most existing filtering algorithms suffer from an overcleaning phenomenon. To address this problem, a label noise filtering framework based on outlier detection is first proposed, and a label noise filtering algorithm via adaptive nearest neighbor clustering (AdNN) is then presented under this framework. AdNN considers each class of the classification problem separately and transforms label noise detection into an outlier detection problem, identifying the outliers of every class. Non-noise samples among these outliers are then removed according to relative density, yielding a noise candidate set, and the outliers remaining in the candidate set are finally identified and filtered as label noise by a defined noise factor. Experiments on synthetic and benchmark datasets show that the proposed noise filtering method not only alleviates the overcleaning phenomenon, but also achieves good noise filtering and classification prediction performance.
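
Since the page gives only a high-level description of AdNN, the following Python sketch (not taken from the paper) illustrates the three-stage idea the abstract describes: per-class outlier detection with nearest neighbors, a relative-density screen that removes non-noise samples from the outliers to form a noise candidate set, and a noise-factor test that filters the remaining candidates. The function name adnn_filter_sketch, the use of a fixed k in place of the adaptive k-nearest-neighbor scheme, and the concrete outlier, density, and noise-factor definitions and thresholds are all illustrative assumptions, not the authors' exact algorithm.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

def adnn_filter_sketch(X, y, k=10, density_ratio=1.0, noise_threshold=0.5):
    """Per-class label-noise filtering in the spirit of the proposed framework.

    Step 1: flag within-class outliers (samples unusually far from their
            same-class k nearest neighbors).
    Step 2: relative-density screen -- outliers whose local density is close to
            that of their neighbors are treated as normal samples and dropped,
            leaving a noise candidate set.
    Step 3: noise factor -- candidates whose global neighborhood mostly carries
            a different label are filtered out as label noise.
    The specific scores and thresholds here are illustrative assumptions.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    noise_idx = []

    for c in np.unique(y):
        idx = np.where(y == c)[0]
        Xc = X[idx]
        k_c = min(k, len(idx) - 1)
        if k_c < 1:
            continue

        nn = NearestNeighbors(n_neighbors=k_c + 1).fit(Xc)
        dist, nbr = nn.kneighbors(Xc)          # column 0 is the point itself
        knn_dist = dist[:, 1:].mean(axis=1)    # mean distance to same-class neighbors

        # Step 1: within-class outliers.
        outliers = np.where(knn_dist > knn_dist.mean() + knn_dist.std())[0]

        # Step 2: relative-density screen to build the noise candidate set.
        density = 1.0 / (knn_dist + 1e-12)
        candidates = [i for i in outliers
                      if density[i] < density_ratio * density[nbr[i, 1:]].mean()]

        # Step 3: noise factor over global neighbors (all classes).
        if candidates:
            g_nn = NearestNeighbors(n_neighbors=min(k, len(X) - 1) + 1).fit(X)
            _, g_nbr = g_nn.kneighbors(X[idx[candidates]])
            for j, i in enumerate(candidates):
                noise_factor = np.mean(y[g_nbr[j, 1:]] != c)   # label disagreement
                if noise_factor >= noise_threshold:
                    noise_idx.append(idx[i])

    noise_idx = np.array(sorted(noise_idx), dtype=int)
    keep_idx = np.setdiff1d(np.arange(len(y)), noise_idx)
    return keep_idx, noise_idx

# Toy usage: flip 10% of the labels of a 3-class blob dataset and filter them.
X, y = make_blobs(n_samples=300, centers=3, random_state=0)
rng = np.random.default_rng(0)
flip = rng.choice(len(y), size=30, replace=False)
y_noisy = y.copy()
y_noisy[flip] = (y_noisy[flip] + 1) % 3
keep_idx, flagged = adnn_filter_sketch(X, y_noisy, k=10)
print(f"flagged {len(flagged)} of {len(y)} samples as potential label noise")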

Key words: Label noise filtering, Outlier detection, Adaptive k-nearest neighbors, Relative density, Noise factor

CLC Number: TP181