Computer Science, 2024, Vol. 51, Issue (2): 87-99. doi: 10.11896/jsjkx.221100264
许茂龙1, 姜高霞1, 王文剑1,2
XU Maolong1, JIANG Gaoxia1, WANG Wenjian1,2
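Abstract: Noise is an important factor affecting the reliability of machine learning models, and label noise has a more decisive effect on model training than feature noise. Noise filtering is an effective way to handle label noise: it requires neither an estimate of the noise rate nor reliance on any loss function. However, most existing label noise filtering algorithms suffer from over-cleaning. To address this problem, this paper proposes a label noise filtering framework based on anomaly detection and, within this framework, presents an adaptive nearest neighbor clustering algorithm for label noise filtering, AdNN (Label Noise Filtering via Adaptive Nearest Neighbor Clustering). The algorithm considers each class of the classification problem separately, converts label noise detection into outlier detection, and identifies the outliers of each class. It then uses relative density to remove non-noise samples from the outliers, yielding a noise candidate set. Finally, it applies a noise factor to identify and filter the noisy samples among the outliers in the candidate set. Experimental results on synthetic and public datasets show that the proposed noise filtering method alleviates over-cleaning while achieving good noise filtering and classification prediction performance.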
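The three stages named in the abstract (per-class outlier detection, pruning by relative density, and a final noise-factor decision) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the AdNN algorithm itself: the abstract does not define the outlier detector, the relative density, or the noise factor, so the mean k-nearest-neighbor distance rule, the own-class versus other-class distance comparison, and the neighbor-disagreement score used here are stand-ins. The names `filter_label_noise`, `mean_knn_distance`, `k`, `z`, and `noise_threshold` are likewise hypothetical.

```python
import numpy as np


def mean_knn_distance(x, points, k):
    """Mean Euclidean distance from x to its k nearest rows of `points`."""
    if len(points) == 0:
        return np.inf
    dists = np.sort(np.linalg.norm(points - x, axis=1))
    return dists[:k].mean()


def filter_label_noise(X, y, k=5, z=1.0, noise_threshold=0.5):
    """Return a boolean mask marking samples judged to be label noise.

    All three rules below are illustrative assumptions, not the paper's
    definitions of outlier, relative density, or noise factor.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    noisy = np.zeros(len(y), dtype=bool)

    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        if len(idx) < 2:
            continue  # a single sample has no same-class neighbors to compare
        others = X[y != c]

        # Stage 1 (assumed rule): a sample is an outlier of class c when its
        # mean distance to same-class neighbors is unusually large.
        own_d = np.array(
            [mean_knn_distance(X[i], X[idx[idx != i]], k) for i in idx]
        )
        is_outlier = own_d > own_d.mean() + z * own_d.std()

        for i, d_own in zip(idx[is_outlier], own_d[is_outlier]):
            # Stage 2 (assumed rule): keep the outlier as a noise candidate
            # only if it lies in a denser region of other classes than of
            # its own class, i.e. other-class neighbors are closer.
            d_other = mean_knn_distance(X[i], others, k)
            if d_other >= d_own:
                continue  # isolated sample, but not evidence of a wrong label

            # Stage 3 (assumed rule): the noise factor is the fraction of the
            # k nearest neighbors overall whose label differs from c.
            dists = np.linalg.norm(X - X[i], axis=1)
            dists[i] = np.inf  # exclude the sample itself
            neighbors = np.argsort(dists)[:k]
            noise_factor = np.mean(y[neighbors] != c)
            noisy[i] = noise_factor >= noise_threshold

    return noisy
```

In this sketch, stage 2 is what guards against over-cleaning: a sample that is merely far from its own class, without being close to another class, is dropped from the candidate set instead of being filtered. Calling `filter_label_noise(X, y)` returns a mask, and `X[~mask], y[~mask]` would give the cleaned training set.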