Computer Science ›› 2024, Vol. 51 ›› Issue (2): 87-99. doi: 10.11896/jsjkx.221100264

• Database & Big Data & Data Science •

Label Noise Filtering Framework Based on Outlier Detection

XU Maolong1, JIANG Gaoxia1, WANG Wenjian1,2   

  1. College of Computer and Information Technology, Shanxi University, Taiyuan 030006, China
  2. Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan 030006, China
  • Received:2022-11-30 Revised:2023-04-03 Online:2024-02-15 Published:2024-02-22
  • About author: XU Maolong, born in 1996, master. His main research interest is machine learning. WANG Wenjian, born in 1968, Ph.D, professor, is an outstanding member of CCF (No.16143D). Her main research interests include image processing, machine learning and computing intelligence.
  • Supported by:
    National Natural Science Foundation of China(U21A20513,62076154,61906113) and Scientific and Technological Innovation Programs of Higher Education Institutions in Shanxi Province(2020L0007).

Abstract: Noise is an important factor affecting the reliability of machine learning models, and label noise has a more decisive influence on model training than feature noise. Reducing label noise is therefore a key step in classification tasks. Noise filtering is an effective way to handle label noise: it requires neither an estimate of the noise rate nor any particular loss function. However, most filtering algorithms tend to overclean, that is, to remove correctly labeled samples along with the noisy ones. To address this problem, a label noise filtering framework based on outlier detection is first proposed, and a label noise filtering algorithm via adaptive nearest neighbor clustering (AdNN) is then presented. AdNN transforms label noise detection into an outlier detection problem. It considers the samples of each category separately and identifies all outliers. Based on relative density, samples that are merely outliers are disregarded, while the real label noise among the outliers is identified and removed using a defined noise factor. Experiments on synthetic and benchmark datasets show that the proposed method not only alleviates overcleaning but also achieves good noise filtering and classification performance.
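
The general idea of treating label noise as a per-class outlier problem can be illustrated with a small sketch. This is not the authors' AdNN algorithm: the abstract does not give the definitions of relative density or the noise factor, so the sketch substitutes simple stand-ins. The function names (`knn_indices`, `filter_label_noise`), the fixed `k`, the sparsity ratio used in place of relative density, the neighbor-disagreement share used in place of the noise factor, and both thresholds are all assumptions made for illustration.

```python
import math
from statistics import median


def knn_indices(points, i, candidates, k):
    """Return up to k indices from `candidates` nearest to points[i], excluding i."""
    others = [j for j in candidates if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]


def filter_label_noise(points, labels, k=3, sparsity_threshold=2.0, noise_threshold=0.5):
    """Return sorted indices flagged as label noise (illustrative stand-in, not AdNN)."""
    everyone = range(len(points))
    flagged = []
    for cls in set(labels):
        members = [i for i in everyone if labels[i] == cls]
        if len(members) < 2:
            continue  # too few samples in this class to judge density
        # Step 1: per class, mean distance to the k nearest same-class neighbors.
        mean_dist = {}
        for i in members:
            nbrs = knn_indices(points, i, members, k)
            mean_dist[i] = sum(math.dist(points[i], points[j]) for j in nbrs) / len(nbrs)
        typical = median(mean_dist.values())
        # Step 2: a sample much sparser than its class is an outlier candidate
        # (assumed proxy for a relative-density criterion).
        outliers = [i for i in members if mean_dist[i] > sparsity_threshold * typical]
        # Step 3: assumed noise factor = share of the k nearest neighbors over
        # the whole dataset that carry a different label.
        for i in outliers:
            nbrs = knn_indices(points, i, everyone, k)
            factor = sum(labels[j] != cls for j in nbrs) / len(nbrs)
            if factor > noise_threshold:
                flagged.append(i)
    return sorted(flagged)


# Two clusters, one sample of class 0 placed inside the class 1 cluster
# (index 10), and one isolated but correctly labeled sample (index 11).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5),
          (10.4, 10.6), (0, -8)]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
print(filter_label_noise(points, labels))  # [10]
```

In this toy data both index 10 and index 11 are outliers within class 0, but only index 10 is surrounded by samples of another class, so only it is removed. Keeping the isolated but correctly labeled sample is the kind of behavior that limits overcleaning.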

Key words: Label noise filtering, Outlier detection, Adaptive k-nearest neighbors, Relative density, Noise factor

CLC Number: TP181