Computer Science, 2023, Vol. 50, Issue (6): 151-158. doi: 10.11896/jsjkx.220600130

• Database & Big Data & Data Science •

Noise Estimation and Filtering Methods with Limit Distance

JIANG Gaoxia1, QIN Pei1, WANG Wenjian1,2

  1. 1 School of Computer and Information Technology, Shanxi University, Taiyuan 030006, China
    2 Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education (Shanxi University), Taiyuan 030006, China
  • Received: 2022-06-14 Revised: 2022-11-23 Online: 2023-06-15 Published: 2023-06-06
  • Corresponding author: WANG Wenjian (wjwang@sxu.edu.cn)
  • About author: JIANG Gaoxia (jianggaoxia@sxu.edu.cn), born in 1987, Ph.D, associate professor, is a member of China Computer Federation. His main research interests include machine learning and data mining. WANG Wenjian, born in 1968, Ph.D, professor, is an outstanding member of China Computer Federation. Her main research interests include machine learning and computing intelligence.
  • Supported by:
    National Natural Science Foundation of China (U21A20513, 62276161, 62076154, 61906113, U1805263) and Key R&D Program of Shanxi Province International Cooperation (201903D421050).

Abstract: In recent years, machine learning has made remarkable progress and has been successfully applied in many fields. However, many learning models and algorithms depend heavily on the label quality of the data. Complex label noise is common in the datasets used in practical applications, so machine learning faces severe challenges in modeling low-quality data and handling label noise. To address numerical label noise in regression, this paper studies the relationship between the label estimation interval and the noise through theoretical analysis and simulation experiments, and proposes a limit distance noise estimation method. Under the optimal sample selection framework, a limit distance noise filtering (LDNF) algorithm is then proposed based on this noise estimator. Experimental results show that the proposed noise estimate has a higher correlation with the true label noise and a lower estimation bias. On benchmark datasets and real age estimation datasets, the proposed LDNF algorithm effectively identifies label noise and reduces the test error of the model under different noise environments, and it outperforms other recent filtering algorithms.

Key words: Numerical label noise, Regression, Noise estimation, Limit distance noise filtering

CLC Number: 

  • TP181