Computer Science ›› 2020, Vol. 47 ›› Issue (6): 79-84. doi: 10.11896/jsjkx.190600041
YU Meng-chi (余孟池), MU Jia-peng (牟甲鹏), CAI Jian (蔡剑), XU Jian (徐建)
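Abstract: The integrity of sample labels has a significant effect on classification accuracy in supervised learning. In real-world data, however, factors such as randomness in the labeling process and the non-expertise of annotators mean that labels are inevitably contaminated by noise; that is, a sample's observed label differs from its true label. To reduce the negative impact of noisy labels on classifier accuracy, this paper proposes a noisy-label correction method. The method uses a base classifier to classify the observed samples and estimate the noise rate, which identifies the samples carrying noisy labels; it then relabels those samples using the base classifier's predictions, yielding a data set in which the noisy labels have been corrected. Experiments on synthetic and real data sets show that the relabeling algorithm improves classification results across different base classifiers and different noise rates. On synthetic data sets, accuracy improves by about 5% over a baseline with no denoising. On the CIFAR and MNIST data sets under high noise rates, the F1 score of the relabeling algorithm is on average more than 7% higher than that of Elk08 and Nat13, and 53% higher than that of the baseline with no denoising.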
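The identify-then-relabel pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the noise rate is estimated or how noisy samples are selected, so everything below is an assumption made for illustration: binary 0/1 labels, a base classifier that has already produced a probability of the positive class for each sample (ideally out-of-fold), a hypothetical `margin` parameter that decides when a disagreement is confident enough to count as noise, and the hypothetical function name `correct_labels`. It is not the authors' algorithm.

```python
def correct_labels(observed, prob_pos, margin=0.2):
    """Flag and relabel samples whose observed label the base classifier
    confidently contradicts.

    observed: list of observed 0/1 labels (possibly noisy).
    prob_pos: base classifier's estimate of P(y = 1 | x) for each sample.
    margin:   hypothetical confidence margin around 0.5 (an assumption,
              not taken from the paper).
    Returns (corrected labels, flagged indices, estimated noise rate).
    """
    corrected = list(observed)
    flagged = []
    for i, (label, p) in enumerate(zip(observed, prob_pos)):
        predicted = 1 if p >= 0.5 else 0
        confident = abs(p - 0.5) >= margin
        # Treat a confident disagreement as a noisy label and relabel
        # the sample with the base classifier's prediction.
        if predicted != label and confident:
            flagged.append(i)
            corrected[i] = predicted
    # Crude noise-rate estimate: the share of samples flagged as noisy.
    noise_rate = len(flagged) / len(observed) if observed else 0.0
    return corrected, flagged, noise_rate


observed = [1, 1, 0, 0, 1, 0]
prob_pos = [0.9, 0.1, 0.2, 0.95, 0.45, 0.3]
corrected, flagged, noise_rate = correct_labels(observed, prob_pos)
print(corrected, flagged, round(noise_rate, 3))
```

In this toy input, samples 1 and 3 are relabeled because the classifier contradicts them with high confidence, while sample 4 keeps its observed label because the disagreement falls inside the margin. A classifier retrained on the corrected labels would then play the role of the final model.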
[1] MIRYLENKA K, GIANNAKOPOULOS G, DO L M, et al. On classifier behavior in the presence of mislabeling noise[J]. Data Mining and Knowledge Discovery, 2017, 31(3): 661-701.
[2] KHETA A, LIPTON Z C, ANANDKUMAR A. Learning From Noisy Singly-labeled Data[OL]. https://arxiv.org/abs/1712.04577.
[3] FRENAY B, VERLEYSEN M. Classification in the Presence of Label Noise: A Survey[J]. IEEE Transactions on Neural Networks and Learning Systems, 2014, 25(5): 845-869.
[4] NICHOLSON B, SHENG V S, ZHANG J, et al. Label Noise Correction Methods[C]//IEEE International Conference on Data Science and Advanced Analytics. Shanghai: IEEE, 2015: 1-9.
[5] QI Z A. Learning from Limited and Imperfect Tagging[D]. Hangzhou: Zhejiang University, 2013.
[6] LIU M J, WANG X F. Data Preprocessing in Data Mining[J]. Computer Science, 2000, 27(4): 54-57.
[7] LI J, WONG Y, ZHAO Q, et al. Learning to Learn from Noisy Labeled Data[OL]. https://arxiv.org/abs/1812.05214.
[8] MANWANI N, SASTRY P S. Noise tolerance under risk minimization[J]. IEEE Transactions on Cybernetics, 2013, 43(3): 1146-1151.
[9] LI Y, YANG J, SONG Y, et al. Learning from Noisy Labels with Distillation[J]. IEEE International Conference on Computer Vision, 2017, 10(1): 1928-1936.
[10] NETTLETON D F, PUIG A O, FORNELLS A. A study of the effect of different types of noise on the precision of supervised learning techniques[J]. Artificial Intelligence Review, 2010, 33(4): 275-306.
[11] WANG Y, LIU W, MA X, et al. Iterative Learning with Openset Noisy Labels[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018: 8688-8696.
[12] THULASIDASAN S, BHATTACHARYA T, BILMES J, et al. Combating Label Noise in Deep Learning Using Abstention[OL]. https://arxiv.org/abs/1905.10964.
[13] XIAO T, XIA T, YANG Y, et al. Learning from massive noisy labeled data for image classification[C]//IEEE Conference on Computer Vision and Pattern Recognition. Boston: IEEE, 2015: 2691-2699.
[14] MNIH V, HINTON G. Learning to Label Aerial Images from Noisy Data[C]//International Conference on Machine Learning. Edinburgh, Scotland: Omnipress, 2012: 203-210.
[15] SCOTT C. A Rate of Convergence for Mixture Proportion Estimation, with Application to Learning from Noisy Labels[C]//Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics. 2015: 838-846.
[16] LIU T, TAO D. Classification with Noisy Labels by Importance Reweighting[J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2014, 38(3): 447-461.
[17] NATARAJAN N, DHILLON I S, RAVIKUMAR P K, et al. Learning with Noisy Labels[C]//International Conference on Neural Information Processing Systems. Lake Tahoe, USA: Curran Associates Inc, 2013: 1196-1204.
[18] NORTHCUTT C G, WU T, CHUANG I L. Learning with Confident Examples: Rank Pruning for Robust Classification with Noisy Labels[OL]. https://arxiv.org/abs/1705.01936.
[19] ELKAN C, NOTO K. Learning classifiers from only positive and unlabeled data[C]//International Conference on Knowledge Discovery and Data Mining. Las Vegas: ACM, 2008: 213-220.