Computer Science ›› 2020, Vol. 47 ›› Issue (6): 79-84. doi: 10.11896/jsjkx.190600041

• Database & Big Data & Data Science •

Noisy Label Classification Learning Based on Relabeling Method

YU Meng-chi, MU Jia-peng, CAI Jian, XU Jian

  1. School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
  • Received: 2019-06-11  Online: 2020-06-15  Published: 2020-06-10
  • Corresponding author: XU Jian (dolphin.xu@njust.edu.cn)
  • About author: YU Meng-chi, born in 1995, master, 2246782556@qq.com. His major research interests include ticket mining and its applications.
    XU Jian, Ph.D., professor. His main research interests include event mining, log mining and their applications to system management.
  • Supported by:
    National Natural Science Foundation of China (61872186, 61802205, 91846104)

Abstract: The integrity of sample labels has a significant impact on the classification accuracy of supervised learning. In real-world data, however, owing to the randomness of the labeling process and the limited expertise of annotators, labels are inevitably polluted by noise, i.e. the observed label of a sample differs from its true label. To reduce the negative impact of noisy labels on classification accuracy, this paper proposes a noisy label correction approach. It first applies a base classifier to the observed samples and estimates the noise rate in order to identify noisy-label data, and then uses the base classifier's predictions to relabel the identified noisy samples, yielding a dataset in which the noisy labels have been corrected. Experiments on synthetic and real datasets show that the relabeling algorithm improves classification results under different base classifiers and different noise rates. On the synthetic datasets its accuracy is about 5% higher than that of the base classifier without noise correction, while in the high-noise-rate setting on the CIFAR and MNIST datasets its F1 score is more than 7% higher than those of Elk08 and Nat13 on average and 53% higher than that of the base classifier.

Key words: Logistic regression, Naive Bayes, Noisy label learning, Label relabeling
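
The correction procedure described in the abstract (classify the observed samples with a base classifier, estimate the noise rate, flag the suspicious samples, then relabel them with the classifier's prediction) can be sketched roughly as the following Python fragment. This is a minimal illustration under stated assumptions rather than the authors' implementation: the helper relabel_noisy is hypothetical, scikit-learn's LogisticRegression stands in for the base classifier, the noise rate is supplied by the caller instead of being estimated, and a simple least-confidence rule selects which samples to relabel.

# Minimal sketch, assuming integer class labels and a caller-supplied noise rate.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def relabel_noisy(X, y_observed, noise_rate, base_clf=None):
    """Return a corrected copy of y_observed with suspected noisy labels replaced."""
    clf = base_clf if base_clf is not None else LogisticRegression(max_iter=1000)
    # Out-of-fold class probabilities, so each sample is scored by a model
    # that never trained on its (possibly noisy) label.
    proba = cross_val_predict(clf, X, y_observed, cv=5, method="predict_proba")
    classes = np.unique(y_observed)                      # column order of proba
    col_of = {c: i for i, c in enumerate(classes)}
    observed_col = np.array([col_of[y] for y in y_observed])
    conf_in_observed = proba[np.arange(len(y_observed)), observed_col]
    # Flag the least-confident noise_rate fraction of samples as noisy ...
    n_noisy = int(round(noise_rate * len(y_observed)))
    noisy_idx = np.argsort(conf_in_observed)[:n_noisy]
    # ... and relabel them with the base classifier's most probable class.
    y_corrected = np.asarray(y_observed).copy()
    y_corrected[noisy_idx] = classes[proba[noisy_idx].argmax(axis=1)]
    return y_corrected

Using out-of-fold predictions keeps the classifier from simply reproducing labels it has already memorized, which matters when the training labels themselves are noisy; any estimator exposing predict_proba (e.g. naive Bayes) could be passed as base_clf.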

CLC Number: TP301

References
[1]MIRYLENKA K,GIANNAKOPOULOS G,DO L M,et al.On classifier behavior in the presence of mislabeling noise[J].Data Mining and Knowledge Discovery,2017,31(3):661-701.
[2]KHETAN A,LIPTON Z C,ANANDKUMAR A.Learning From Noisy Singly-labeled Data[OL].https://arxiv.org/abs/1712.04577.
[3]FRENAY B,VERLEYSEN M.Classification in the Presence of Label Noise:A Survey [J].IEEE Transactions on Neural Networks and Learning Systems,2014,25(5):845-869.
[4]NICHOLSON B,SHENG V S,ZHANG J,et al.Label Noise Correction Methods [C]//IEEE International Conference on Data Science and Advanced Analytics.Shanghai:IEEE,2015:1-9.
[5]QI Z A.Learning from Limited and Imperfect Tagging[D]. Hangzhou:Zhejiang University,2013.
[6]LIU M J,WANG X F.Data Preprocessing in Data Mining[J].Computer Science,2000,27(4):54-57.
[7]LI J,WONG Y,ZHAO Q,et al.Learning to Learn from Noisy Labeled Data[OL].https://arxiv.org/abs/1812.05214.
[8]MANWANI N,SASTRY P S.Noise tolerance under risk minimization[J].IEEE Transactions on Cybernetics,2013,43(3):1146-1151.
[9]LI Y,YANG J,SONG Y,et al.Learning from Noisy Labels with Distillation[C]//IEEE International Conference on Computer Vision.2017:1928-1936.
[10]NETTLETON D F,PUIG A O,FORNELLS A.A study of the effect of different types of noise on the precision of supervised learning techniques[J].Artificial Intelligence Review,2010,33(4):275-306.
[11]WANG Y,LIU W,MA X,et al.Iterative Learning with Openset Noisy Labels[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.2018:8688-8696.
[12]THULASIDASAN S,BHATTACHARYA T,BILMES J,et al.Combating Label Noise in Deep Learning Using Abstention[OL].https://arxiv.org/abs/1905.10964.
[13]XIAO T,XIA T,YANG Y,et al.Learning from massive noisy labeled data for image classification[C]//IEEE Conference on Computer Vision and Pattern Recognition.Boston:IEEE,2015:2691-2699.
[14]MNIH V,HINTON G.Learning to Label Aerial Images from Noisy Data[C]//International Conference on Machine Learning.Edinburgh,Scotland:Omnipress,2012:203-210.
[15]SCOTT C.A Rate of Convergence for Mixture Proportion Estimation,with Application to Learning from Noisy Labels[C]//Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics.2015:838-846.
[16]LIU T,TAO D.Classification with Noisy Labels by Importance Reweighting[J].IEEE Transactions on Pattern Analysis & Machine Intelligence,2014,38(3):447-461.
[17]NATARAJAN N,DHILLON I S,RAVIKUMAR P K,et al.Learning with Noisy Labels[C]//International Conference on Neural Information Processing Systems.Lake Tahoe,USA:Curran Associates Inc.,2013:1196-1204.
[18]NORTHCUTT C G,WU T,CHUANG I L.Learning with Confident Examples:Rank Pruning for Robust Classification with Noisy Labels[OL].https://arxiv.org/abs/1705.01936.
[19]ELKAN C,NOTO K.Learning classifiers from only positive and unlabeled data[C]//International Conference on Knowledge Discovery and Data Mining.Las Vegas:ACM,2008:213-220.