Computer Science ›› 2022, Vol. 49 ›› Issue (6): 127-133. doi: 10.11896/jsjkx.211100043

• Database & Big Data & Data Science •

Tri-training Algorithm Based on DECORATE Ensemble Learning and Credibility Assessment

WANG Yu-fei, CHEN Wen   

  1. School of Cyber Science and Engineering, Sichuan University, Chengdu 610065, China
  • Received: 2021-11-03  Revised: 2022-03-02  Online: 2022-06-15  Published: 2022-06-08
  • Corresponding author: CHEN Wen (wenchen@scu.edu.cn)
  • About author: WANG Yu-fei (wangyufei079@foxmail.com), born in 1996, postgraduate. His main research interests include semi-supervised learning, cyber security and data mining.
    CHEN Wen, born in 1983, Ph.D, associate professor, master supervisor, is a member of China Computer Federation. His main research interests include network security, information hiding and data mining.
  • Supported by:
    National Key Research and Development Program of China (2020YFB1805405, 2019QY0800), National Natural Science Foundation of China (U1736212, 61872255, U19A2068) and Key Laboratory of Pattern Recognition and Intelligent Information Processing, Institutions of Higher Education of Sichuan Province (MSSB-2020-01).

Abstract: Tri-training is a disagreement-based semi-supervised learning algorithm in which both semi-supervised learning and ensemble learning mechanisms are applied. It improves model performance by effectively leveraging a small number of labeled samples together with a large amount of unlabeled ones through collaboration and iteration among the base classifiers. However, when the labeled sample size is insufficient, the initial classifiers generated by Tri-training are under-trained, and mislabeled noisy data may be produced during the collaborative labeling process among the classifiers. To address these problems, a collaborative learning algorithm is proposed that combines DECORATE ensemble learning, diversity measurement and credibility assessment. In this method, to improve generalization performance, base classifiers with different preferences are generated by DECORATE from differentiated artificial samples and labels, and the classifiers are measured and selected by Jensen-Shannon divergence to maximize ensemble diversity. Meanwhile, the credibility of pseudo-labeled samples is assessed during the iterations by a label propagation algorithm to reduce noisy data. Classification experiments on UCI data sets show that the proposed algorithm achieves higher accuracy and F1-score than the Tri-training algorithm and its improved versions.
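
For illustration only, the following is a minimal Python sketch, not the authors' implementation, of two ideas described in the abstract: measuring pairwise classifier diversity with the Jensen-Shannon divergence of predicted class distributions on unlabeled data, and scoring the credibility of pseudo-labeled candidates with label propagation. All function names, parameters and kernel choices (js_divergence, pairwise_diversity, pseudo_label_credibility, the kNN kernel, etc.) are assumptions, not details taken from the paper.

```python
# Hedged sketch of JS-divergence diversity measurement and label-propagation
# credibility assessment; names and settings are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.semi_supervised import LabelPropagation


def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p, q = np.asarray(p, dtype=float) + eps, np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def pairwise_diversity(clf_a, clf_b, X_unlabeled):
    """Average JS divergence between the two classifiers' class-probability
    outputs over the unlabeled pool (larger means more diverse)."""
    pa = clf_a.predict_proba(X_unlabeled)
    pb = clf_b.predict_proba(X_unlabeled)
    return float(np.mean([js_divergence(a, b) for a, b in zip(pa, pb)]))


def pseudo_label_credibility(X_labeled, y_labeled, X_candidate):
    """Propagate labels over labeled + candidate samples and return the
    propagated class probabilities of the candidates as credibility scores."""
    X = np.vstack([X_labeled, X_candidate])
    y = np.concatenate([y_labeled, -np.ones(len(X_candidate), dtype=int)])
    lp = LabelPropagation(kernel="knn", n_neighbors=7).fit(X, y)
    return lp.predict_proba(X_candidate)  # one row per candidate sample


if __name__ == "__main__":
    X, y = make_classification(n_samples=300, n_features=10, random_state=0)
    X_l, y_l, X_u = X[:30], y[:30], X[30:]   # few labeled, many unlabeled
    # Two base classifiers trained on bootstrap samples of the labeled set.
    rng = np.random.default_rng(0)
    clfs = []
    for seed in (1, 2):
        idx = rng.integers(0, len(X_l), len(X_l))
        clfs.append(DecisionTreeClassifier(random_state=seed).fit(X_l[idx], y_l[idx]))
    print("pairwise JS diversity:", pairwise_diversity(clfs[0], clfs[1], X_u))
    cred = pseudo_label_credibility(X_l, y_l, X_u[:5])
    print("credibility of first 5 pseudo-label candidates:\n", cred.round(3))
```

In this sketch, candidates whose propagated class probabilities are low or disagree with the label assigned by the collaborating classifiers could be treated as likely noise and excluded from the next training round.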

Key words: Credibility assessment, Disagreement-based semi-supervised learning, Diversity measure, Ensemble learning

CLC Number:

  • TP181