Computer Science (计算机科学), 2016, Vol. 43, Issue (12): 179-182. doi: 10.11896/j.issn.1002-137X.2016.12.032

• Data Mining •

Self-adaptation Classification for Incomplete Labeled Text Data Stream

ZHANG Yu-hong, CHEN Wei and HU Xue-gang

  1. School of Computer and Information, Hefei University of Technology, Hefei 230009
  • Online: 2018-12-01 Published: 2018-12-01
  • Funding:
    This work was supported by the Innovative Research Team Program of the Ministry of Education (IRT13059), the National Natural Science Foundation of China (61305063,2), and the Doctoral Program Foundation (20130111110011).

Abstract: Real-world applications such as network monitoring, online comments, and microblogs produce large volumes of text data streams. These streams are incompletely labeled and exhibit frequent concept drift, which challenges existing data stream classification methods. This paper therefore proposes a self-adaptive classification algorithm for incompletely labeled text data streams. The algorithm takes a labeled data chunk as the starting chunk. For each unlabeled data chunk, it first extracts the feature set between the labeled chunk and the unlabeled chunk, then uses the similarity of features across the two chunks to detect concept drift, and finally computes the polarity of the features in the unlabeled data to predict the instances. Experimental results show that the algorithm improves classification accuracy, especially when label information is scarce and concept drift is frequent.

Key words: Incomplete labeling, Self-adaptation, Data stream, Concept drift
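
The three steps named in the abstract (feature extraction across chunks, similarity-based drift detection, polarity-based prediction) can be illustrated with a minimal Python sketch. The abstract does not give the paper's formulas, so everything concrete here is an assumption made for illustration: documents are token lists, labels are +1/-1, the features "between" two chunks are taken to be their shared vocabulary, similarity is cosine similarity over document frequencies, polarity is a normalized difference of positive and negative document frequencies, and `drift_threshold` is a hypothetical parameter. It is not the authors' algorithm.

```python
from collections import Counter
from math import sqrt


def term_counts(docs):
    """Document frequency: number of documents in the chunk containing each term."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc))
    return counts


def chunk_similarity(counts_a, counts_b):
    """Cosine similarity of two chunks' document-frequency vectors (assumed measure).

    Only terms shared by both chunks contribute to the dot product.
    """
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = sqrt(sum(v * v for v in counts_a.values()))
    norm_b = sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def feature_polarity(docs, labels):
    """Assumed polarity in [-1, 1]: (pos_df - neg_df) / (pos_df + neg_df)."""
    pos = term_counts(d for d, y in zip(docs, labels) if y > 0)
    neg = term_counts(d for d, y in zip(docs, labels) if y < 0)
    return {t: (pos[t] - neg[t]) / (pos[t] + neg[t]) for t in set(pos) | set(neg)}


def predict(docs, polarity):
    """Label each document by the sign of its summed feature polarity (ties -> +1)."""
    return [
        1 if sum(polarity.get(t, 0.0) for t in set(doc)) >= 0 else -1
        for doc in docs
    ]


def classify_chunk(labeled_docs, labels, unlabeled_docs, drift_threshold=0.5):
    """Classify one unlabeled chunk against a labeled starting chunk.

    Returns (predictions, similarity, drift_detected).
    """
    polarity = feature_polarity(labeled_docs, labels)
    sim = chunk_similarity(term_counts(labeled_docs), term_counts(unlabeled_docs))
    drift = sim < drift_threshold  # hypothetical drift rule
    preds = predict(unlabeled_docs, polarity)
    if drift:
        # On drift, estimate polarity for terms unseen in the labeled chunk
        # from the first-pass predictions, keep labeled polarity for known
        # terms, then predict again.
        adapted = {**feature_polarity(unlabeled_docs, preds), **polarity}
        preds = predict(unlabeled_docs, adapted)
    return preds, sim, drift


labeled = [["good", "great", "phone"], ["bad", "poor", "phone"]]
labels = [1, -1]
unlabeled = [["great", "camera", "phone"], ["poor", "battery", "phone"]]
predictions, similarity, drift_detected = classify_chunk(labeled, labels, unlabeled)
```

In this toy run the two chunks share three of their five terms, so the similarity stays above the assumed threshold and no drift is flagged; lowering the overlap or raising `drift_threshold` triggers the adaptation branch.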

