Computer Science (计算机科学) ›› 2014, Vol. 41 ›› Issue (5): 227-229. doi: 10.11896/j.issn.1002-137X.2014.05.047

• Software and Database Technology •

Weighted Bayes Based Data Streaming Online Classification Algorithm

LU Hui-lin

  1. School of Computer Science, Harbin Institute of Technology, Harbin 150001; Jiangsu Wireless Sensing System Application Technology R&D Center, Wuxi 214153
  • Online: 2018-11-14 Published: 2018-11-14
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61170121)

Abstract: Traditional classification algorithms require the entire training dataset before the model can be trained. In a big data setting, however, data arrive continuously as a stream, so the full training dataset cannot be obtained in advance. This paper studies the online classification of noisy streaming data in a big data setting. It formulates online classification of streaming data as an optimization problem, proposes a weighted Naive Bayes classifier and an Error Adaptive classifier, and validates the proposed algorithms on two real datasets. The experimental results show that, on noise-free streams, the Error Adaptive classifier predicts more accurately than related algorithms. When the stream contains noise, the Error Adaptive classifier is insensitive to it and still predicts accurately, so it can be applied to online classification of streaming data in big data settings.

Key words: Big data, Decision tree, Classification algorithm, Data streaming
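
The abstract does not give the weighting rule or the optimization objective, so the following is only a minimal sketch of the general idea of training a Naive Bayes model one example at a time on a stream. Everything in it is an assumption made for illustration: the class name `OnlineWeightedNaiveBayes`, the per-example `weight` argument, the Laplace smoothing constant `alpha`, the restriction to categorical features, and the test-then-train evaluation loop `prequential_accuracy`. It is not the algorithm proposed in the paper.

```python
import math
from collections import defaultdict


class OnlineWeightedNaiveBayes:
    """Incremental Naive Bayes over categorical features (illustrative only).

    Each training example carries a positive weight that scales its
    contribution to the class and feature counts. How the weights are
    chosen is left to the caller; the paper's rule is not reproduced here.
    """

    def __init__(self, alpha=1.0):
        self.alpha = alpha                        # Laplace smoothing (assumed)
        self.class_weight = defaultdict(float)    # label -> accumulated weight
        self.feature_weight = defaultdict(float)  # (label, index, value) -> weight
        self.values_seen = defaultdict(set)       # index -> distinct values seen

    def update(self, x, y, weight=1.0):
        """Absorb one labelled example without storing it."""
        if weight <= 0:
            raise ValueError("weight must be positive")
        self.class_weight[y] += weight
        for i, v in enumerate(x):
            self.feature_weight[(y, i, v)] += weight
            self.values_seen[i].add(v)

    def predict(self, x):
        """Return the label with the highest log-posterior, or None if untrained."""
        if not self.class_weight:
            return None
        total = sum(self.class_weight.values())
        best_label, best_score = None, -math.inf
        for label, cw in self.class_weight.items():
            score = math.log(cw / total)  # log prior
            for i, v in enumerate(x):
                # Count the queried value as a possible value even if unseen.
                n_values = len(self.values_seen[i] | {v})
                num = self.feature_weight.get((label, i, v), 0.0) + self.alpha
                den = cw + self.alpha * n_values
                score += math.log(num / den)  # smoothed log likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def prequential_accuracy(stream, model):
    """Test-then-train: predict each example first, then learn from it."""
    correct = seen = 0
    for x, y in stream:
        pred = model.predict(x)
        if pred is not None:  # skip scoring before the first update
            seen += 1
            correct += int(pred == y)
        model.update(x, y)
    return correct / seen if seen else 0.0


# Toy stream of (features, label) pairs, invented for this example.
stream = [
    (("sunny", "hot"), "no"),
    (("rainy", "cool"), "yes"),
    (("sunny", "hot"), "no"),
    (("rainy", "cool"), "yes"),
]
model = OnlineWeightedNaiveBayes()
accuracy = prequential_accuracy(stream, model)
```

Because the model keeps only running counts, memory use depends on the number of distinct labels and feature values rather than on the length of the stream, which is the property that makes this family of classifiers usable when the full training set is never available at once.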

