计算机科学 ›› 2016, Vol. 43 ›› Issue (Z6): 374-379.doi: 10.11896/j.issn.1002-137X.2016.6A.089

• 数据挖掘 • 上一篇    下一篇

面向大数据分析的决策树算法

张棪,曹健   

  1. 上海交通大学计算机科学与工程系 上海200240,上海交通大学计算机科学与工程系 上海200240
  • 出版日期:2018-11-14 发布日期:2018-11-14
  • 基金资助:
    本文受国家自然科学基金(61272438,61472253),上海市科委项目(14511107702,15411952502)资助

Decision Tree Algorithms for Big Data Analysis

ZHANG Yan and CAO Jian   

  • Online:2018-11-14 Published:2018-11-14

摘要: 决策树作为机器学习中的一个预测模型,因其输出结果易于理解和解释,而被广泛应用于各个领域,成为了学术界研究的热点。随着数据产生速度的剧增,由于内存容量和处理器速度等限制,常规的决策树算法无法对大数据集进行处理,因此需要对决策树算法的实现进行针对性的处理。首先阐述了决策树的基本算法和优化方法,在此基础上结合大数据带来的挑战,分类比较了各类针对性算法的优缺点,并介绍了支撑这些算法运行的平台。最后讨论了面向大数据的决策树算法的未来发展方向。

关键词: 决策树,大数据,机器学习

Abstract: As a predictive model in machine learning,decision tree algorithm is widely used in various fields and has been a hot research topic due to its easily understandable result.Since the speed of data generation increases explosively,conventional decision tree algorithms cannot deal well with large datasets due to the limitation of memory capacity and processor speed,and thus novel implementation is needed urgently.Based on classic implementation and optimization method of decision tree,and challenges brought by big data,we compared various kinds of corresponding algorithms.Platforms for big data algorithms were then introduced and possible directions in the future were discussed in the end.

Key words: Decision tree,Big data,Machine learning

[1] Quinlan J R.Induction of decision trees[J].Machine Learning,1986,1(1):81-106
[2] Seni G,Elder J F.Ensemble methods in data mining:improving accuracy through combining predictions[J].Synthesis Lectures on Data Mining and Knowledge Discovery,2010,2(1):1-126
[3] Breiman L,Friedman J,Stone C J,et al.Classification and regression trees[M].CRC Press,1984
[4] Shannon C E.A note on the concept of entropy[J].Bell System Tech.J,1948,27:379-423
[5] Quinlan J R.C4.5:Programming for machine learning[M].Morgan Kauffmann,1993
[6] Ferri C,Flach P,Hernández-Orallo J.Learning decision trees using the area under the ROC curve[C]∥ICML.2002:139-146
[7] Ling C X,Yang Q,Wang J,et al.Decision trees with minimal costs[C]∥Proceedings of the Twenty-first International Conference on Machine Learning.ACM,2004:4-8
[8] Lomax S,Vadera S.A survey of cost-sensitive decision tree induction algorithms[J].ACM Computing Surveys (CSUR),2013,45(2):227-268
[9] Moret S L,Langford W T,Margineantu D D.Learning to predict channel stability using biogeomorphic features[J].Ecological Modelling,2006,191(1):47-57
[10] Kretowski M,Grzes M.Evolutionary induction of mixed decision trees[J].International Journal of Data Warehousing and Mi-ning,2007,3(4):68-82
[11] Fan W,Stolfo S J,Zhang J,et al.AdaCost:misclassification cost-sensitive boosting[C]∥ICML.1999:97-105
[12] Murthy S K,Kasif S,Salzberg S.A system for induction of oblique decision trees[J].Journal of Artificial Intelligence Research,1994,2(1):1-32
[13] Barros R C,Cerri R,Jaskowiak P,et al.A bottom-up oblique decision tree induction algorithm[C]∥2011 11th International Conference on Intelligent Systems Design and Applications (ISDA).IEEE,2011:450-456
[14] Wickramarachchi D C,Robertson B L,Reale M,et al.HH-CART:An Oblique Decision Tree[J].arXiv preprint arXiv:1504.03415,2015
[15] Mingers J.An empirical comparison of pruning methods for decision tree induction[J].Machine learning,1989,4(2):227-243
[16] Mingers J.An empirical comparison of selection measures fordecision-tree induction[J].Machine Learning,1989,3(4):319-342
[17] Elouedi Z,Mellouli K,Smets P.A pre-pruning method in belief decision trees[C]∥The Ninth International Conference on Information Processing and Management of Uncertainty in Know-ledge-Based Systems IPMU.2002,1:579-586
[18] Trabelsi S,Elouedi Z,Mellouli K.Pruning belief decision tree methods in averaging and conjunctive approaches[J].International Journal of Approximate Reasoning,2007,46(3):568-595
[19] Quinlan J R.Simplifying decision trees[J].International Journal of Man-machine sSudies,1987,27(3):221-234
[20] Quinlan J R,Cameron-Jones R M.FOIL:A midterm report[C]∥Machine Learning:ECML-93.Springer Berlin Heidelberg,1993:1-20
[21] Breslow L A,Aha D W.Simplifying decision trees:A survey[J].The Knowledge Engineering Review,1997,12(1):1-40
[22] Cestnik B,Bratko I.On estimating probabilities in tree pruning[C]∥Machine Learning-EWSL-91.Springer Berlin Heidelberg,1991:138-150
[23] Breiman L.Bagging predictors[J].Machine Learning,1996,24(2):123-140
[24] Domingos P.Metacost:A general method for making classifiers cost-sensitive[C]∥Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mi-ning.ACM,1999:155-164
[25] Lin F Y,McClean S.The Prediction of financial distress using a cost sensitive approach and prior probabilities[C]∥Proceedings of the 17th International Conference on Machine Learning (ICML’00).2000
[26] Breiman L.Random forests[J].Machine Learning,2001,45(1):5-32
[27] Freund Y,Schapire R E.A decision-theoretic generalization of on-line learning and an application to boosting[J].Journal of Computer and System Sciences,1997,55(1):119-139
[28] Friedman J H.Greedy function approximation:a gradient boosting machine[J].Annals of Statistics,2001,29(5):1189-1232
[29] Freund Y,Mason L.The alternating decision tree learning algorithm[C]∥ICML.1999,99:124-133
[30] Gligorov V V,Williams M.Efficient,reliable and fast high-level triggering using a bonsai boosted decision tree[J].Journal of Instrumentation,2013,8(2):6
[31] Ayaru L,Ypsilantis P P,Nanapragasam A,et al.Prediction of Outcome in Acute Lower Gastrointestinal Bleeding Using Gradie-nt Boosting[J].PloS one,2015,10(7):e0132485
[32] Lombardo L,Cama M,Conoscenti C,et al.Binary logistic regression versus stochastic gradient boosted decision trees in assessing landslide susceptibility for multiple-occurring landslide events:application to the 2009 storm event in Messina (Sicily,southern Italy)[J].Natural Hazards,2015,79(3):1-28
[33] He X,Pan J,Jin O,et al.Practical lessons from predicting clicks on ads at facebook[C]∥Proceedings of 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.ACM,2014:1-9
[34] Haralick R M.The table look-up rule[J].Communications in Statistics-Theory and Methods,1976,5(12):1163-1191
[35] Hartmann C R P,Varshney P K,Mehrotra K G,et al.Application of information theory to the construction of efficient decision trees[J].IEEE Transactions on Information Theory,1982,28(4):565-577
[36] Knuth D E.Optimum binary search trees[J].Acta Informatica,1971,1(1):14-25
[37] Zhang S,Qin Z,Ling C X,et al.Missing is useful:missing values in cost-sensitive decision trees[J].IEEE Transactions on Knowledge and Data Engineering,2005,17(12):1689-1693
[38] Gu Q,Zhang Y,Cao J,et al.A confidence-based entity resolution approach with incomplete information[C]∥2014 International Conference on Data Science and Advanced Analytics (DSAA).IEEE,2014:97-103
[39] Utgoff P E,Berkman N C,Clouse J A.Decision tree induction based on efficient tree restructuring[J].Machine Learning,1997,29(1):5-44
[40] Gama J,Fernandes R,Rocha R.Decision trees for mining data streams[J].Intelligent Data Analysis,2006,10(1):23-45
[41] Buntine W,Niblett T.A further comparison of splitting rules for decision-tree induction[J].Machine Learning,1992,8(1):75-85
[42] Dietterich T G.An experimental comparison of three methods for constructing ensembles of decision trees:Bagging,boosting,and randomization[J].Machine Learning,2000,40(2):139-157
[43] Dietterich T G.Ensemble methods in machine learning[M]∥Multiple Classifier Systems.Springer Berlin Heidelberg,2000:1-15
[44] Opitz D,Maclin R.Popular ensemble methods:An empiricalstudy[J].Journal of Artificial Intelligence Research,1999:169-198
[45] Panda B,Herbach J S,Basu S,et al.Planet:massively parallellearning of tree ensembles with mapreduce[J].Proceedings of the VLDB Endowment,2009,2(2):1426-1437
[46] Yin W,Simmhan Y,Prasanna V K.Scalable regression treelearning on Hadoop using OpenPlanet[C]∥Proceedings of third international workshop on MapReduce and its Applications Date.ACM,2012:57-64
[47] Ye J,Chow J H,Chen J,et al.Stochastic gradient boosted distributed decision trees[C]∥Proceedings of the 18th ACM Conference on Information and Knowledge Management.ACM,2009:2061-2064
[48] Gropp W,Lusk E,Doss N,et al.A high-performance,portable implementation of the MPI message passing interface standard[J].Parallel Computing,1996,22(6):789-828
[49] Dean J,Ghemawat S.MapReduce:simplified data processing on large clusters[J].Communications of the ACM,2008,51(1):107-113
[50] Mehta M,Agrawal R,Rissanen J.SLIQ:A fast scalable classi-fier for data mining[M]∥Advances in Database Technology(EDBT’96).Springer Berlin Heidelberg,1996:18-32
[51] Bradford J P,Fortes J A B.Characterization and parallelization of decision-tree induction[J].Journal of Parallel and Distributed Computing,2001,61(3):322-349
[52] Ranka S,Singh V.CLOUDS:A decision tree classifier for large datasets[J].Knowledge Discovery and Data Mining,1998:2-8
[53] Shafer J,Agrawal R,Mehta M.SPRINT:A scalable parallelclassifier for data mining[C]∥Proc.1996 Int.Conf.Very Large Data Bases.1996:544-555
[54] Gehrke J,Ramakrishnan R,Ganti V.RainForest-A framework for fast decision tree construction of large datasets[C]∥VLDB.1998,98:416-427
[55] Gehrke J,Ganti V,Ramakrishnan R,et al.BOAT—optimistic decision tree construction[C]∥ACM SIGMOD Record.ACM,1999,28(2):169-180
[56] Domingos P,Hulten G.Mining high-speed data streams[C]∥Proceedings of the Sixth ACM SIGKDD International Confe-rence on Knowledge Discovery and Data Mining.ACM,2000:71-80
[57] Hulten G,Spencer L,Domingos P.Mining time-changing datastreams[C]∥Proceedings of the seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.ACM,2001:97-106
[58] Gama J,Rocha R,Medas P.Accurate decision trees for mining high-speed data streams[C]∥Proceedings of the ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.ACM,2003:523-528
[59] Bifet A,Holmes G,Kirkby R,et al.Moa:Massive online analysis[J].The Journal of Machine Learning Research,2010,11:1601-1604
[60] Neumeyer L,Robbins B,Nair A,et al.S4:Distributed stream computing platform[C]∥2010 IEEE International Conference on Data Mining Workshops (ICDMW).IEEE,2010:170-177
[61] Wojnarski M.Debellor:a data mining platform with stream architecture[M]∥Transactions on Rough Sets IX.Springer Berlin Heidelberg,2008:405-427
[62] Fan C Y,Chang P C,Lin J J,et al.A hybrid model combining case-based reasoning and fuzzy decision tree for medical data classification[J].Applied Soft Computing,2011,11(1):632-644
[63] Wu M C,Lin S Y,Lin C H.An effective application of decision tree to stock trading[J].Expert Systems with Applications,2006,31(2):270-274
[64] Kim J W,Lee B H,Shaw M J,et al.Application of decision-tree induction techniques to personalized advertisements on internet storefronts[J].International Journal of Electronic Commerce,2001,5(3):45-62
[65] Zeitouni K,Chelghoum N.Spatial decision tree-application totraffic risk analysis[C]∥ACS/IEEE International Conference on Computer Systems and Applications.IEEE,2001:203-207

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!