Computer Science ›› 2015, Vol. 42 ›› Issue (8): 44-47.


MapReduce Based Feature Selection Parallelization

LU Jiang and LI Yun   

Online: 2018-11-14    Published: 2018-11-14

Abstract: Feature selection has become a necessary preprocessing step for high-dimensional data. With the explosive growth of data, traditional feature selection algorithms can no longer meet the demands of processing large-scale, high-dimensional data. Drawing on Google's MapReduce programming model, we designed a distributed local-learning-based feature selection algorithm, D-logsf. Experiments were conducted on several real and synthetic data sets. The results show that D-logsf is correct and reliable, obtains approximately linear speedup over the traditional feature selection algorithm Logsf, and can effectively handle large-scale data sets.

Key words: Feature selection, Local learning, Distributed, MapReduce
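The MapReduce decomposition the abstract describes can be sketched as follows. Since the paper's local-learning objective (logsf) is not reproduced on this page, a Relief-style margin score (nearest-miss distance minus nearest-hit distance) stands in for it, and plain Python functions simulate the map and reduce phases instead of a Hadoop job; names such as `map_partition` and `select_top_k` are illustrative, not from the paper.

```python
from collections import defaultdict

def l1_dist(a, b):
    """Manhattan distance between two feature vectors."""
    return sum(abs(u - v) for u, v in zip(a, b))

def map_partition(partition):
    """Mapper: score each feature on one data split using a Relief-style
    margin (distance to nearest miss minus distance to nearest hit).
    This stands in for the paper's local-learning objective (assumption)."""
    scores = defaultdict(float)
    for x, y in partition:
        hits = [(l1_dist(x, z), z) for z, yz in partition if yz == y and z is not x]
        misses = [(l1_dist(x, z), z) for z, yz in partition if yz != y]
        if not hits or not misses:
            continue
        hit, miss = min(hits)[1], min(misses)[1]
        for j in range(len(x)):
            scores[j] += abs(x[j] - miss[j]) - abs(x[j] - hit[j])
    return list(scores.items())

def reduce_scores(mapped_outputs):
    """Reducer: sum per-partition feature scores keyed by feature index."""
    total = defaultdict(float)
    for pairs in mapped_outputs:
        for j, s in pairs:
            total[j] += s
    return total

def select_top_k(splits, k):
    """Driver: run a mapper over each split, reduce, return top-k feature indices."""
    total = reduce_scores(map_partition(p) for p in splits)
    return sorted(total, key=total.get, reverse=True)[:k]

# Toy data in two splits: feature 0 separates the classes, feature 1 is noise.
demo_splits = [
    [([0.0, 0.5], 0), ([0.1, 0.9], 0), ([1.0, 0.1], 1), ([0.9, 0.6], 1)],
    [([0.05, 0.3], 0), ([0.15, 0.8], 0), ([0.95, 0.2], 1), ([1.05, 0.7], 1)],
]
print(select_top_k(demo_splits, 1))  # → [0], the informative feature
```

Because each mapper touches only its own split and the reducer is a plain sum, the score computation parallelizes across splits, which is the source of the near-linear speedup the abstract reports.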

