Computer Science, 2018, Vol. 45, Issue (11A): 468-473.

• Big Data & Data Mining •

Multi-level Feature Selection Mechanism Based on MapReduce

SONG Zhe-li¹, WANG Chao², WANG Zhen-fei³

  1. Zhengzhou Vocational College of Finance and Taxation, Zhengzhou 450048, China
  2. The 713th Research Institute of China Shipbuilding Industry Corporation, Zhengzhou 450015, China
  3. School of Information Engineering, Zhengzhou University, Zhengzhou 450001, China
  • Online: 2019-02-26   Published: 2019-02-26

Abstract: Feature selection is a key step in text classification, and classification accuracy depends largely on the quality of the selected feature words. This paper proposes a multi-level feature selection mechanism based on MapReduce. First, the mechanism screens the original dataset with an improved CHI (chi-square) feature selection algorithm, then applies the mutual information method to filter out noise words and bring high-quality feature words to the front. Second, running the multi-level feature selection within the MapReduce model reduces its time consumption. Experimental results show that the mechanism improves both classification accuracy and running time on big data problems.
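
As a rough illustration of the two-level idea described in the abstract, the sketch below ranks terms by a chi-square score and then filters the survivors by mutual information. It is a minimal sketch under stated assumptions, not the authors' method: it uses the textbook CHI statistic rather than the paper's improved variant, pointwise mutual information as the second-level score, and Python's built-in map and reduce to imitate the MapReduce counting phase on a single machine. All function names, the `top_k` and `mi_threshold` parameters, and the toy corpus are hypothetical.

```python
import math
from collections import Counter
from functools import reduce

DOC = "__DOC__"  # hypothetical sentinel key holding per-class document counts


def map_doc(doc):
    """Map step: emit (term, label) document-frequency counts for one document."""
    label, tokens = doc
    counts = Counter((term, label) for term in set(tokens))
    counts[(DOC, label)] += 1
    return counts


def merge(a, b):
    """Reduce step: sum the partial counts from two mappers."""
    return a + b


def table(counts, term, label):
    """Contingency counts: a = in class with term, b = other classes with term,
    c = in class without term, d = other classes without term, n = all docs."""
    n = sum(v for (t, _), v in counts.items() if t == DOC)
    in_class = counts[(DOC, label)]
    with_term = sum(v for (t, _), v in counts.items() if t == term)
    a = counts[(term, label)]
    b = with_term - a
    c = in_class - a
    d = n - a - b - c
    return a, b, c, d, n


def chi_square(a, b, c, d, n):
    """Textbook CHI statistic (assumed here; the paper uses an improved form)."""
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def mutual_info(a, b, c, d, n):
    """Pointwise mutual information between a term and a class."""
    if a == 0:
        return float("-inf")
    return math.log(a * n / ((a + c) * (a + b)))


def select_features(docs, top_k, mi_threshold):
    counts = reduce(merge, map(map_doc, docs), Counter())
    labels = sorted({l for (t, l) in counts if t == DOC})
    terms = sorted({t for (t, _) in counts if t != DOC})

    def best(score, term):
        # Score a term by its strongest association with any class.
        return max(score(*table(counts, term, l)) for l in labels)

    # Level 1: keep the top_k terms by CHI (ties broken alphabetically).
    stage1 = sorted(terms, key=lambda t: (-best(chi_square, t), t))[:top_k]
    # Level 2: drop terms whose mutual information falls below the threshold.
    return [t for t in stage1 if best(mutual_info, t) >= mi_threshold]


docs = [
    ("sport", ["goal", "match", "the"]),
    ("sport", ["goal", "team", "the"]),
    ("tech", ["code", "bug", "the"]),
    ("tech", ["code", "chip", "the"]),
]
selected = select_features(docs, top_k=2, mi_threshold=0.5)
```

In this toy corpus the word "the" occurs in every document, so its CHI score is zero and it never reaches the second level. On a real cluster, the counting in `map_doc` and `merge` would correspond to the mapper and reducer of a MapReduce job, and only the aggregated counts would be needed for scoring.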

Key words: CHI, Feature selection, MapReduce, Mutual information, Text classification

CLC Number: TP301