Computer Science ›› 2018, Vol. 45 ›› Issue (11A): 468-473.

• Big Data & Data Mining •

Multi-level Feature Selection Mechanism Based on MapReduce

SONG Zhe-li¹, WANG Chao², WANG Zhen-fei³

  1. Zhengzhou Vocational College of Finance and Taxation, Zhengzhou 450048, China
  2. The 713th Research Institute of China Shipbuilding Industry Corporation, Zhengzhou 450015, China
  3. School of Information Engineering, Zhengzhou University, Zhengzhou 450001, China
  • Online: 2019-02-26 Published: 2019-02-26
  • Corresponding author: WANG Chao (born 1988), male, M.S., engineer. His main research interest is machine learning. E-mail: 854909839@qq.com
  • About the authors: SONG Zhe-li (born 1983), female, M.S., lecturer; her main research interest is computer applications. WANG Zhen-fei (born 1973), male, Ph.D., associate professor, CCF member; his main research interests include social networks and big data analysis.
  • Funding: This work was supported by the National Natural Science Foundation of China (61379079).

Abstract: Feature selection is a key step in text classification, and classification accuracy depends mainly on the quality of the selected feature words. This paper proposes a multi-level feature selection mechanism based on MapReduce. On the one hand, the mechanism performs an initial screening with an improved CHI feature selection algorithm, then applies the mutual information method to the initial result to filter out noise words and move high-quality feature words to the front. On the other hand, the mechanism is loaded into the MapReduce model to reduce the time that multi-level feature selection takes on massive data. Experimental results show that the mechanism can process large-scale data in a relatively short time while also improving text classification accuracy.

Key words: CHI, Feature selection, MapReduce, Mutual information, Text classification

CLC Number: TP301
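
The two-stage pipeline described in the abstract can be illustrated with a minimal sketch. Several assumptions apply: the abstract does not specify the paper's improved CHI variant, so the standard chi-square statistic stands in for it; mutual information is taken in its common pointwise term/class form; and the map and reduce steps run in a single process rather than on a Hadoop cluster. All function names, the `DOC_KEY` marker, and the `mi_threshold` parameter are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter
from itertools import chain

DOC_KEY = "__docs__"  # hypothetical reserved key that counts documents per class


def map_doc(doc):
    """Map step: emit ((term, label), 1) once per distinct term in a document."""
    label, text = doc
    pairs = [((DOC_KEY, label), 1)]
    pairs.extend(((term, label), 1) for term in set(text.split()))
    return pairs


def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each (term, label) key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


def contingency(term, label, counts, labels):
    """Document counts of the 2x2 term/class table: a, b, c, d."""
    n_class = counts[(DOC_KEY, label)]
    n_total = sum(counts[(DOC_KEY, other)] for other in labels)
    a = counts[(term, label)]                               # in class, has term
    b = sum(counts[(term, other)] for other in labels) - a  # other classes, has term
    c = n_class - a                                         # in class, lacks term
    d = n_total - n_class - b                               # other classes, lacks term
    return a, b, c, d


def chi_square(a, b, c, d):
    """Standard chi-square statistic (an assumption, not the paper's improved CHI)."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def mutual_information(a, b, c, d):
    """Pointwise mutual information between a term and a class."""
    if a == 0:
        return float("-inf")
    return math.log(a * (a + b + c + d) / ((a + c) * (a + b)))


def select_features(docs, top_k, mi_threshold=0.0):
    counts = reduce_counts(chain.from_iterable(map_doc(d) for d in docs))
    labels = sorted({label for label, _ in docs})
    terms = sorted({term for term, _ in counts if term != DOC_KEY})

    def best(score_fn, term):
        # Score a term by its strongest association with any single class.
        return max(score_fn(*contingency(term, lab, counts, labels)) for lab in labels)

    # Stage 1: initial screening, keep the top_k terms by chi-square.
    screened = sorted(terms, key=lambda t: (-best(chi_square, t), t))[:top_k]
    # Stage 2: drop low-MI noise words and move high-MI words to the front.
    scored = [(best(mutual_information, t), t) for t in screened]
    kept = [(mi, t) for mi, t in scored if mi > mi_threshold]
    return [t for mi, t in sorted(kept, key=lambda p: (-p[0], p[1]))]


docs = [
    ("sports", "goal match team the"),
    ("sports", "team score goal the"),
    ("tech", "code bug compile the"),
    ("tech", "code server bug the"),
]
print(select_features(docs, top_k=4))
```

In this toy corpus the word "the" occurs in every document, so its chi-square score and its mutual information are both zero and it is discarded as noise. On a real cluster, `map_doc` and `reduce_counts` would correspond to the mapper and reducer of a counting job, with the scoring stages run over the aggregated counts.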