Computer Science (计算机科学), 2018, Vol. 45, Issue 11A: 468-473.
宋哲理1, 王超2, 王振飞3
SONG Zhe-li1, WANG Chao2, WANG Zhen-fei3
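Abstract: Feature selection is a key step in text classification, and classification accuracy depends largely on the quality of the selected feature words. This paper proposes a multi-level feature selection mechanism based on MapReduce. On the one hand, an improved CHI feature selection algorithm performs the initial screening, and a mutual information method then filters noise words out of the initial result and moves high-quality feature words to the front. On the other hand, the mechanism is loaded into the MapReduce model to reduce the time that multi-level feature selection takes on massive data. Experimental results show that the mechanism can process large-scale data in a relatively short time while also improving the accuracy of text classification.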
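The two-stage idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how CHI is improved, so the standard chi-square statistic stands in for it, and the map and reduce steps run in a single process rather than on a Hadoop cluster. All names here (`map_doc`, `reduce_counts`, `select_features`, `mi_threshold`) are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter
from functools import reduce

def map_doc(doc):
    """Map step: count term/label co-occurrence for one (label, tokens) document."""
    label, tokens = doc
    return Counter((t, label) for t in set(tokens)), Counter([label])

def reduce_counts(a, b):
    """Reduce step: merge the partial counts produced by two map calls."""
    return a[0] + b[0], a[1] + b[1]

def chi_square(a, b, c, d):
    """Standard chi-square of a 2x2 term/class table (assumption: not the improved CHI)."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def pointwise_mi(a, b, c, n):
    """log P(t, c) / (P(t) P(c)); -inf when the term never occurs in the class."""
    if a == 0:
        return float("-inf")
    return math.log(a * n / ((a + b) * (a + c)))

def select_features(docs, k, mi_threshold=0.0):
    term_class, class_counts = reduce(reduce_counts, map(map_doc, docs))
    n_docs = sum(class_counts.values())
    doc_freq = Counter()
    for (term, _), count in term_class.items():
        doc_freq[term] += count

    def tables(term):
        # a: in class with term, b: other classes with term,
        # c: in class without term, d: other classes without term
        for label, n_label in class_counts.items():
            a = term_class[(term, label)]
            b = doc_freq[term] - a
            c = n_label - a
            yield a, b, c, n_docs - a - b - c

    # Stage 1: keep the k terms with the highest chi-square score.
    chi = {t: max(chi_square(*tb) for tb in tables(t)) for t in doc_freq}
    stage1 = sorted(chi, key=lambda t: (-chi[t], t))[:k]

    # Stage 2: drop low-MI terms as noise and put high-MI terms first.
    mi = {t: max(pointwise_mi(a, b, c, n_docs) for a, b, c, _ in tables(t))
          for t in stage1}
    return sorted((t for t in stage1 if mi[t] > mi_threshold),
                  key=lambda t: (-mi[t], t))

docs = [
    ("sport", ["ball", "team", "the"]),
    ("sport", ["ball", "goal", "the"]),
    ("tech", ["code", "bug", "the"]),
    ("tech", ["code", "chip", "the"]),
]
print(select_features(docs, k=2))
```

In this toy corpus the word "the" appears in every document, so it carries no class information and the mutual information stage discards it even when the first stage lets it through. In a real MapReduce deployment the map and reduce functions would run as distributed tasks over document shards; the in-process `reduce` call only mimics that data flow.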