Computer Science, 2018, Vol. 45, Issue (12): 148-152. doi: 10.11896/j.issn.1002-137X.2018.12.023
彭徵, 王灵矫, 郭华
PENG Zheng, WANG Ling-jiao, GUO Hua
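Abstract: Text classification is a core technology of information retrieval. Traditional text classification systems are limited by the computing and storage capacity of a single machine, so they are no longer suited to the big-data era. Running classification algorithms in parallel on the Spark big-data platform, and using data and task parallelism to improve their efficiency, is both practical and urgent. This paper proposes an improved random forest algorithm for imbalanced data. It builds a new training set by under-sampling the majority class and sampling the minority class with replacement, which reduces the effect of imbalanced data on the random forest. Experimental results show that the new algorithm improves classification accuracy on the minority class of imbalanced data sets.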
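The resampling step described in the abstract could look roughly like the following Python sketch. The abstract does not give the sampling ratios, the target class sizes, or the Spark implementation, so the function name `rebalance`, the `n_per_class` parameter, and the choice to draw the same number of samples from every class are illustrative assumptions, not the authors' method.

```python
import random
from collections import Counter


def rebalance(samples, labels, majority_label, n_per_class, seed=0):
    """Build a new training set from an imbalanced one (illustrative sketch).

    The majority class is under-sampled (drawn without replacement) and
    every other class is sampled with replacement. `n_per_class` is a
    hypothetical parameter; the source does not state the target sizes.
    """
    rng = random.Random(seed)

    # Group the samples by class label.
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    new_samples, new_labels = [], []
    for y, xs in by_class.items():
        if y == majority_label:
            # Under-sampling: no repeats, never more than what is available.
            picked = rng.sample(xs, min(n_per_class, len(xs)))
        else:
            # Sampling with replacement: the same item may appear several times.
            picked = rng.choices(xs, k=n_per_class)
        new_samples.extend(picked)
        new_labels.extend([y] * len(picked))
    return new_samples, new_labels


# Toy data: 90 majority-class items and 10 minority-class items.
X = list(range(100))
y = ["maj"] * 90 + ["min"] * 10
Xb, yb = rebalance(X, y, majority_label="maj", n_per_class=30)
print(Counter(yb))
```

In a random forest, a step like this would plausibly be applied once per tree so that each tree sees a different rebalanced sample, but the abstract does not say where in the training pipeline the resampling happens.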