Computer Science ›› 2018, Vol. 45 ›› Issue (12): 148-152. doi: 10.11896/j.issn.1002-137X.2018.12.023

• Artificial Intelligence •

Parallel Text Categorization of Random Forest

PENG Zheng, WANG Ling-jiao, GUO Hua   

  1. (The College of Information Engineering, Xiangtan University, Xiangtan, Hunan 411105, China)
  • Received: 2017-10-22 Online: 2018-12-15 Published: 2019-02-25

Abstract: Text categorization is one of the core technologies of information retrieval. Because the computing performance and storage capacity of a single computer are limited, traditional text categorization methods are ill-suited to the big data era. It is therefore both practical and urgent to run text classification algorithms in parallel, improving their efficiency through data and task parallelism on the Spark big data platform. This paper proposes an improved random forest algorithm for imbalanced data. It reduces the impact of imbalanced data on random forests by under-sampling the majority-class samples and sampling the minority-class samples with replacement to form new training samples. The experimental results show that the new algorithm improves the categorization accuracy of the minority classes when handling imbalanced data sets.
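
The sampling idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the target class size or the exact sampling procedure, so the function name `balanced_sample`, the `per_class` parameter, and the default of using the smallest class size are all assumptions made for illustration. The minority-class step is read here as sampling with replacement, which is also an assumption. The sketch uses only the Python standard library and returns the indices of a class-balanced training set that could be used to grow one tree.

```python
import random
from collections import defaultdict


def balanced_sample(labels, per_class=None, seed=0):
    """Return indices of a class-balanced training set (illustrative only).

    Classes with more than `per_class` samples are under-sampled without
    replacement; all other classes are sampled with replacement.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Assumption: if no target size is given, use the smallest class size.
    if per_class is None:
        per_class = min(len(indices) for indices in by_class.values())

    chosen = []
    for indices in by_class.values():
        if len(indices) > per_class:
            # Majority class: under-sample without replacement.
            chosen.extend(rng.sample(indices, per_class))
        else:
            # Minority class: sample with replacement.
            chosen.extend(rng.choices(indices, k=per_class))
    return chosen


if __name__ == "__main__":
    labels = ["majority"] * 90 + ["minority"] * 10
    picked = balanced_sample(labels, seed=1)
    print(len(picked), "indices drawn from", len(labels), "samples")
```

Drawing a separate balanced set per tree (for example, by varying `seed`) keeps each draw independent of the others, which is what would make this step easy to distribute across workers; how the paper actually schedules this on Spark is not stated in the abstract.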

Key words: Text categorization, Spark, Random forest, Imbalanced data, Parallelization

CLC Number: TP311