Computer Science, 2018, Vol. 45, Issue (12): 148-152. doi: 10.11896/j.issn.1002-137X.2018.12.023

• Artificial Intelligence •

Parallel Text Categorization of Random Forest

PENG Zheng, WANG Ling-jiao, GUO Hua   

  1. (The College of Information Engineering, Xiangtan University, Xiangtan, Hunan 411105, China)
  • Received: 2017-10-22 Online: 2018-12-15 Published: 2019-02-25

Abstract: Text categorization is one of the core technologies of information retrieval. Because a single computer has limited computing performance and storage capacity, traditional text categorization methods are poorly suited to the big data era. It is therefore both practical and urgent to run text classification algorithms in parallel, improving their efficiency by parallelizing data and tasks on the Spark big data platform. This paper proposes an improved random forest algorithm for imbalanced data. It reduces the impact of imbalanced data on random forests by under-sampling the majority class samples and back-sampling the minority class samples to form new training samples. Experimental results show that the new algorithm improves the categorization accuracy of the minority classes when handling imbalanced data sets.
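
The sampling step described in the abstract can be illustrated with a minimal single-machine sketch. It is not the paper's implementation: the abstract gives no sampling ratios, so the `per_class` target size and the function name `balanced_sample` are hypothetical, and "back-sampling" is read here as sampling with replacement, which is an assumption. The sketch only shows how one balanced training sample might be drawn; it does not cover Spark or tree construction.

```python
import random
from collections import defaultdict


def balanced_sample(labels, per_class, seed=0):
    """Return row indices for one balanced training sample.

    Assumption: every class is brought to `per_class` rows. Classes with
    at least that many rows are under-sampled without replacement; smaller
    classes are sampled with replacement.
    """
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    chosen = []
    for label in sorted(by_class):
        rows = by_class[label]
        if len(rows) >= per_class:
            # Majority class: keep a random subset, no duplicates.
            chosen.extend(rng.sample(rows, per_class))
        else:
            # Minority class: draw with replacement, duplicates allowed.
            chosen.extend(rng.choices(rows, k=per_class))

    rng.shuffle(chosen)
    return chosen


# Toy imbalanced label set: 90 documents in one class, 10 in the other.
labels = ["sports"] * 90 + ["finance"] * 10
idx = balanced_sample(labels, per_class=30, seed=42)
```

In a random forest setting, each tree would plausibly draw its own sample with a different seed, so that the minority class is equally represented in every tree's training set while the trees still differ from one another.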

Key words: Text categorization, Spark, Random forest, Imbalanced data, Parallelization

CLC Number: TP311