Computer Science ›› 2018, Vol. 45 ›› Issue (12): 148-152. doi: 10.11896/j.issn.1002-137X.2018.12.023

• Artificial Intelligence •

Parallel Text Categorization of Random Forest

PENG Zheng, WANG Ling-jiao, GUO Hua   

  1. (The College of Information Engineering, Xiangtan University, Xiangtan, Hunan 411105, China)
  • Received: 2017-10-22 Online: 2018-12-15 Published: 2019-02-25

Abstract: Text categorization is one of the core technologies of information retrieval. Because the computing performance and storage capacity of a single computer are limited, traditional text categorization methods are no longer suitable for the big data era. It is therefore both practical and urgent to run text classification algorithms in parallel, improving their efficiency through data and task parallelism on the Spark big data platform. This paper proposes an improved random forest algorithm for imbalanced data. It reduces the impact of imbalanced data on random forests by under-sampling the majority-class samples and sampling the minority-class samples with replacement to form new training samples. Experimental results show that the new algorithm improves categorization accuracy on the minority classes when handling imbalanced data sets.
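
The sampling step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `balanced_sample`, the `per_class` parameter, and the choice to default the per-class size to the smallest class are all assumptions made for illustration, since the abstract does not state the sampling sizes. The sketch under-samples larger classes without replacement and samples smaller classes with replacement, producing one balanced training set of the kind that could be fed to a single tree.

```python
import random
from collections import defaultdict


def balanced_sample(samples, labels, per_class=None, seed=None):
    """Build one class-balanced training set (illustrative sketch).

    Assumption (not from the paper): every class contributes `per_class`
    samples, defaulting to the size of the smallest class.
    """
    rng = random.Random(seed)

    # Group the samples by class label.
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)

    if per_class is None:
        per_class = min(len(xs) for xs in by_class.values())

    out_x, out_y = [], []
    for y, xs in by_class.items():
        if len(xs) > per_class:
            # Majority-like class: under-sample without replacement.
            chosen = rng.sample(xs, per_class)
        else:
            # Minority-like class: sample with replacement.
            chosen = [rng.choice(xs) for _ in range(per_class)]
        out_x.extend(chosen)
        out_y.extend([y] * per_class)
    return out_x, out_y


# Hypothetical data: 90 majority-class and 10 minority-class documents.
docs = [f"maj{i}" for i in range(90)] + [f"min{i}" for i in range(10)]
labels = ["A"] * 90 + ["B"] * 10
x, y = balanced_sample(docs, labels, seed=0)
```

Calling the function once per tree with a different seed would give each tree its own balanced sample; on a platform such as Spark, those per-tree calls are independent and could in principle be distributed across workers.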

Key words: Imbalanced data, Parallelization, Random forest, Spark, Text categorization

CLC Number: TP311