Computer Science, 2021, Vol. 48, Issue (9): 118-124. doi: 10.11896/jsjkx.210400280
Special topic: Intelligent Data Governance Technologies and Systems
戴宏亮1, 钟国金1, 游志铭1, 戴宏明2
DAI Hong-liang1, ZHONG Guo-jin1, YOU Zhi-ming1, DAI Hong-ming2
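Abstract: With the continuous development of mobile internet technology, social media has become the main platform where the public shares opinions and expresses emotions, and sentiment analysis of social media text during major social events can effectively monitor public opinion. To address the low accuracy and running efficiency of existing sentiment analysis algorithms for Chinese social media, this paper proposes an ensemble sentiment big-data analysis method based on the Spark distributed system (Spark Feature Weighted Stacking, S-FWS). The method first discovers new words using Jieba pre-segmentation and PMI association scores. It then extracts text features with a hybrid approach that accounts for word importance, and applies Lasso for feature selection. Finally, to remedy the traditional Stacking framework's neglect of feature importance, it weights the class-probability features with the accuracy of the base learners, constructs polynomial features, and trains the secondary learner on them. Comparative experiments against multiple algorithms, in both single-machine mode and on the Spark platform, show that S-FWS has certain advantages in accuracy and running time, that the distributed system substantially improves running efficiency, and that running time decreases gradually as the number of cluster worker nodes increases.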
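The weighting step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact weighting formula, so the code below makes assumptions: each base learner's accuracy is normalised so the weights sum to 1, the weighted class probabilities are flattened into one vector, and polynomial terms up to a chosen degree are appended. The function name `weighted_meta_features` and its parameters are hypothetical and are not taken from the paper.

```python
from itertools import combinations_with_replacement


def weighted_meta_features(probas, accuracies, degree=2):
    """Build secondary-learner inputs for one sample (illustrative only).

    probas: one class-probability vector per base learner.
    accuracies: validation accuracy of each base learner.
    degree: highest polynomial degree to append (assumed default: 2).
    """
    # Assumption: weights are accuracies normalised to sum to 1.
    total = sum(accuracies)
    weights = [a / total for a in accuracies]

    # Scale each learner's class probabilities by its weight, then flatten.
    weighted = [w * p for w, vec in zip(weights, probas) for p in vec]

    # Append products of weighted features for degrees 2..degree.
    features = list(weighted)
    for d in range(2, degree + 1):
        for combo in combinations_with_replacement(range(len(weighted)), d):
            term = 1.0
            for i in combo:
                term *= weighted[i]
            features.append(term)
    return features


# Two base learners, binary sentiment classes (made-up numbers).
example = weighted_meta_features([[0.8, 0.2], [0.6, 0.4]], [0.9, 0.6])
```

In a pipeline like the one the abstract outlines, vectors of this kind would be computed per sample from held-out base-learner predictions and passed to the secondary learner. On Spark the same function could plausibly be applied row by row, though the paper's actual implementation may differ.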