Computer Science, 2021, Vol. 48, Issue (9): 118-124. doi: 10.11896/jsjkx.210400280

Special topic: Intelligent Data Governance Technologies and Systems

Public Opinion Sentiment Big Data Analysis Ensemble Method Based on Spark

DAI Hong-liang1, ZHONG Guo-jin1, YOU Zhi-ming1, DAI Hong-ming2

  1 School of Economics and Statistics, Guangzhou University, Guangzhou 510006, China
  2 School of Software, South China University of Technology, Guangzhou 510006, China
  • Received: 2021-04-26 Revised: 2021-06-16 Online: 2021-09-15 Published: 2021-09-10
  • Corresponding author: DAI Hong-ming (1355369191@qq.com)
  • Author e-mail: hldai618@gzhu.edu.cn
  • About author: DAI Hong-liang, born in 1978, Ph.D., professor, postdoctoral supervisor, is a member of China Computer Federation. His main research interests include machine learning and big data analysis.
    DAI Hong-ming, born in 1978, Ph.D., associate professor, is a member of China Computer Federation. His main research interests include machine learning, big data analysis, and software engineering.
  • Supported by:
    National Social Science Foundation of China (18BTJ029)

Abstract: With the development of mobile Internet technology, social media has become the main platform on which the public shares views and expresses emotions, so sentiment analysis of social media text during major social events is an effective way to monitor public opinion. To address the low accuracy and low running efficiency of existing sentiment analysis algorithms for Chinese social media, this paper proposes an ensemble big-data sentiment analysis method built on the Spark distributed system, Spark Feature Weighted Stacking (S-FWS). First, new words are discovered by pre-segmenting the text with the Jieba library and computing the pointwise mutual information (PMI) association degree. Second, text features are extracted with a hybrid scheme that takes word importance into account, and Lasso is used for feature selection. Finally, to remedy the traditional Stacking framework's neglect of feature importance, the class-probability features produced by the primary learners are weighted using the accuracy of those learners, and polynomial features are constructed from them to train the secondary learner. Comparative experiments against a variety of algorithms are carried out in stand-alone mode and on the Spark platform, respectively. The results show that the proposed S-FWS method has some advantage in both accuracy and time consumption, that the distributed system greatly improves the running efficiency of the algorithms, and that the running time gradually decreases as worker nodes are added to the cluster.

Key words: Chinese social media, Public opinion, Sentiment analysis, Spark, Stacking
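To make the feature-weighting step in the abstract more concrete, the following is a minimal single-machine sketch of how accuracy-weighted class probabilities and polynomial features could be assembled for a secondary learner. It is not the authors' implementation: the abstract does not give the exact weighting formula or polynomial degree, so the normalization of accuracies to sum to 1, the degree of 2, and all function names and sample values here are assumptions made for illustration.

```python
from itertools import combinations_with_replacement


def accuracy_weights(accuracies):
    """Normalize primary-learner accuracies so they sum to 1 (assumed scheme)."""
    total = sum(accuracies)
    return [a / total for a in accuracies]


def weighted_meta_features(prob_rows, accuracies, degree=2):
    """Build secondary-learner inputs from primary-learner class probabilities.

    prob_rows:  one row per sample; each row holds the positive-class
                probability predicted by each primary learner.
    accuracies: validation accuracy of each primary learner, same order.
    degree:     highest polynomial degree to expand to (assumed to be 2).
    """
    weights = accuracy_weights(accuracies)
    features = []
    for row in prob_rows:
        # Scale each learner's probability by that learner's weight.
        weighted = [w * p for w, p in zip(weights, row)]
        expanded = list(weighted)
        # Append every product of the weighted probabilities up to `degree`.
        for d in range(2, degree + 1):
            for combo in combinations_with_replacement(range(len(weighted)), d):
                term = 1.0
                for i in combo:
                    term *= weighted[i]
                expanded.append(term)
        features.append(expanded)
    return features


# Hypothetical outputs of three primary learners for two samples.
probs = [[0.9, 0.6, 0.8], [0.2, 0.4, 0.1]]
accs = [0.90, 0.75, 0.85]

meta = weighted_meta_features(probs, accs)
# Each row has 3 weighted probabilities plus 6 degree-2 terms.
print(len(meta), len(meta[0]))
```

The rows in `meta` would then be passed to whatever secondary learner is chosen. In a Spark setting the same arithmetic would be applied per row across partitions, which is where the distributed speed-up described in the abstract would come from.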

CLC Number: TP391