Computer Science ›› 2022, Vol. 49 ›› Issue (2): 256-264. doi: 10.11896/jsjkx.201200082

• Artificial Intelligence •

Improved Topic Sentiment Model with Word Embedding Based on Gaussian Distribution

LI Yu-qiang1, ZHANG Wei-jiang1, HUANG Yu1, LI Lin1, LIU Ai-hua2

  1 School of Computer Science and Technology, Wuhan University of Technology, Wuhan 430063, China
  2 School of Energy and Power Engineering, Wuhan University of Technology, Wuhan 430063, China
  • Received: 2020-12-08 Revised: 2021-03-13 Online: 2022-02-15 Published: 2022-02-23
  • Corresponding author: ZHANG Wei-jiang (913114863@qq.com)
  • About author: LI Yu-qiang (liyuqiang@whut.edu.cn), born in 1977, Ph.D, associate professor, master's supervisor. His main research interests include machine learning and big data analysis.
    ZHANG Wei-jiang, born in 1994, postgraduate. His main research interests include machine learning and big data analysis.
  • Supported by:
    National Social Science Foundation of China (15BGL048).

Abstract: In recent years, joint topic sentiment models have become an important research topic in unsupervised learning, with practical applications in text topic mining and sentiment analysis. In real-world settings, however, Weibo posts are short and structurally incomplete, which poses challenges for such models. This paper therefore studies and improves topic sentiment modeling for Weibo. We introduce word vectors into a currently popular topic sentiment model, TSMMF (Topic Sentiment Model Based on Multi-feature Fusion). A multivariate Gaussian distribution is used to quickly sample neighboring words from the word embedding space, and the sampled words replace those generated by the original Dirichlet-multinomial distribution, so that words with low co-occurrence frequency and little information are turned into words with a prominent topic and clear information. In addition, a nearest neighbor search algorithm is used to further speed up the model on large-scale Weibo corpora. The resulting model is called GWE-TSMMF. Comparative experiments show that GWE-TSMMF reaches an average F1 value of about 0.718, and that its Weibo sentiment polarity analysis is significantly better than that of the original model and of the existing mainstream word embedding topic sentiment models (WS-TSWE and HST-SCW).

Key words: Gaussian distribution, Topic sentiment model, Weibo sentiment polarity analysis, Word embedding

CLC Number:

  • TP391
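
The word-replacement idea summarized in the abstract (draw a point from a multivariate Gaussian in the embedding space, then map it back to the closest vocabulary word) can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration and is not taken from the paper: the function name `sample_neighbor_word`, the isotropic covariance `sigma^2 * I` centered on the original word's vector, the toy two-dimensional embeddings, and the brute-force Euclidean search, which stands in for whatever nearest neighbor search algorithm the authors actually use. How the replacement is integrated into the TSMMF sampler is not shown.

```python
import numpy as np


def sample_neighbor_word(word_id, embeddings, sigma=0.1, rng=None):
    """Return the id of a word near `word_id` in embedding space.

    Assumption (not from the paper): the Gaussian is centered on the
    word's own vector with isotropic covariance sigma^2 * I.
    """
    rng = np.random.default_rng() if rng is None else rng
    mean = embeddings[word_id]
    cov = (sigma ** 2) * np.eye(embeddings.shape[1])
    # Draw one point from the multivariate Gaussian around the word vector.
    draw = rng.multivariate_normal(mean, cov)
    # Brute-force nearest neighbor by Euclidean distance; a spatial index
    # could replace this step for a large vocabulary.
    dists = np.linalg.norm(embeddings - draw, axis=1)
    return int(np.argmin(dists))


# Hypothetical toy vocabulary and embeddings, for illustration only.
vocab = ["good", "great", "bad", "awful"]
embeddings = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [-1.0, 0.0],
    [-0.9, -0.1],
])

rng = np.random.default_rng(0)
new_id = sample_neighbor_word(0, embeddings, sigma=0.05, rng=rng)
print(vocab[new_id])
```

With a small `sigma`, the sampled replacement stays among the close neighbors of the original word ("good" or "great" in this toy setup). A larger `sigma` would allow replacements that are further away in the embedding space.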