Computer Science, 2019, Vol. 46, Issue (8): 260-265. doi: 10.11896/j.issn.1002-137X.2019.08.043

• Artificial Intelligence •

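LDA Algorithm Based on Dynamic Weight

JU Ya-ya, YANG Lu, YAN Jian-feng

  1. (School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006, China)
  • Received: 2018-07-14 Online: 2019-08-15 Published: 2019-08-15
  • Corresponding author: YANG Lu (born 1982), female, associate professor, master's supervisor; main research interests: machine learning and software engineering. E-mail: yanglu@suda.edu.cn
  • About the authors: JU Ya-ya (born 1989), female, master's student; main research interest: machine learning. E-mail: yayaju@163.com. YAN Jian-feng (born 1978), male, associate professor, master's supervisor; main research interest: machine learning.
  • Supported by:
    the National Natural Science Foundation of China (61572339, 61272449) and the Key Project of the Jiangsu Province Science and Technology Support Program (BE2014005)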

Abstract: Latent Dirichlet allocation (LDA) is a popular three-layer probabilistic topic model that clusters documents, and the words within them, at the topic level. The model rests on the bag-of-words (BOW) assumption, under which every word is equally important. This assumption simplifies modeling, but it biases the topic distributions toward high-frequency words and weakens the semantic coherence of the topics. To address this problem, an LDA algorithm based on dynamic weights is proposed. Its basic idea is that words differ in importance during modeling: in each iteration, a weight is generated dynamically for each word from that word's topic distribution and is fed back into topic modeling, which reduces the influence of high-frequency words and raises the importance of keywords. Experiments on four public datasets show that the dynamic-weight LDA algorithm outperforms currently popular LDA inference algorithms in topic semantic coherence, text classification accuracy, generalization performance, and precision.

Key words: Dynamic weight, Latent Dirichlet allocation, Topic model

CLC Number:

  • TP391
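
The abstract states that word weights are generated from each word's topic distribution but does not give the formula. The sketch below is only an illustration of that idea under a stated assumption: the weight of a word is taken to be one minus the normalized entropy of its topic distribution, so a word spread evenly over all topics (typical of high-frequency function words) receives a weight near 0 and a word concentrated in a single topic receives a weight near 1. The function name `topic_entropy_weights`, the `smoothing` parameter, and the toy counts are hypothetical and are not taken from the paper.

```python
import math


def topic_entropy_weights(word_topic_counts, smoothing=0.01):
    """Hypothetical weighting scheme (an assumption, not the paper's formula).

    word_topic_counts: list of rows, one per word; each row holds that
    word's counts per topic. At least two topics are required.
    Returns one weight per word: 1 - H(p(topic | word)) / log(K).
    """
    weights = []
    for counts in word_topic_counts:
        num_topics = len(counts)
        # Additive smoothing keeps every probability positive, so log is defined.
        total = sum(counts) + smoothing * num_topics
        probs = [(c + smoothing) / total for c in counts]
        entropy = -sum(p * math.log(p) for p in probs)
        # Normalize by the maximum possible entropy, log(K).
        weights.append(1.0 - entropy / math.log(num_topics))
    return weights


# Toy word-topic count matrix (rows: words, columns: topics); invented data.
counts = [
    [50, 48, 52],  # frequent word spread almost evenly over the topics
    [0, 30, 1],    # word concentrated in a single topic
]
weights = topic_entropy_weights(counts)
```

In an iterative inference procedure, such weights could be recomputed from the current word-topic counts after each sweep and used to scale each word's contribution to the counts in the next sweep; how the paper actually applies its weights is not specified in the abstract.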