Computer Science (计算机科学), 2019, Vol. 46, Issue 8: 260-265. doi: 10.11896/j.issn.1002-137X.2019.08.043
居亚亚, 杨璐, 严建峰
JU Ya-ya, YANG Lu, YAN Jian-feng
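Abstract: Latent Dirichlet Allocation (LDA) is a popular three-layer probabilistic topic model that clusters documents, and the words within them, at the topic level. The model rests on the bag-of-words (BOW) assumption, under which all words are equally important. This simplifies modeling, but it biases the topic distributions toward high-frequency words and degrades the semantic coherence of the topics. To address this problem, a dynamic-weight LDA algorithm is proposed. Its basic idea is that each word carries a different importance in modeling: during iteration, a weight is generated dynamically for each word from that word's topic distribution, and the weight in turn feeds back into topic modeling. This reduces the influence of high-frequency words and raises the importance of keywords. Experiments on four public datasets show that the dynamic-weight LDA algorithm outperforms currently popular LDA inference algorithms in topic semantic coherence, text classification accuracy, generalization performance, and precision.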
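The abstract does not give the weighting formula, so the following is only a minimal sketch of one plausible way to derive a per-word weight from that word's topic distribution. It assumes an entropy-based weight: a word whose counts are spread evenly across topics (typical of high-frequency function words) gets a weight near 0, and a word concentrated in one topic gets a weight near 1. The function name `entropy_weights`, the `smoothing` parameter, and the entropy formula itself are illustrative assumptions, not the authors' method.

```python
import math


def entropy_weights(topic_word_counts, smoothing=0.01):
    """Return one weight in [0, 1] per vocabulary word.

    topic_word_counts[k][w] is the number of tokens of word w currently
    assigned to topic k (for example, the counts kept by a Gibbs sampler).
    Assumption: weight = 1 - H(topic | word) / log(K). This is a
    hypothetical weighting chosen for illustration only.
    Requires at least two topics so that log(K) is non-zero.
    """
    num_topics = len(topic_word_counts)
    if num_topics < 2:
        raise ValueError("need at least two topics")
    vocab_size = len(topic_word_counts[0])
    max_entropy = math.log(num_topics)

    weights = []
    for w in range(vocab_size):
        # Smoothed counts of word w in every topic, so no probability is zero.
        column = [topic_word_counts[k][w] + smoothing for k in range(num_topics)]
        total = sum(column)
        # Shannon entropy of the word's distribution over topics.
        entropy = -sum((c / total) * math.log(c / total) for c in column)
        weights.append(1.0 - entropy / max_entropy)
    return weights


# Toy counts for 2 topics and 2 words: word 0 is split evenly across
# topics, word 1 appears in topic 0 only.
counts = [[50, 10],
          [50, 0]]
weights = entropy_weights(counts)
```

In this toy case word 0 receives a weight close to 0 and word 1 a weight close to 1. In a dynamic scheme of the kind the abstract describes, such weights would be recomputed from the current counts at each iteration and used to scale each token's contribution to the count updates; how the paper actually does this is not stated here.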