Computer Science (计算机科学), 2018, Vol. 45, Issue 9: 266-270. doi: 10.11896/j.issn.1002-137X.2018.09.044
ZHANG Xiao-chuan (张小川), YU Lin-feng (余林峰), ZHANG Yi-hao (张宜浩)
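Abstract: In recent years, the LDA (Latent Dirichlet Allocation) topic model has been used to represent text by mining its latent semantic topics, offering a new approach to computing short-text similarity. Because short texts have sparse features, applying the LDA topic model directly tends to produce inaccurate similarity results. To address this problem, a short-text similarity algorithm based on LDA with multi-feature fusion is proposed. The method fuses a topic similarity factor ST (Similarity Topic) with a word co-occurrence factor CW (Co-occurrence Words) and builds a joint similarity model that specifies, for each ST interval, whether CW acts as a constraint on ST or as a supplement to it, so that the final similarity result is more accurate. Text clustering experiments with the improved algorithm show a certain degree of improvement in F-measure.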
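The abstract does not give the formulas for ST and CW, the ST interval boundaries, or the joint model itself. The following is therefore only a minimal sketch of the general idea, under stated assumptions: cosine similarity between document-topic distributions stands in for ST, Jaccard overlap of word sets stands in for CW, and the thresholds `low` and `high` and the mixing `weight` are hypothetical values chosen for illustration, not taken from the paper.

```python
import math


def topic_similarity(theta_a, theta_b):
    """Cosine similarity of two document-topic distributions (assumed stand-in for ST)."""
    dot = sum(x * y for x, y in zip(theta_a, theta_b))
    norm_a = math.sqrt(sum(x * x for x in theta_a))
    norm_b = math.sqrt(sum(y * y for y in theta_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def cooccurrence_degree(words_a, words_b):
    """Jaccard overlap of the two word sets (assumed stand-in for CW)."""
    set_a, set_b = set(words_a), set(words_b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def joint_similarity(st, cw, low=0.3, high=0.7, weight=0.5):
    """Hypothetical piecewise fusion of ST and CW; all parameters are illustrative.

    - High ST interval: CW constrains ST, so a pair with similar topics but
      few shared words is scored lower than ST alone.
    - Low ST interval: CW supplements ST, so shared words can raise a score
      that sparse topic features underestimated.
    - Middle interval: plain weighted average of the two factors.
    """
    if st >= high:
        return st * (weight + (1.0 - weight) * cw)
    if st <= low:
        return st + (1.0 - st) * weight * cw
    return (1.0 - weight) * st + weight * cw


if __name__ == "__main__":
    # Toy topic distributions and tokenized short texts (made-up inputs).
    st = topic_similarity([0.7, 0.2, 0.1], [0.6, 0.3, 0.1])
    cw = cooccurrence_degree(["lda", "topic", "model"], ["topic", "model", "text"])
    print(joint_similarity(st, cw))
```

In this sketch every branch keeps the result in [0, 1] as long as `st`, `cw`, and `weight` are in [0, 1]. An actual implementation would take the interval boundaries and combination rules from the paper's joint similarity model.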