Computer Science, 2018, Vol. 45, Issue (9): 266-270. doi: 10.11896/j.issn.1002-137X.2018.09.044

• Artificial Intelligence •

Multi-feature Fusion for Short Text Similarity Calculation Based on LDA

ZHANG Xiao-chuan, YU Lin-feng, ZHANG Yi-hao

  1. College of Computer Science and Engineering, Chongqing University of Technology, Chongqing 401320, China
  • Received: 2017-07-11 Online: 2018-09-20 Published: 2018-10-10
  • Corresponding author: YU Lin-feng (born 1992), male, master's student; main research interest: artificial intelligence. E-mail: 517844894@qq.com
  • About the authors: ZHANG Xiao-chuan (born 1965), male, master's degree, professor; main research interests: artificial intelligence and computer software. E-mail: zxc@cqut.edu.cn. ZHANG Yi-hao (born 1982), male, Ph.D.; main research interest: natural language processing.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (60443004), the Chongqing Major Science and Technology Project (cstc2013jcsf-jcssX0020), and the Chongqing Basic Science and Frontier Technology Research Program (cstc2015jcyjA40041).

Abstract: In recent years, the latent Dirichlet allocation (LDA) topic model, which represents a text by mining its latent semantic topics, has offered a new approach to short text similarity calculation. Because short texts have sparse features, applying the LDA topic model directly tends to produce inaccurate similarity results. To address this problem, this paper proposes an LDA-based multi-feature fusion algorithm for short text similarity. The method combines a topic similarity factor ST (Similarity Topic) with a word co-occurrence factor CW (Co-occurrence Words) in a joint similarity model, which specifies how CW constrains or supplements ST within different ST intervals and thereby yields a more accurate similarity result. A text clustering experiment was used to evaluate the improved algorithm, and the results show that it achieves a certain improvement in F-measure.

Key words: Co-occurrence words, LDA, Short text similarity, Similarity topics, Topic model

CLC number (Chinese Library Classification):

  • TP391
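
The abstract describes a joint model in which CW constrains or supplements ST depending on the interval ST falls in, but it does not give the formulas. The following Python sketch only illustrates that general idea under stated assumptions: cosine similarity between document-topic distributions stands in for ST, a shared-word ratio stands in for CW, and the thresholds `low` and `high` and the weight `weight` are hypothetical values chosen for illustration. None of these definitions is taken from the paper.

```python
import math


def topic_similarity(theta_a, theta_b):
    """ST stand-in: cosine similarity of two document-topic distributions.

    Assumption: the paper's actual ST definition is not given in the abstract.
    """
    dot = sum(a * b for a, b in zip(theta_a, theta_b))
    norm_a = math.sqrt(sum(a * a for a in theta_a))
    norm_b = math.sqrt(sum(b * b for b in theta_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cooccurrence(words_a, words_b):
    """CW stand-in: shared distinct words divided by the smaller vocabulary.

    Assumption: a simple overlap ratio, not the paper's CW formula.
    """
    set_a, set_b = set(words_a), set(words_b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / min(len(set_a), len(set_b))


def joint_similarity(st, cw, low=0.3, high=0.7, weight=0.5):
    """Piecewise fusion of ST and CW (hypothetical thresholds and weight)."""
    if st >= high:
        # High ST interval: CW acts as a constraint and can only lower the score.
        return st * (weight + (1 - weight) * cw)
    if st <= low:
        # Low ST interval: CW acts as a supplement and can only raise the score.
        return st + (1 - st) * weight * cw
    # Middle interval: plain weighted average of the two factors.
    return (1 - weight) * st + weight * cw


if __name__ == "__main__":
    # Toy inputs: topic distributions would normally come from a trained LDA model.
    st = topic_similarity([0.7, 0.2, 0.1], [0.6, 0.3, 0.1])
    cw = cooccurrence(["lda", "topic", "model"], ["topic", "model", "text", "short"])
    print(joint_similarity(st, cw))
```

In this sketch, two short texts with similar topic distributions but no shared words receive a reduced score, while texts with dissimilar topic distributions but heavy word overlap receive a raised one. That is one possible reading of "constraint or supplement"; the published model may differ.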