Text Similarity Calculation Algorithm Based on SA_LDA Model

Abstract

Computing text similarity underlies many text information processing techniques. However, the commonly used similarity methods based on the vector space model (VSM) suffer from high-dimensional, sparse representations and poor semantic sensitivity, so their results are unsatisfactory. Building on the traditional LDA (Latent Dirichlet Allocation) model, and addressing its need for the number of topics to be set manually, this paper proposes a self-adaptive LDA (SA_LDA) model that determines the number of topics through the model's own iterations. The model is then applied to text similarity calculation, which alleviates the high-dimensionality and sparsity problems to some extent. Experiments show that the method determines the number of topics automatically, and that computing text similarity with this model achieves higher accuracy and better text clustering results than VSM.

Key words: SA_LDA model, Text mining, Text similarity, Topic model
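The contrast the abstract draws between VSM and topic-based similarity can be illustrated with a minimal sketch. It is not the paper's SA_LDA implementation: the abstract does not say which similarity measure is applied to the topic distributions, so the Jensen-Shannon measure used here is an assumption, and the document-topic vectors `theta_a` and `theta_b` are made-up values standing in for the output of a trained topic model.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse term vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def js_similarity(p, q):
    """1 minus the base-2 Jensen-Shannon divergence of two distributions.

    Assumed measure (not taken from the paper); the result lies in [0, 1].
    """
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 1.0 - 0.5 * (kl(p, m) + kl(q, m))


# VSM view: two sentences about the same subject that share no terms.
doc_a = Counter("the car engine is loud".split())
doc_b = Counter("this automobile motor makes noise".split())
vsm_sim = cosine(doc_a, doc_b)

# Topic view: hypothetical document-topic distributions over 3 topics.
theta_a = [0.80, 0.15, 0.05]
theta_b = [0.75, 0.20, 0.05]
topic_sim = js_similarity(theta_a, theta_b)

print(f"VSM cosine similarity: {vsm_sim:.3f}")
print(f"Topic-based similarity: {topic_sim:.3f}")
```

With no shared terms the term vectors are orthogonal, so the VSM score is zero even though the sentences are related, while the low-dimensional topic vectors can still register the overlap. This is the semantic-sensitivity and sparsity issue the abstract refers to.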

CLC Number: TP391

QIU Xian-biao, CHEN Xiao-rong. Text Similarity Calculation Algorithm Based on SA_LDA Model[J]. Computer Science, 2018, 45(6A): 106-109. https://doi.org/


