Computer Science, 2018, Vol. 45, Issue (6A): 106-109.
QIU Xian-biao (邱先标), CHEN Xiao-rong (陈笑蓉)
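Abstract: Computing text similarity underlies many text information processing techniques. However, the commonly used similarity methods based on the vector space model (VSM) suffer from high-dimensional sparsity and poor sensitivity to semantics, so their results are unsatisfactory. Building on the traditional LDA (Latent Dirichlet Allocation) model, and addressing its need for a manually specified number of topics, this paper proposes a self-adaptive LDA (SA_LDA) model that determines the number of topics through the model's own iterations. The model is then applied to text similarity computation, which alleviates the high-dimensional sparsity problem to some extent. Experiments show that the method determines the number of topics automatically, and that text similarity computed with this model is more accurate than with the VSM.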
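The contrast the abstract draws between VSM and topic-based similarity can be illustrated with a minimal sketch. This is not the SA_LDA method itself: the abstract does not state which similarity measure is used, so cosine similarity is an assumption here, and the topic distributions `theta_a` and `theta_b` are hand-written hypothetical values rather than output of any trained model. The example only shows why two documents with no shared terms score zero under VSM yet can still be close in a low-dimensional topic space.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def vsm_vectors(doc_a, doc_b):
    """Term-frequency vectors over the joint vocabulary of two tokenized documents."""
    vocab = sorted(set(doc_a) | set(doc_b))
    count_a, count_b = Counter(doc_a), Counter(doc_b)
    return [count_a[t] for t in vocab], [count_b[t] for t in vocab]


# Two toy documents on the same subject that share no surface terms.
doc_a = ["car", "engine", "road"]
doc_b = ["automobile", "motor", "highway"]

# VSM: the term vectors are orthogonal, so the similarity is zero.
vsm_sim = cosine(*vsm_vectors(doc_a, doc_b))

# Topic space: hypothetical document-topic distributions over 2 topics
# (assumed values for illustration, not produced by LDA or SA_LDA).
theta_a = [0.9, 0.1]
theta_b = [0.8, 0.2]
topic_sim = cosine(theta_a, theta_b)

print(f"VSM similarity:   {vsm_sim:.3f}")
print(f"Topic similarity: {topic_sim:.3f}")
```

In practice the topic distributions would come from a fitted topic model, and a divergence designed for probability distributions could replace cosine similarity; both choices are left open by the abstract.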