Text Similarity Calculation Algorithm Based on SA_LDA Model

Abstract

Many information processing techniques rely on computing text similarity. However, traditional similarity calculation based on the vector space model (VSM) suffers from high dimensionality and poor semantic sensitivity, so its performance is unsatisfactory. This paper proposes a self-adaptive LDA (SA_LDA) model, built on the traditional LDA model, that automatically determines the number of topics. Applied to text similarity calculation, it addresses the high-dimensionality and sparsity problems. Experiments show that, compared with VSM, the method improves both the accuracy of similarity calculation and the quality of text clustering.

Key words: SA_LDA model, Text mining, Text similarity, Topic model
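
The contrast the abstract draws between VSM and topic-based similarity can be illustrated with a minimal sketch. It is not the SA_LDA method itself: the document-topic distributions `theta_a` and `theta_b` are hypothetical values written by hand rather than inferred by any model, the number of topics (3) is fixed for illustration, and cosine similarity is an assumed choice, since the abstract does not say which measure the paper uses.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def vsm_vectors(docs):
    """Build term-count vectors over the shared vocabulary (one dimension per word)."""
    vocab = sorted({w for d in docs for w in d.split()})
    counts = [Counter(d.split()) for d in docs]
    return vocab, [[c[w] for w in vocab] for c in counts]


# Two documents about the same subject that share no surface words.
doc_a = "stock market shares rise"
doc_b = "equity prices climb today"

# VSM: the vector length equals the vocabulary size, and with no shared
# terms the two vectors are orthogonal.
vocab, (vec_a, vec_b) = vsm_vectors([doc_a, doc_b])
vsm_sim = cosine(vec_a, vec_b)

# Topic space: HYPOTHETICAL document-topic distributions over 3 topics.
# In practice these would come from a fitted topic model.
theta_a = [0.80, 0.15, 0.05]
theta_b = [0.75, 0.20, 0.05]
topic_sim = cosine(theta_a, theta_b)

print("VSM dimensions:", len(vocab), "similarity:", round(vsm_sim, 3))
print("Topic dimensions:", len(theta_a), "similarity:", round(topic_sim, 3))
```

The word-count vectors grow with the vocabulary and score the two documents as unrelated, while the much shorter topic vectors can still register that both documents concern the same subject.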

CLC Number: TP391

QIU Xian-biao, CHEN Xiao-rong. Text Similarity Calculation Algorithm Based on SA_LDA Model [J]. Computer Science, 2018, 45(6A): 106-109.

