Computer Science ›› 2012, Vol. 39 ›› Issue (9): 175-179.

• Artificial Intelligence •

Study of Automatic Keywords Labeling for Scientific Literature

NI Na, LIU Kai, LI Yao-dong

  1. (State Key Laboratory of Management and Control for Complex Systems (in preparation), Institute of Automation, Chinese Academy of Sciences, Beijing 100190); (Distribution Product R&D Department, R&D Center, TravelSky Technology Limited, Beijing 100029)
  • Online: 2018-11-16  Published: 2018-11-16


Abstract: Unlabeled or missing keywords make the classification and navigation of scientific literature difficult. To address this problem, this paper proposes an automatic keyword labeling algorithm based on the content of literature abstracts. The algorithm uses abstracts already labeled with keywords as training text, and models the abstracts and keywords in the training set with a language model (LM), the Latent Dirichlet Allocation (LDA) model, the Probabilistic Author-Topic model, and a combined LM+LDA model, building relations between keywords and the feature terms that make up the abstract text. The trained models are then used to predict keywords for abstracts of scientific literature that lack them. Experimental results on both Chinese and English data sets show that the automatically labeled keywords reflect the content of the literature well; among all models, the combined LM+LDA model performs best.
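The pipeline described in the abstract — train per-keyword models from abstracts that already carry author-supplied keywords, then rank candidate keywords for an unlabeled abstract — can be sketched with the simplest of the four variants, a unigram language model. Everything below (function names, the Dirichlet-smoothing choice, the toy corpus) is an illustrative assumption, not the paper's actual implementation:

```python
import math
from collections import Counter, defaultdict

def train_keyword_lms(labeled_abstracts):
    """labeled_abstracts: iterable of (abstract_text, [keywords]) pairs.
    Builds one unigram term-count model per keyword, plus a background model."""
    models = defaultdict(Counter)
    background = Counter()
    for text, keywords in labeled_abstracts:
        terms = text.lower().split()
        background.update(terms)
        for kw in keywords:
            models[kw].update(terms)
    return models, background

def score_keyword(model, background, terms, mu=10.0):
    """Log-likelihood of the abstract's terms under one keyword's unigram
    model, Dirichlet-smoothed toward the background collection model."""
    total = sum(model.values())
    bg_total = sum(background.values())
    vocab = len(background)
    score = 0.0
    for t in terms:
        # Add-one background estimate keeps every term's probability positive.
        p_bg = (background[t] + 1) / (bg_total + vocab + 1)
        score += math.log((model[t] + mu * p_bg) / (total + mu))
    return score

def predict_keywords(text, models, background, k=3):
    """Rank all known keywords by the likelihood of generating this abstract."""
    terms = text.lower().split()
    ranked = sorted(models,
                    key=lambda kw: score_keyword(models[kw], background, terms),
                    reverse=True)
    return ranked[:k]

if __name__ == "__main__":
    # Hypothetical two-document training set for illustration only.
    train = [("language model for text retrieval", ["language model"]),
             ("topic model latent dirichlet allocation", ["LDA"])]
    models, bg = train_keyword_lms(train)
    print(predict_keywords("a language model approach to retrieval", models, bg, k=1))
```

Smoothing toward the background model matters here: without it, any abstract term never seen in a keyword's training abstracts would zero out that keyword's likelihood. The LDA and LM+LDA variants in the paper replace or interpolate these unigram term probabilities with topic-based ones.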

Key words: Language model, Tag prediction, Latent Dirichlet Allocation, Probabilistic Author-Topic Model

