Computer Science, 2020, Vol. 47, Issue (11): 95-100. doi: 10.11896/jsjkx.190900012

• Database & Big Data & Data Science •

Domain Label Acquisition Method Based on SL-LDA Model

WANG Sheng1, ZHANG Yang-sen1,2, ZHANG Wen1, JIANG Yu-ru1,2, ZHANG Rui1

  1 Institute of Intelligent Information Processing, Beijing Information Science and Technology University, Beijing 100101, China
  2 Beijing Laboratory of National Economic Security Early Warning Engineering, Beijing 100044, China
  • Received: 2019-08-31 Revised: 2019-11-04 Online: 2020-11-15 Published: 2020-11-05
  • Corresponding author: ZHANG Yang-sen (zhangyangsen@163.com)
  • About author: WANG Sheng, born in 1996, postgraduate. His main research interests include natural language processing and machine learning.
    ZHANG Yang-sen, born in 1962, postdoctoral, professor, Ph.D. supervisor, is a member of China Computer Federation (CCF). His main research interests include natural language processing and artificial intelligence.
  • Author e-mail: 1028742881@qq.com
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61772081, 61602044) and Construction of Technological Innovation Service Capability - Construction of Research Base - Beijing Laboratory - National Economic Security Early Warning Project Beijing Laboratory Project (PXM2018_014224_000010).

Abstract: The development of science and technology poses new challenges for managing literature and scholars. To address the automatic management of massive scientific literature and scholars, this paper proposes a domain label acquisition method based on SL-LDA. Working from a large body of scientific literature, the distribution characteristics of the literature data are analyzed, and the SL-LDA topic model is constructed by introducing the word-frequency features of scientific literature. The topic model is used to extract "topic-phrase" pairs from the scientific literature of the same scholar, yielding the initial domain keywords. A domain taxonomy is then introduced: the extraction results of the topic model and the taxonomy labels are represented as vectors and, after position-feature weighting, similarity is used to map the results onto the taxonomy, giving the scholar's final domain labels. Experimental results show that, with the same amount of literature data, the label words obtained by the SL-LDA model are better and more accurate than those of the traditional LDA model, the statistics-based TF-IDF algorithm, and the graph-based TextRank algorithm, and the F1 value rises to 0.572, indicating that the SL-LDA-based domain label acquisition method is well suited to the academic field.

Key words: Domain labels, Label mapping, Scientific literature, SL-LDA model, Topic phrase extraction
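
The label-mapping step described in the abstract (vector representation, position-feature weighting, similarity-based mapping) can be illustrated with a minimal sketch. The abstract does not give the weighting scheme, the embedding model, or how scores are aggregated, so everything here is an assumption made for illustration: the function names `cosine` and `map_to_labels`, the toy 2-D vectors, the position weights (1.5 for a title phrase, 1.0 for a body phrase), and the choice to sum weighted cosine similarities per label.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def map_to_labels(phrases, label_vectors, top_k=1):
    """Rank taxonomy labels by position-weighted similarity to extracted phrases.

    phrases: list of (vector, position_weight) pairs, one per extracted phrase.
    label_vectors: dict mapping a taxonomy label to its vector.
    The summation below is a hypothetical aggregation, not the paper's formula.
    """
    scores = {
        label: sum(weight * cosine(vec, label_vec) for vec, weight in phrases)
        for label, label_vec in label_vectors.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k]


# Toy data (hypothetical): two taxonomy labels and two extracted phrases.
labels = {
    "natural language processing": [1.0, 0.0],
    "computer vision": [0.0, 1.0],
}
phrases = [
    ([0.9, 0.1], 1.5),  # phrase found in a title, given a larger assumed weight
    ([0.2, 0.8], 1.0),  # phrase found in body text
]

top_labels = map_to_labels(phrases, labels, top_k=1)
```

In this toy setup the title phrase dominates because of its larger weight, so the scholar is mapped to the label closest to that phrase even though the body phrase points elsewhere.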

CLC Number:

  • TP391.1