Domain Label Acquisition Method Based on SL-LDA Model

WANG Sheng1, ZHANG Yang-sen1,2, ZHANG Wen1, JIANG Yu-ru1,2, ZHANG Rui1   

  1. Institute of Intelligent Information Processing, Beijing Information Science and Technology University, Beijing 100101, China
  2. Beijing Laboratory of National Economic Security Early Warning Engineering, Beijing 100044, China
  • Received: 2019-08-31 Revised: 2019-11-04 Online: 2020-11-15 Published: 2020-11-05
  • About author: WANG Sheng, born in 1996, postgraduate. His main research interests include natural language processing and machine learning.
    ZHANG Yang-sen, born in 1962, postdoctoral, professor, Ph.D. supervisor, is a member of China Computer Federation (CCF). His main research interests include natural language processing and artificial intelligence.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61772081, 61602044) and Construction of Technological Innovation Service Capability-Construction of Research Base-Beijing Laboratory-National Economic Security Early Warning Project Beijing Laboratory Project (PXM2018_014224_000010).

Abstract: The development of science and technology poses new challenges for the management of scientific literature and scholars. To address the automatic management of massive collections of scientific literature and their authors, this paper proposes a domain label acquisition method based on SL-LDA. The distribution characteristics of scientific literature data are analyzed on a large corpus, and the SL-LDA topic model is constructed by introducing the word frequency feature of scientific literature. The topic model is used to extract topic phrases from the scientific literature of the same scholar, which gives the initial domain keywords. A domain system is then introduced, and the extraction results of the topic model are represented as vectors together with the system labels. After position feature weighting, similarity is used to map the extraction results onto the domain system, which yields the scholar's domain labels. Experimental results show that, compared with the traditional LDA model, the statistics-based TF-IDF algorithm, and the graph-based TextRank algorithm, the final label words obtained by the SL-LDA model are of better quality and higher accuracy for the same amount of literature data, and the F1 value rises to 0.572, indicating that the domain label acquisition method based on SL-LDA is well suited to the academic field.

Key words: Domain tags, Label mapping, Scientific literature, SL-LDA model, Topic phrase extraction

  • TP391.1
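
The label-mapping step summarized in the abstract (vector representation, position feature weighting, similarity-based mapping onto the domain system) might be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not specify the vector representation, the weighting scheme, or the similarity measure, so cosine similarity, the toy two-dimensional vectors, the weight values, and the function names `cosine` and `map_to_labels` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def map_to_labels(keywords, label_vectors, top_n=1):
    """Rank domain-system labels for one scholar.

    keywords:      list of (vector, position_weight) pairs, one per extracted
                   keyword; the weight stands in for the position feature.
    label_vectors: dict mapping each system label to its vector.
    """
    scores = {label: 0.0 for label in label_vectors}
    for vec, weight in keywords:
        for label, label_vec in label_vectors.items():
            # Each keyword votes for every label, scaled by its position weight.
            scores[label] += weight * cosine(vec, label_vec)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n]


# Toy data: hypothetical vectors and weights, not taken from the paper.
label_vectors = {
    "natural language processing": [1.0, 0.0],
    "computer vision": [0.0, 1.0],
}
keywords = [
    ([0.9, 0.1], 1.5),  # e.g. a keyword given a higher position weight
    ([0.2, 0.8], 1.0),  # e.g. a keyword given the default weight
]
top_labels = map_to_labels(keywords, label_vectors)
```

In this sketch the position weight simply scales each keyword's contribution to a label's score; other ways of combining weight and similarity would be equally compatible with the abstract.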
