Computer Science (计算机科学) ›› 2019, Vol. 46 ›› Issue (3): 275-282. doi: 10.11896/j.issn.1002-137X.2019.03.041

• Artificial Intelligence •

Frontier Scientific Keyword Extraction Based on Bibliometric and Crowdsourcing

LV Jia-gao1, LIANG Kui-yang2, CAI Wei3

  1. State Key Laboratory of Software Development Environment, Beihang University, Beijing 100191, China
  2. Beijing Guoke Zhiyuan Technology Co., Ltd., Beijing 100191, China
  3. Beijing Sci-Tech Information Center, Beijing 100085, China
  • Received: 2018-02-03  Revised: 2018-05-28  Online: 2019-03-15  Published: 2019-03-22
  • About the authors: LV Jia-gao (born 1995), male, master's student; his main research interest is data mining. CAI Wei (born 1976), female, assistant researcher; her main research interest is science and technology data analysis.
  • Funding: Supported by the National Key Research and Development Program of China, project "Crowd-Intelligence-Based Collaborative Development Technology for Professional Knowledge and Its Applications" (2017YFB1402403).

Abstract: With the rapid development of science and technology, the number of scientific papers grows every year, and extracting frontier scientific keywords from this mass of literature is a new challenge. Traditionally, experts analyze the literature by hand, which is slow and costly. This paper proposes an algorithm that combines bibliometric analysis with crowdsourcing. First, part-of-speech tagging is used to obtain the nouns in the papers. Next, a bibliometrics-based science and technology monitoring method selects potential scientific keywords from these nouns. Finally, data from a crowdsourcing platform is used to filter the potential keywords further. Experiments on English-language papers in computer science and biomedicine suggest that the algorithm is effective and more efficient than manual expert analysis, so it can assist experts in analyzing frontier scientific keywords. The algorithm can guide the extraction of frontier technology keywords and serves as a reference for more automatic and intelligent extraction in the future.

Key words: Bibliometrics, Crowdsourcing, Keyword extraction

CLC Number: TP391.1
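The abstract describes a three-stage pipeline (noun extraction by part-of-speech tagging, bibliometric screening, then filtering against crowdsourced data) but does not give the scoring rules. The Python sketch below is only an illustration of that flow under stated assumptions, not the authors' method: the input is assumed to be already tagged with Penn Treebank-style tags, the bibliometric step is replaced by a simple comparison of document-frequency shares between a recent year and earlier years, and the crowdsourcing step is modelled as a hypothetical table of mention counts. All function names, thresholds, and sample data are invented for the example.

```python
from collections import Counter


def extract_nouns(tagged_tokens):
    """Return the set of lower-cased tokens whose tag starts with NN (assumed Penn Treebank tags)."""
    return {word.lower() for word, tag in tagged_tokens if tag.startswith("NN")}


def document_share(docs):
    """Map each noun to the fraction of documents in which it appears."""
    if not docs:
        return {}
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(extract_nouns(doc))
    return {term: count / len(docs) for term, count in doc_freq.items()}


def candidate_keywords(docs_by_year, recent_year, min_share=0.5, min_growth=2.0):
    """Stand-in for the bibliometric step (an assumption, not the paper's formula):
    keep nouns that are common in recent_year and at least min_growth times
    more common than in all earlier years combined."""
    recent = document_share(docs_by_year.get(recent_year, []))
    earlier_docs = [doc for year, docs in docs_by_year.items()
                    if year < recent_year for doc in docs]
    earlier = document_share(earlier_docs)
    return sorted(term for term, share in recent.items()
                  if share >= min_share and share >= min_growth * earlier.get(term, 0.0))


def crowd_filter(candidates, crowd_counts, min_mentions=10):
    """Keep candidates mentioned at least min_mentions times in the
    (hypothetical) crowdsourced data."""
    return [term for term in candidates if crowd_counts.get(term, 0) >= min_mentions]


# Tiny invented corpus: each document is a list of (token, tag) pairs.
DOCS_BY_YEAR = {
    2016: [
        [("network", "NN"), ("is", "VBZ"), ("trained", "VBN")],
        [("database", "NN"), ("stores", "VBZ"), ("data", "NNS")],
    ],
    2017: [
        [("blockchain", "NN"), ("framework", "NN"), ("uses", "VBZ"), ("network", "NN")],
        [("blockchain", "NN"), ("stores", "VBZ"), ("data", "NNS")],
    ],
}
# Invented mention counts standing in for crowdsourcing-platform data.
CROWD_COUNTS = {"blockchain": 120, "network": 300}

candidates = candidate_keywords(DOCS_BY_YEAR, recent_year=2017)
print(candidates)                              # ['blockchain', 'framework']
print(crowd_filter(candidates, CROWD_COUNTS))  # ['blockchain']
```

In this toy run, "network" and "data" are rejected by the growth test because they were equally common in the earlier year, and "framework" passes the growth test but is dropped because the crowdsourced table has no mentions of it. A real system would substitute the paper's own monitoring indicators and platform data for both placeholder steps.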