Computer Science, 2018, Vol. 45, Issue (11): 220-225. doi: 10.11896/j.issn.1002-137X.2018.11.034

• Artificial Intelligence •

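Co-author and Affiliate Based Name Disambiguation Approach

SHANG Yu-ling1, CAO Jian-jun2, LI Hong-mei1, ZHENG Qi-bin1

  1. College of Command Information Systems, PLA University of Science and Technology, Nanjing 210007, China
  2. The 63rd Research Institute, National University of Defense Technology, Nanjing 210007, China
  • Received: 2017-10-25 Published: 2019-02-25
  • About the authors: SHANG Yu-ling (born 1990), female, master's student; her main research interests are data quality control and data governance; E-mail: 1533765046@qq.com. CAO Jian-jun (born 1975), male, Ph.D., associate research fellow; his main research interests are data quality control and data governance, and intelligent data analysis and applications; E-mail: jianjuncao@yeah.net (corresponding author). LI Hong-mei (born 1990), female, Ph.D. candidate; her main research interest is personalized recommendation. ZHENG Qi-bin (born 1990), male, Ph.D. candidate; his main research interests are data quality control and data governance.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61371196) and the China Postdoctoral Science Foundation (2015M582832).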

Abstract: Name disambiguation is an important research topic in entity resolution; it aims to distinguish the different people who share the same name. Conventional approaches rely on rich entity information and cannot disambiguate names when that information is scarce. This paper proposes a name disambiguation approach based on co-author and affiliation information. An entity relationship graph is constructed from the co-authorship relations among authors and the affiliation relations between authors and institutions, and a breadth-first search strategy is used to find the effective paths between each pair of same-name authors in the graph. The connection strength between two same-name authors is then computed from the length of the effective paths, the number of effective paths, and the types of the edges on those paths, and is compared with a threshold to achieve name disambiguation. Experimental results show that the proposed approach disambiguates names better than the current state-of-the-art approaches, and that it can also disambiguate single-author cases, i.e. authors who share a name but have no co-authorship.

Key words: Connection strength, Data quality, Effective path, Entity resolution, Name disambiguation

CLC Number: TP311
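
The pipeline summarized in the abstract (typed relationship graph, breadth-first path search, connection strength, threshold test) can be illustrated with a minimal Python sketch. The abstract does not give the paper's definition of an effective path, its strength formula, its edge weights, or its threshold, so everything specific here is an assumption made for illustration: an "effective path" is taken to be a simple path of at most `max_len` edges, the strength is the sum over paths of the product of hypothetical edge-type weights divided by path length, and the names `EDGE_WEIGHT`, `build_graph`, `effective_paths`, `connection_strength`, and `same_person` are invented for this example.

```python
from collections import defaultdict, deque

# Hypothetical edge-type weights; the abstract does not state any values.
EDGE_WEIGHT = {"coauthor": 1.0, "affiliation": 0.5}


def build_graph(edges):
    """Build an undirected typed graph from (node_a, node_b, edge_type) triples."""
    graph = defaultdict(list)
    for a, b, kind in edges:
        graph[a].append((b, kind))
        graph[b].append((a, kind))
    return graph


def effective_paths(graph, source, target, max_len=3):
    """Enumerate paths from source to target breadth-first.

    Assumption: a path is 'effective' if it repeats no node and has at most
    max_len edges. Each returned path is the list of edge types along it.
    """
    paths = []
    queue = deque([(source, (source,), [])])
    while queue:
        node, visited, kinds = queue.popleft()
        if len(kinds) == max_len:
            continue  # cannot extend this partial path any further
        for neighbor, kind in graph.get(node, []):
            if neighbor in visited:
                continue
            if neighbor == target:
                paths.append(kinds + [kind])
            else:
                queue.append((neighbor, visited + (neighbor,), kinds + [kind]))
    return paths


def connection_strength(paths):
    """Assumed formula: more paths, shorter paths, and stronger edge types
    all increase the strength."""
    total = 0.0
    for kinds in paths:
        weight = 1.0
        for kind in kinds:
            weight *= EDGE_WEIGHT[kind]
        total += weight / len(kinds)
    return total


def same_person(graph, a, b, threshold=0.1, max_len=3):
    """Decide whether two same-name author mentions refer to one person.
    The threshold value is hypothetical."""
    return connection_strength(effective_paths(graph, a, b, max_len)) >= threshold


# Toy data: four mentions of the same name, one shared co-author, two institutions.
graph = build_graph([
    ("Wang#1", "Li", "coauthor"),
    ("Wang#2", "Li", "coauthor"),
    ("Wang#1", "UnivA", "affiliation"),
    ("Wang#3", "UnivA", "affiliation"),
    ("Wang#4", "UnivB", "affiliation"),
])
print(same_person(graph, "Wang#1", "Wang#2"))  # True: linked through co-author Li
print(same_person(graph, "Wang#1", "Wang#3"))  # True: linked only through UnivA
print(same_person(graph, "Wang#1", "Wang#4"))  # False: no connecting path
```

In this toy graph, `Wang#3` has no co-authors and is matched to `Wang#1` only through the shared institution, which is the kind of single-author case the abstract says the approach can handle. Whether such a weak link should count depends entirely on the assumed weights and threshold.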