Computer Science, 2014, Vol. 41, Issue (4): 9-12.

• Survey •

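Survey on Entity Resolution

TAN Ming-chao, DIAO Xing-chun and CAO Jian-jun

  1. College of Command Information Systems, PLA University of Science and Technology, Nanjing 210007; College of Command Information Systems, PLA University of Science and Technology, Nanjing 210007; The 63rd Research Institute of the PLA General Staff Headquarters, Nanjing 210007
  • Online: 2018-11-14 Published: 2018-11-14
  • Supported by:
    the National Natural Science Foundation of China (61070714) and the Pre-research Foundation of PLA University of Science and Technology (20110604)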

Abstract: Entity resolution (ER) is one of the central issues in data integration and information retrieval. Its goal is to identify the distinct real-world entities in a dataset and to group the different descriptions that refer to the same entity. This survey divides the ER process into three main steps: blocking, record comparison and matching decision. Blocking methods are summarized according to how records are grouped together; record comparison methods are analyzed according to the granularity at which strings are split; and decision models are described according to how records are associated with one another. Finally, open problems for future research on ER are discussed.

Key words: Entity resolution, Blocking, Similarity, Decision model
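
The three steps named in the abstract (blocking, record comparison, matching decision) can be illustrated with a minimal Python sketch. It is not taken from the survey: the function names, the name-prefix blocking key, the use of `difflib.SequenceMatcher` as the similarity measure, and the 0.8 threshold are all assumptions chosen only to keep the example small.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def block(records, key_len=1):
    """Blocking: group record ids by a cheap key (assumed here: a name prefix)."""
    blocks = defaultdict(list)
    for rid, rec in records.items():
        blocks[rec["name"].strip().lower()[:key_len]].append(rid)
    return blocks


def similarity(a, b):
    """Record comparison: character-level similarity of two strings, in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def resolve(records, threshold=0.8):
    """Matching decision: a pair is a match when its similarity reaches the threshold."""
    matches = []
    # Only pairs that share a block are compared, which avoids the full
    # quadratic comparison over all records.
    for ids in block(records).values():
        for i, j in combinations(sorted(ids), 2):
            if similarity(records[i]["name"], records[j]["name"]) >= threshold:
                matches.append((i, j))
    return sorted(matches)


# Hypothetical toy data: records 1 and 2 describe the same person.
records = {
    1: {"name": "Jonathan Smith"},
    2: {"name": "Jonathon Smith"},
    3: {"name": "Mary Jones"},
    4: {"name": "John Doe"},
}
print(resolve(records))  # [(1, 2)]
```

A fixed threshold on a single string similarity is the simplest possible decision rule; the probabilistic, rule-based and learning-based decision models that the survey covers would replace the comparison inside `resolve`, while the blocking and comparison steps keep the same shape.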

