Survey on Entity Resolution

Abstract: Entity resolution (ER) is one of the central issues in data integration and information retrieval. Its purpose is to find all real-world entities in a given dataset and to cluster the references that refer to the same entity. The ER process is divided into three steps: blocking, record comparison, and matching decision. Blocking methods are summarized according to how records are grouped together, record pair comparison methods are surveyed according to the granularity at which strings are compared, and decision models are introduced according to how records are associated with each other. Finally, future research issues are discussed.

Key words: Entity resolution, Blocking, Similarity, Decision model

TAN Ming-chao, DIAO Xing-chun and CAO Jian-jun. Survey on Entity Resolution[J]. Computer Science, 2014, 41(4): 9-12.
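
The three steps named in the abstract (blocking, record comparison, matching decision) can be illustrated with a minimal sketch. It is not a method taken from the survey. The first-letter blocking key, the `difflib` character similarity, the 0.8 threshold, the sample names, and all function names (`block`, `similarity`, `resolve`) are assumptions chosen only to keep the example short.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def block(records, key):
    """Blocking: group record ids by a cheap key so that only
    records sharing a block are compared with each other."""
    blocks = defaultdict(list)
    for rid, rec in records.items():
        blocks[key(rec)].append(rid)
    return blocks


def similarity(a, b):
    """Record comparison: character-level similarity in [0, 1].
    difflib is a stand-in for edit-distance or token-based measures."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def resolve(records, key, threshold=0.8):
    """Matching decision: link pairs whose similarity reaches the
    (assumed) threshold, then cluster linked records with union-find."""
    parent = {rid: rid for rid in records}

    def find(x):
        # Follow parent pointers to the cluster root, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for ids in block(records, key).values():
        for a, b in combinations(ids, 2):
            if similarity(records[a], records[b]) >= threshold:
                parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for rid in records:
        clusters[find(rid)].add(rid)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical sample data: ids mapped to name strings.
records = {
    1: "Peter Christen",
    2: "Peter Christan",
    3: "William Winkler",
    4: "Wiliam Winkler",
    5: "Paul Smith",
}
clusters = resolve(records, key=lambda name: name[0].upper())
print(clusters)
```

With this sample data, records 1 and 2 fall into one cluster, records 3 and 4 into another, and record 5 stays alone. A first-letter key is deliberately naive: two references to the same entity that differ in their first character never share a block and so are never compared, which is the trade-off between efficiency and recall that blocking methods have to manage.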
