Expectation Maximization Based Name Deduplicating Algorithm in Spatial Records

doi:10.11896/j.issn.1002-137X.2016.03.043

Abstract

In check-in records with corresponding locations, each record contains only a name and a location, i.e., longitude and latitude. Traditional name deduplicating algorithms work by matching attributes between two entities or by computing the similarity between the names of the two entities, and thus neglect the particularity of locations. To improve the quality of name deduplication in spatial records, this paper proposes an expectation maximization based name deduplicating algorithm. First, we propose a text name model containing core words and background words, and give an expectation maximization algorithm for estimating the parameters of the model. Second, we introduce location into the text name model: we partition the whole world into tiles, compute the distributions of core and background words in each tile, and obtain a text name model that includes location. Finally, we use the location text name model to deduplicate names in location records, and present the corresponding name deduplicating algorithm. The experiments show that the proposed algorithm recognizes the core word in a name better than related works, and thus performs better when deduplicating names in location records.

Key words: Check-in, Location, Expectation maximization, Name deduplicating
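
The abstract gives no formulas, so the following Python sketch is only an illustration of the general idea, not the authors' model. It assumes a deliberately simplified generative story: every name has exactly one core word drawn from a core distribution, and all its other words are drawn from a background distribution. Tiles are assumed to be fixed-size latitude/longitude grid cells, and one model is fitted per tile. All function names, the cell size `cell_deg`, the smoothing constant `alpha`, and the rule "same tile and same core word means duplicate" are hypothetical choices made for this example.

```python
import math
from collections import defaultdict


def tile_of(lat, lon, cell_deg=0.01):
    """Map a coordinate to a grid-cell key (assumed fixed-size tiling)."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def responsibilities(name, p_core, p_bg):
    """E-step: posterior probability that each position holds the core word.

    Under the one-core-word assumption the posterior for position i is
    proportional to p_core(w_i) / p_bg(w_i), because the background
    probabilities of all other words cancel up to that ratio.
    """
    scores = [p_core[w] / p_bg[w] for w in name]
    total = sum(scores)
    return [s / total for s in scores]


def normalize(counts, vocab, alpha):
    """M-step helper: smoothed relative frequencies over the vocabulary."""
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def fit_core_background(names, iters=20, alpha=0.01):
    """Run EM on tokenized names; return (p_core, p_bg) as dicts."""
    vocab = sorted({w for name in names for w in name})
    p_core = {w: 1.0 / len(vocab) for w in vocab}
    p_bg = dict(p_core)
    for _ in range(iters):
        core_cnt = defaultdict(float)
        bg_cnt = defaultdict(float)
        for name in names:
            for w, r in zip(name, responsibilities(name, p_core, p_bg)):
                core_cnt[w] += r        # expected count as core word
                bg_cnt[w] += 1.0 - r    # expected count as background word
        p_core = normalize(core_cnt, vocab, alpha)
        p_bg = normalize(bg_cnt, vocab, alpha)
    return p_core, p_bg


def core_word(name, p_core, p_bg):
    """Return the word with the highest core-word posterior."""
    resp = responsibilities(name, p_core, p_bg)
    return name[max(range(len(name)), key=resp.__getitem__)]


def dedup_keys(records, cell_deg=0.01):
    """Assign each (name, lat, lon) record the key (tile, core word).

    Records sharing a key are treated as duplicates in this toy example.
    """
    by_tile = defaultdict(list)
    for name, lat, lon in records:
        by_tile[tile_of(lat, lon, cell_deg)].append(name.lower().split())
    models = {t: fit_core_background(ns) for t, ns in by_tile.items()}
    keys = []
    for name, lat, lon in records:
        t = tile_of(lat, lon, cell_deg)
        keys.append((t, core_word(name.lower().split(), *models[t])))
    return keys
```

Calling `dedup_keys` on a list of `(name, latitude, longitude)` tuples returns one key per record, and records with equal keys can then be merged. In this sketch a word that occurs in many different names within a tile (such as a generic category word) drifts toward the background distribution over the EM iterations, so the remaining, more distinctive word is picked as the core word. Whitespace tokenization is used only for brevity and would not be adequate for Chinese names.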

SUN Xiao-ling, ZHENG Mian, LI Wei-qin and LUO En-tao. Expectation Maximization Based Name Deduplicating Algorithm in Spatial Records[J]. Computer Science, 2016, 43(3): 238-241.

