Expectation Maximization Based Name Deduplicating Algorithm in Spatial Records

doi:10.11896/j.issn.1002-137X.2016.03.043

Abstract

In check-in records that carry location information, each record contains only two attributes: a name and a location (longitude and latitude). Traditional name deduplication algorithms deduplicate by matching entity attribute values or by computing the similarity between entity names, and so ignore the particular nature of location information. To improve the quality of name deduplication in location records, this paper proposes an expectation-maximization-based location name deduplication algorithm. First, a text name model containing core words and background words is proposed, together with an expectation maximization algorithm for estimating the model parameters. Second, location information is introduced into the text name model: the whole map is partitioned into grid cells (tiles), the distributions of core words and background words are computed in each cell, and a location-aware text name model is proposed. Finally, the location-aware text name model is applied to name deduplication in location records, and the corresponding deduplication algorithm is given. Experiments show that, compared with traditional name deduplication models, the proposed model better identifies the core words contained in a name and therefore performs better at name deduplication.

Key words: Check-in, Location, Expectation maximization, Name deduplication
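
The abstract does not give the model's equations, so the following Python sketch is only one plausible reading of the three steps, not the authors' algorithm. It assumes that names are tokenized on whitespace, that the background word distribution is the fixed corpus-wide word frequency, that each tile has its own core word distribution estimated by EM with a fixed mixing weight `lam`, and that two names in a tile are compared by a Jaccard score weighted by each word's posterior probability of being a core word. The function names (`tile_of`, `core_posteriors`, `name_similarity`), the tile size of 0.01 degrees, and the sample records are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def tile_of(lat, lon, size_deg=0.01):
    """Map a coordinate to a square grid cell; size_deg is an assumed tile size."""
    return (math.floor(lat / size_deg), math.floor(lon / size_deg))


def core_posteriors(records, lam=0.5, iters=50, size_deg=0.01):
    """Run EM per tile for the mixture p(w) = lam * background(w) + (1 - lam) * core_tile(w).

    records: list of (name, lat, lon).
    Returns {tile: {word: P(word was drawn from the core distribution)}}.
    """
    tile_counts = defaultdict(Counter)
    for name, lat, lon in records:
        tile_counts[tile_of(lat, lon, size_deg)].update(name.lower().split())

    # Assumption: the background distribution is the fixed corpus-wide word frequency.
    total = Counter()
    for counts in tile_counts.values():
        total.update(counts)
    n = sum(total.values())
    background = {w: k / n for w, k in total.items()}

    result = {}
    for tile, counts in tile_counts.items():
        core = {w: 1.0 / len(counts) for w in counts}  # uniform initialization
        z = {}
        for _ in range(iters):
            # E-step: posterior that each word occurrence came from the core component.
            z = {
                w: (1 - lam) * core[w] / ((1 - lam) * core[w] + lam * background[w])
                for w in counts
            }
            # M-step: re-estimate the tile's core distribution from expected counts.
            norm = sum(counts[w] * z[w] for w in counts)
            core = {w: counts[w] * z[w] / norm for w in counts}
        result[tile] = z
    return result


def name_similarity(a, b, z):
    """Jaccard similarity over words, each word weighted by its core posterior."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = sum(z.get(w, 0.0) for w in wa | wb)
    return sum(z.get(w, 0.0) for w in wa & wb) / union if union else 0.0


# Hypothetical check-in records: (name, latitude, longitude).
records = [
    ("Blue Bottle Coffee", 31.2304, 121.4737),
    ("Blue Bottle", 31.2305, 121.4738),
    ("Star Coffee", 31.2306, 121.4736),
    ("Joes Coffee", 39.9042, 116.4074),
    ("City Coffee", 39.9043, 116.4075),
    ("Corner Coffee", 39.9044, 116.4076),
]
z = core_posteriors(records)[tile_of(31.2304, 121.4737)]
print(name_similarity("Blue Bottle Coffee", "Blue Bottle", z))
print(name_similarity("Blue Bottle Coffee", "Star Coffee", z))
```

In this toy data "coffee" appears in every tile, so it is explained mostly by the background component, while "blue" and "bottle" are concentrated in one tile and receive a high core posterior. The first pair therefore scores well above the second, and a threshold on `name_similarity` could then flag the first pair as duplicates. A real system would need a tokenizer suited to the language of the names and a tuned tile size and threshold.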

SUN Xiao-ling (孙晓玲), ZHENG Mian (郑勉), LI Wei-qin (李伟勤), LUO En-tao (罗恩韬). Expectation Maximization Based Name Deduplicating Algorithm in Spatial Records[J]. Computer Science (计算机科学), 2016, 43(3): 238-241. https://doi.org/10.11896/j.issn.1002-137X.2016.03.043
