Computer Science ›› 2016, Vol. 43 ›› Issue (6): 254-256.doi: 10.11896/j.issn.1002-137X.2016.06.050

Previous Articles     Next Articles

Set Similarity Approximation Algorithm Based on Parity of Data Sketch

JIA Jian-wei and CHEN Ling   

  • Online:2018-12-01 Published:2018-12-01

Abstract: Jaccard similarity is one of the most important methods in set similarity computation.When approximately computing the Jaccard similarity of two sets using the b-bits hash function,if there are multiple elements being similar to the input element with similarity up to 1,the b-bits hash function can’t differentiate these elements very well.In order to improve the accuracy of data sketch and the application performance based on set similarity,this paper proposed a set similarity approximation algorithm based on parity of data sketch.After getting the two permutation sets with minwise hash function,we used two n-bits indicator vectors to represent the parity of elements in the permutation set appearing in indicator vectors,and estimated the Jaccard similarity of original sets based on these two parity vectors.We inferred the parity sketch based on both Markov chain and Poisson distribution models,and verified their equivalence.Experiments on Enron dataset show that the proposed parity sketch is more accurate than the b-bits hash function,and performs much better in both applications of duplicate document detection and associate rule mining.

Key words: Data sketch,Set similarity,Parity,Approximation algorithm

[1] Arasu A,Ganti V,Kaushik R.Efficient exact set-similarity joins[C]∥Proceedings of the 32nd International Conference on Very Large Data Bases.VLDB Endowment,2006:918-929
[2] Xiao C,Wang W,Lin X,et al.Efficient similarity joins for near-duplicate detection[J].ACM Transactions on Database Systems (TODS),2011,36(3):15-20
[3] Manku G S,Jain A,Das Sarma A.Detecting near-duplicates for Web crawling[C]∥Proceedings of the 16th International Confe-rence on World Wide Web.ACM,2007:141-150
[4] Zhao Qin-qin,Lu Kai,Wang Bin.SPCF:A Memory Based Collaborative Filtering Algorithm via Propagation[J].Chinese Journal of Computers,2013,6(3):671-676(in Chinese) 赵琴琴,鲁凯,王斌.SPCF:一种基于内存的传播式协同过滤推荐算法[J].计算机学报,2013,36(3):671-676
[5] Koren Y,Bell R.Advances in collaborative filtering[M].Recommender systems handbook.Springer US,2011:145-186
[6] Li Shu-kui.Research on Time series similarity problem[D].Wuhan:Huazhong University of Science and Technology,2008(in Chinese) 李俊奎.时间序列相似性问题研究 [D].武汉:华中科技大学,2008
[7] Feng Yu-cai,Feng Jian-lin.Incremental Updating Algorithmsfor Mining Association Rules[J].Journal of Software,1998(4):301-306(in Chinese) 冯玉才,冯剑琳.关联规则的增量式更新算法[J].软件学报,1998,9(4):301-306
[8] Stein B.Principles of hash-based text retrieval[C]∥Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.ACM,2007:527-534
[9] 凌康.基于位置敏感哈希的相似性搜索技术研究[D].南京:南京大学,2012
[10] Indyk P.A small approximately min-wise independent family of hash functions[J].Journal of Algorithms,2011,38(1):84-90
[11] Broder A Z,Charikar M,Frieze A M,et al.Min-wise indepen-dent permutations[J].Journal of Computer and System Sciences,2010,60(3):630-659
[12] Li P,Knig C.b-Bit minwise hashing[C]∥Proceedings of the 19th International Conference on World Wide Web.ACM,2010:671-680
[13] Schuster E F,Philippou A N.The odds in some odd-even games[J].American Mathematical Monthly,1975,82(6):646-648
[14] Shetty J,Adibi J.The Enron email dataset database schema and brief statistical report[J].Information Sciences Institute Technical Report,University of Southern California,2004,4(4):210-215
[15] Yu Xiao-sheng,Hu Sun-zhi.Research on Eliminating Duplicate Records Based on SNM Improved Algorithm[J].Journal of Chongqing University of Technology(Natural Science),2016,0(4):91-96(in Chinese) 余肖生,胡孙枝.基于SNM改进算法的相似重复记录消除[J].重庆理工大学学报(自然科学),2016,0(4):91-96

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!