基于数据摘要奇偶性的集合相似性近似算法

doi:10.11896/j.issn.1002-137X.2016.06.050

计算机科学 ›› 2016, Vol. 43 ›› Issue (6): 254-256.doi: 10.11896/j.issn.1002-137X.2016.06.050

基于数据摘要奇偶性的集合相似性近似算法

贾建伟,陈崚

扬州大学信息工程学院扬州225002,扬州大学信息工程学院扬州225002

出版日期:2018-12-01 发布日期:2018-12-01
基金资助:
本文受国家自然科学基金面上项目(61070240)资助

Set Similarity Approximation Algorithm Based on Parity of Data Sketch

JIA Jian-wei and CHEN Ling

Online:2018-12-01 Published:2018-12-01

摘要/Abstract

摘要： 在应用b位哈希函数近似计算两个集合的Jaccard相似性时,如果有多个元素与输入元素的Jaccard相似性都很高(接近于1),那么b位哈希函数不能对这些元素进行很好的区分。为了提高数据摘要函数的准确性并提高基于相似性的应用的性能,提出了一种基于数据摘要奇偶性的集合相似性近似算法。在应用minwise哈希函数得到两个变异集合后,用两个n位指示向量来表示变异集合中的元素在指示向量中出现的奇偶性,并基于这两个奇偶性向量来估计原集合间的Jaccard相似性。通过马尔科夫链和泊松分布两种模型对奇偶性数据摘要进行了推导,并证明了这两种方法的等价性。Enron数据集上的实验表明,提出的奇偶性数据摘要算法与传统的b位哈希函数相比具有更高的准确性,并且在重复文档检测和关联规则挖掘两种应用中具有更高的性能。

关键词: 数据摘要,集合相似性,奇偶性,近似算法

Abstract: Jaccard similarity is one of the most important methods in set similarity computation.When approximately computing the Jaccard similarity of two sets using the b-bits hash function,if there are multiple elements being similar to the input element with similarity up to 1,the b-bits hash function can’t differentiate these elements very well.In order to improve the accuracy of data sketch and the application performance based on set similarity,this paper proposed a set similarity approximation algorithm based on parity of data sketch.After getting the two permutation sets with minwise hash function,we used two n-bits indicator vectors to represent the parity of elements in the permutation set appearing in indicator vectors,and estimated the Jaccard similarity of original sets based on these two parity vectors.We inferred the parity sketch based on both Markov chain and Poisson distribution models,and verified their equivalence.Experiments on Enron dataset show that the proposed parity sketch is more accurate than the b-bits hash function,and performs much better in both applications of duplicate document detection and associate rule mining.

Key words: Data sketch,Set similarity,Parity,Approximation algorithm

贾建伟,陈崚. 基于数据摘要奇偶性的集合相似性近似算法[J]. 计算机科学, 2016, 43(6): 254-256. https://doi.org/10.11896/j.issn.1002-137X.2016.06.050

JIA Jian-wei and CHEN Ling. Set Similarity Approximation Algorithm Based on Parity of Data Sketch[J]. Computer Science, 2016, 43(6): 254-256. https://doi.org/10.11896/j.issn.1002-137X.2016.06.050

参考文献

[1] Arasu A,Ganti V,Kaushik R.Efficient exact set-similarity joins[C]∥Proceedings of the 32nd International Conference on Very Large Data Bases.VLDB Endowment,2006:918-929
[2] Xiao C,Wang W,Lin X,et al.Efficient similarity joins for near-duplicate detection[J].ACM Transactions on Database Systems (TODS),2011,36(3):15-20
[3] Manku G S,Jain A,Das Sarma A.Detecting near-duplicates for Web crawling[C]∥Proceedings of the 16th International Confe-rence on World Wide Web.ACM,2007:141-150
[4] Zhao Qin-qin,Lu Kai,Wang Bin.SPCF:A Memory Based Collaborative Filtering Algorithm via Propagation[J].Chinese Journal of Computers,2013,6(3):671-676(in Chinese) 赵琴琴,鲁凯,王斌.SPCF:一种基于内存的传播式协同过滤推荐算法[J].计算机学报,2013,36(3):671-676
[5] Koren Y,Bell R.Advances in collaborative filtering[M].Recommender systems handbook.Springer US,2011:145-186
[6] Li Shu-kui.Research on Time series similarity problem[D].Wuhan:Huazhong University of Science and Technology,2008(in Chinese) 李俊奎.时间序列相似性问题研究 [D].武汉:华中科技大学,2008
[7] Feng Yu-cai,Feng Jian-lin.Incremental Updating Algorithmsfor Mining Association Rules[J].Journal of Software,1998(4):301-306(in Chinese) 冯玉才,冯剑琳.关联规则的增量式更新算法[J].软件学报,1998,9(4):301-306
[8] Stein B.Principles of hash-based text retrieval[C]∥Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.ACM,2007:527-534
[9] 凌康.基于位置敏感哈希的相似性搜索技术研究[D].南京:南京大学,2012
[10] Indyk P.A small approximately min-wise independent family of hash functions[J].Journal of Algorithms,2011,38(1):84-90
[11] Broder A Z,Charikar M,Frieze A M,et al.Min-wise indepen-dent permutations[J].Journal of Computer and System Sciences,2010,60(3):630-659
[12] Li P,Knig C.b-Bit minwise hashing[C]∥Proceedings of the 19th International Conference on World Wide Web.ACM,2010:671-680
[13] Schuster E F,Philippou A N.The odds in some odd-even games[J].American Mathematical Monthly,1975,82(6):646-648
[14] Shetty J,Adibi J.The Enron email dataset database schema and brief statistical report[J].Information Sciences Institute Technical Report,University of Southern California,2004,4(4):210-215
[15] Yu Xiao-sheng,Hu Sun-zhi.Research on Eliminating Duplicate Records Based on SNM Improved Algorithm[J].Journal of Chongqing University of Technology(Natural Science),2016,0(4):91-96(in Chinese) 余肖生,胡孙枝.基于SNM改进算法的相似重复记录消除[J].重庆理工大学学报(自然科学),2016,0(4):91-96

Metrics

Viewed

Full text

Abstract

Cited

Shared

Discussed

基于数据摘要奇偶性的集合相似性近似算法

Set Similarity Approximation Algorithm Based on Parity of Data Sketch

PDF (PC)

摘要/Abstract

引用本文

使用本文

参考文献

相关文章 0

Metrics

本文评价

推荐阅读 0