计算机科学 ›› 2016, Vol. 43 ›› Issue (6): 254-256.doi: 10.11896/j.issn.1002-137X.2016.06.050
贾建伟,陈崚
JIA Jian-wei and CHEN Ling
摘要: 在应用b位哈希函数近似计算两个集合的Jaccard相似性时,如果有多个元素与输入元素的Jaccard相似性都很高(接近于1),那么b位哈希函数不能对这些元素进行很好的区分。为了提高数据摘要函数的准确性并提高基于相似性的应用的性能,提出了一种基于数据摘要奇偶性的集合相似性近似算法。在应用minwise哈希函数得到两个变异集合后,用两个n位指示向量来表示变异集合中的元素在指示向量中出现的奇偶性,并基于这两个奇偶性向量来估计原集合间的Jaccard相似性。通过马尔科夫链和泊松分布两种模型对奇偶性数据摘要进行了推导,并证明了这两种方法的等价性。Enron数据集上的实验表明,提出的奇偶性数据摘要算法与传统的b位哈希函数相比具有更高的准确性,并且在重复文档检测和关联规则挖掘两种应用中具有更高的性能。
[1] Arasu A,Ganti V,Kaushik R.Efficient exact set-similarity joins[C]∥Proceedings of the 32nd International Conference on Very Large Data Bases.VLDB Endowment,2006:918-929 [2] Xiao C,Wang W,Lin X,et al.Efficient similarity joins for near-duplicate detection[J].ACM Transactions on Database Systems (TODS),2011,36(3):15-20 [3] Manku G S,Jain A,Das Sarma A.Detecting near-duplicates for Web crawling[C]∥Proceedings of the 16th International Confe-rence on World Wide Web.ACM,2007:141-150 [4] Zhao Qin-qin,Lu Kai,Wang Bin.SPCF:A Memory Based Collaborative Filtering Algorithm via Propagation[J].Chinese Journal of Computers,2013,6(3):671-676(in Chinese) 赵琴琴,鲁凯,王斌.SPCF:一种基于内存的传播式协同过滤推荐算法[J].计算机学报,2013,36(3):671-676 [5] Koren Y,Bell R.Advances in collaborative filtering[M].Recommender systems handbook.Springer US,2011:145-186 [6] Li Shu-kui.Research on Time series similarity problem[D].Wuhan:Huazhong University of Science and Technology,2008(in Chinese) 李俊奎.时间序列相似性问题研究 [D].武汉:华中科技大学,2008 [7] Feng Yu-cai,Feng Jian-lin.Incremental Updating Algorithmsfor Mining Association Rules[J].Journal of Software,1998(4):301-306(in Chinese) 冯玉才,冯剑琳.关联规则的增量式更新算法[J].软件学报,1998,9(4):301-306 [8] Stein B.Principles of hash-based text retrieval[C]∥Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.ACM,2007:527-534 [9] 凌康.基于位置敏感哈希的相似性搜索技术研究[D].南京:南京大学,2012 [10] Indyk P.A small approximately min-wise independent family of hash functions[J].Journal of Algorithms,2011,38(1):84-90 [11] Broder A Z,Charikar M,Frieze A M,et al.Min-wise indepen-dent permutations[J].Journal of Computer and System Sciences,2010,60(3):630-659 [12] Li P,Knig C.b-Bit minwise hashing[C]∥Proceedings of the 19th International Conference on World Wide Web.ACM,2010:671-680 [13] Schuster E F,Philippou A N.The odds in some odd-even games[J].American Mathematical Monthly,1975,82(6):646-648 [14] Shetty J,Adibi J.The Enron email dataset database schema and brief statistical report[J].Information Sciences Institute Technical Report,University of Southern California,2004,4(4):210-215 [15] Yu Xiao-sheng,Hu Sun-zhi.Research on Eliminating Duplicate Records Based on SNM Improved Algorithm[J].Journal of Chongqing University of Technology(Natural Science),2016,0(4):91-96(in Chinese) 余肖生,胡孙枝.基于SNM改进算法的相似重复记录消除[J].重庆理工大学学报(自然科学),2016,0(4):91-96 |
No related articles found! |
|