计算机科学 ›› 2024, Vol. 51 ›› Issue (1): 35-40.doi: 10.11896/jsjkx.231000193

• 创刊五十周年特别专题 • 上一篇    下一篇

过滤器数据结构研究综述

王瀚橙, 戴海鹏, 陈树森, 陈志鹏, 陈贵海   

  1. 计算机软件新技术国家重点实验室(南京大学) 南京210023
  • 收稿日期:2023-10-26 修回日期:2023-11-23 出版日期:2024-01-15 发布日期:2024-01-12
  • 通讯作者: 戴海鹏(haipengdai@nju.edu.cn)
  • 作者简介:(hanchengwang@smail.nju.edu.cn)
  • 基金资助:
    国家自然科学基金(62272223)

Filter Data Structures:A Survey

WANG Hancheng, DAI Haipeng, CHEN Shusen, CHEN Zhipeng, CHEN Guihai   

  1. State Key Laboratory for Novel Software Technology,Nanjing University,Nanjing 210023,China
  • Received:2023-10-26 Revised:2023-11-23 Online:2024-01-15 Published:2024-01-12
  • About author:WANG Hancheng,born in 1998,postgraduate,is a student member of CCF(No.F2622G).His main research in-terests include data indexing and query optimization.
    DAI Haipeng,born in 1985,Ph.D,associate professor,Ph.D supervisor,is a distinguished member of CCF(No.19625S).His main research interests include data mining and mobile computing.
  • Supported by:
    National Natural Science Foundation of China(62272223).

摘要: 过滤器数据结构可以近似地判断某个元素是否属于给定集合。典型的过滤器数据结构,如布隆过滤器、布谷鸟过滤器、商过滤器,以牺牲查询准确性为代价换取更低的内存空间消耗和查询时间开销。因此,得益于空间时间高效性,过滤器数据结构现已被广泛应用于计算机网络、物联网、数据库系统、文件系统、生物信息学、机器学习等领域的近似成员资格查询操作中。自20世纪70年代以来,过滤器数据结构受到了广泛的研究,在诸多领域取得了重要的进展,其研究思路也在不断变化。文中整理了近五十年来关于过滤器数据结构的经典研究成果,从过滤器数据结构的原理出发对已有工作进行分类总结,并比较不同工作之间的引证关系和改进思路,最后讨论了过滤器数据结构的未来研究方向。

关键词: 过滤器, 近似成员资格查询, 概率数据结构, 布隆过滤器, 布谷鸟过滤器, 商过滤器

Abstract: Filter data structures can approximately determine whether an element exists in a given set.Typical filter data structures,such as Bloom filters,cuckoo filters,and quotient filters,sacrifice query accuracy for lower memory space consumption and lower query time overhead.Due to their spatial and temporal efficiency,filter data structures are now widely used in approximate membership query operations in computer networks,the Internet of Things,database systems,file systems,bioinformatics,machine learning,and other fields.Since the 1970s,filters have been extensively studied.Their research ideas are constantly changing.This paper compiles the classic studies on filter data structures in the past fifty years,summarizes existing studies based on the mechanism of filter data structures and analyze the relationship between different studies.Finally,future research directions in filter data structures are discussed.

Key words: Filter, Approximate membership query, Probabilistic data structure, Bloom filter, Cuckoo filter, Quotient filter

中图分类号: 

  • TP391
[1]YU M,FABRIKANT A,REXFORD J.BUFFALO:Bloom filter forwarding architecture for large organizations[C]//Proceedings of International Conference on Emerging Networking Experiments and Technologies.2009:313-324.
[2]ATHANASSOULIS M,AILAMAKI A.BF-Tree:Approximate tree indexing[J].Proceedings of the VLDB Endowment,2014,7(14):1881-1892.
[3]LI P,LUO B,ZHU W,et al.Cluster-based distributed dynamic cuckoo filter system forRedis[J].International Journal of Parallel,Emergent and Distributed Systems,2020,35(3):340-353.
[4]WANG F,CHEN H,LIAO L,et al.The power of betterchoice:Reducing relocations in cuckoo filter[C]//Proceedings of International Conference on Distributed Computing Systems.2019:358-367.
[5]GU R,LI S,DAI H,et al.Adaptive online cache capacity optimi-zation via lightweight working set size estimation atscale[C]//Proceedings of USENIX Annual Technical Conference.2023:467-484.
[6]DAI H,LI M,LIU A X,et al.Finding persistent items in distributed datasets[J].IEEE/ACM Transactions on Networking,2020,28(1):1-14.
[7]DAI H,SHAHZAD M,LIU A X,et al.Identifying and estimating persistent items in data streams[J].IEEE/ACM Transactions on Networking,2018,26(6):2429-2442.
[8]XIA R,DAI H,DU Z,et al.Mining robust frequent items in data streams[C]//Proceedings of International Conference on Joint Cloud Computing.2020:110-117.
[9]DAI H,LI M,WANG H,et al.Distantly supervised entity lin-king with selection consistency constraint[C]//Proceedings of International Conference on Database Systems for Advanced Applications.2023:784-799.
[10]LIU J,DAI H,XIA R,et al.DUET:A generic framework for finding special quadratic elements in data streams[C]//Procee-dings of ACM Web Conference.2022:2989-2997.
[11]YU S,LI X,WANG H,et al.C_CART:An instance confidence-based decision tree algorithm for classification[J].Intelligent Data Analysis,2021,25(4):929-948.
[12]LI M,DAI H,WANG X,et al.Thresholded monitoring in distributed data streams[C]//Proceedings of International Confe-rence on Distributed Computing Systems.2019:218-227.
[13]YU S,LI X,WANG H,et al.BIDI:A classification algorithm with instance difficulty invariance[J].Expert Systems with Applications,2021,165(1):1-13.
[14]LI M,DAI H,WANG X,et al.Thresholded monitoring in distributed data streams[J].IEEE/ACM Transactions on Networking,2020,28(3):1033-1046.
[15]DAI H,SHAHZAD M,LIU A X,et al.Finding persistent items in data streams[J].Proceedings of the VLDB Endowment,2016,10(4):289-300.
[16]DAI M,XU S,SHAO S,et al.Blockchain-based reliable fog-cloud service solution for IIoT[J].Chinese Journal of Electro-nics,2021,30(2):359-366.
[17]DAI H,LI M,LIU A X,et al.Finding persistent items in distributed datasets[C]//Proceedings of International Conference on Computer Communications.2018:1403-1411.
[18]WANG H,DAI H,LI M,et al.Bamboo filters:Make resizingsmooth[C]//Proceedings of International Conference on Data Engineering.2022:979-991.
[19]LI M,CHEN D,DAI H,et al.Seesaw counting filter:An efficient guardian for vulnerable negative keys during dynamic filtering[C]//Proceedings of the ACM Web Conference.2022:2759-2767.
[20]LI M,XIE R,CHEN D,et al.A pareto optimal bloom filter fa-mily with hash adaptivity[J].The VLDB Journal,2023,32(3):525-548.
[21]WANG H,DAI H,GU R,et al.Wormhole filters:Caching your hash on persistent memory[C]//Proceedings of European Conference on Computer Systems.2024:1-16.
[22]LI Y,WANG Z,YANG R,et al.Learned bloom filter for multi-key membership testing[C]//Proceedings of the International Conference on Database Systems for Advanced Applications.2023:62-79.
[23]DAI H,YU J,LI M,et al.Bloom filter with noisy coding framework for multi-set membership testing[J].IEEE Transactions on Knowledge and Data Engineering,2022,35(7):6710-6724.
[24]LI L,WU J,HU H,et al.Secure cloud architecture for 5G core network[J].Chinese Journal of Electronics,2021,30(3):516-522.
[25]DAI H,ZHONG Y,LIU A X,et al.Noisy bloom filters formulti-set membership testing[C]//Proceedings of International Conference on Measurement and Modeling of Computer Science.2016:139-151.
[26]LI P,YU X,XU H,et al.Secure localization technology based on dynamic trust management in wireless sensor networks[J].Chinese Journal of Electronics,2021,30(4):759-768.
[27]BLOOM B H.Space/time tradeoffs in hash coding with allowable errors[J].Communications of the ACM,1970,13(7):422-426.
[28]FAN L,CAO P,ALMEIDA J M,et al.Summary cache:A scalable wide area web cache sharing protocol[J].IEEE/ACM Transactions on Networking,2000,8(3):281-293.
[29]BONOMI F,MITZENMACHER M,PANIGRAHY R,et al.An improved construction for counting bloom filters[C]//Procee-dings of the Annual European Symposium.2006:684-695.
[30]MITZENMACHER M.Compressed bloom filters[J].IEEE/ACM Transactions on Networking,2002,10(5):604-612.
[31]GUO D,WU J,CHEN H,et al.The dynamic bloom filters[J].IEEE Transactions on Knowledge and Data Engineering,2009,22(1):120-133.
[32]WU Y,HE J,YAN S,et al.Elastic bloom filter:Deletable andexpandable filter using elastic fingerprints[J].IEEE Transactions on Computers,2021,71(4):984-991.
[33]KRASKA T,BEUTEL A,CHI E H,et al.The case for learned index structures[C]//Proceedings of the 2018 International Conference on Management of Data.2018:489-504.
[34]PENG Y,GUO J,LI F,et al.Persistent bloom filter:Membership testing for the entire history[C]//Proceedings of the 2018 International Conference on Management of Data.2018:1037-1052.
[35]XIE R,LI M,MIAO Z,et al.Hash adaptive bloom filter[C]//Proceedings of the 37th IEEE International Conference on Data Engineering.2021:636-647.
[36]ROTHENBERG C E,MACAPUNA C A B,VERDI F L,et al.The deletable bloom filter:a new member of the bloom family[J].IEEE Communications Letters,2010,14(6):557-559.
[37]MOSHARRAF N,JAYASUMANA A P,RAY I.Compactedbloom filter[C]//Proceedings of the 2nd IEEE International Conference on Collaboration and Internet Computing.2016:304-311.
[38]ALMEIDA P S,BAQUERO C,PREGUIÇA N M,et al.Scalable bloom filters[J].Information Processing Letters,2007,101(6):255-261.
[39]LU J,YANG T,WANG Y,et al.One hashing bloom filter[C]//Proceedings of the International Symposium on Quality of Ser-vice.2015:289-298.
[40]BHATTACHARYA A,BEDATHUR S,BAGCHI A.Adaptive learned bloom filters under incremental workloads[C]//Proceedings of the India Joint International Conference on Data Science and Management of Data.2020:107-115.
[41]LIU Q,ZHENG L,SHEN Y,et al.Stable learnedbloom filters for data streams[J].Proceedings of the VLDB Endowment,2020,13(11):2355-2367.
[42]GOSWAMI M,PAGH R,SILVESTRI F,et al.Distance sensitive bloom filters without false negatives[C]//Proceedings of the Annual Symposium on Discrete Algorithms.Society for Industrial and Applied Mathematics.2017:257-269.
[43]COHEN S,MATIAS Y.Spectral bloom filters[C]//Proceedings of the International Conference on Management of Data.2003:241-252.
[44]MATSUMOTO Y,HAZEYAMA H,KADOBAYASHI Y.Adaptive bloom filter:A space-efficient counting algorithm for unpredictable network traffic[J].IEICE Transactions on Information and Systems,2008,91(5):1292-1299.
[45]FAN B,ANDERSEN D G,KAMINSKY M,et al.Cuckoo filter:Practically better than bloom[C]//Proceedings of International Conference on Emerging Networking Experiments and Techno-logies.2014:75-88.
[46]BENDER M A,FARACHCOLTON M,JOHNSON R,et al.Don’t thrash:How to cache your hash on flash[J].Proceedings of the VLDB Endowment,2012,5(11):1627-1637.
[47]CHEN H,LIAO L,JIN H,et al.The dynamic cuckoo filter[C]//Proceedings of the International Conference on Network Protocols.2017:1-10.
[48]BRESLOW A D,JAYASENA N.Morton filters:Faster,space-efficient cuckoo filters via biasing,compression,and decoupled logical sparsity[J].Proceedings of the VLDB Endowment,2018,11(9):1041-1055.
[49]BRESLOW A D,JAYASENA N.Morton filters:fast,com-pressed sparse cuckoo filters[J].The VLDB Journal,2020,29(2):731-754.
[50]WANG M,ZHOU M,SHI S,et al.Vacuum filters:More space-efficient and faster replacement for bloom and cuckoo filters[J].Proceedings of the VLDB Endowment,2019,13(2):197-210.
[51]GRAF T M,LEMIRE D.Xor filters:Faster and smaller thanbloom and cuckoo filters[J].ACM Journal of Experimental Algorithmics,2020,25(1):1-16.
[52]ZHANG F,CHEN H,JIN H,et al.The logarithmic dynamic cuckoo filter[C]//Proceedings of the 37th IEEE International Conference on Data Engineering.2021:948-959.
[53]FU P,LUO L,GUO D,et al.Jump filter:Dynamic sketch design for big data governance[J].Journal of Software,2023,34(3):1193-1212.
[54]HUANG K,YANG T.Additive and subtractive cuckoo filters[C]//Proceedings of the International Symposium on Quality of Service.2020:1-10.
[55]HUANG K,YANG T.Tagged cuckoo filters[C]//Proceedings of the International Conference on Computer Communications and Networks.2021:1-10.
[56]PANDEY P,BENDER M A,JOHNSON R,et al.A general-purpose counting filter:Making every bit count[C]//Proceedings of the International Conference on Management of Data.2017:775-787.
[57]PANDEY P,CONWAY A,DURIE J,et al.Vector quotient filters:Overcoming the time/space trade-off in filter design[C]//Proceedings of the International Conference on Management of Data.2021:1386-1399.
[58]DAYAN N,BERCEA I,REVIRIEGO P,et al.InfiniFilter:Expanding filters to infinity and beyond[J].Proceedings of the ACM on Management of Data,2023,1(2):1-27.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!