Computer Science ›› 2024, Vol. 51 ›› Issue (4): 4-10.doi: 10.11896/jsjkx.231000226

• Compact Data Structure • Previous Articles     Next Articles

IntervalSketch:Approximate Statistical Method for Interval Items in Data Stream

CHEN Xinyang1,2, CHEN Hanze1, ZHOU Jiasheng1, HUANG Jiaqing1, YU Jiashuo1, ZHU Longlong1,2, ZHANG Dong2,3   

  1. 1 College of Computer Science and Big Data,Fuzhou University,Fuzhou 350108,China
    2 Quan Cheng Laboratory,Jinan 250100,China
    3 Zhicheng College,Fuzhou University,Fuzhou 350002,China
  • Received:2023-10-31 Revised:2024-01-23 Online:2024-04-15 Published:2024-04-10
  • Supported by:
    National Key R & D Program of China(2023YFB2904000,2023YFB2904005),Quan Cheng Laboratory(QCLZD202304) and Research Project of Provincial Laboratory of Shandong,China(SYS202201).

Abstract: The proportion of streaming databases is gradually increasing,and extracting the required information in the data streams of streaming databases is an important task.In this paper,we study interval items which refer to pairs of elements arriving with a fixed interval,and apply them to network scenarios.It is the first work to define and count interval items in data streams.To efficiently count the top-K interval items,IntervalSketch is proposed.IntervalSketch firstly chunks the data stream based on simulated annealing to accelerate the statistical speed,secondly,it uses Sketch to store the interval items,and lastly reduces the memory of storing the interval items in Sketch through the feature grouping storage strategy,which enhances the accuracy of counting the interval items.Extensive comparative experiments are carried out on two real datasets.Experimental results show that IntervalSketch significantly outperforms the baseline solution with the same memory,and the processing time is1/3~1/2 of the baseline solution,the average absolute error and the average relative error are1/3 of the baseline solution.

Key words: Sketch, Database, Data mining

CLC Number: 

  • TP311.13
[1]CORMODE G,MUTHUKRISHNAN S.An improved datastream summary:The count-min sketch and its applications[C]//LATIN 2004:Theoretical Informatics:6th Latin American Symposium.Berlin Heidelberg:Springer,2004:29-38.
[2]YANG T,JIANG J,LIU P,et al.Elastic sketch:Adaptive and fast network-wide measurements[C]//Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication.2018:561-575.
[3]TANG L,HUANG Q,LEE P P C.Mv-sketch:A fast and compact invertible sketch for heavy flow detection in network data streams[C]//IEEE Conference on Computer Communications(IEEE 2019).2019:2026-2034.
[4]FAN Z,ZHANG Y,YANG T,et al.Periodicsketch:Finding periodic items in data streams[C]//2022 IEEE 38th International Conference on Data Engineering(ICDE).IEEE,2022:96-109.
[5]LIU Z,KONG C,YANG K,et al.Hypercalm sketch:One-pass mining periodic batches in data streams[C]//2023 IEEE 39th International Conference on Data Engineering(ICDE).IEEE.2023.
[6]FAN Z,GUO J,LI X,et al.Finding Simplex Items in DataStreams[C]//2023 IEEE 39th International Conference on Data Engineering(ICDE).IEEE,2023:1953-1966.
[7]LI X,FAN Z,LI H,et al.SteadySketch:Finding Steady Flows in Data Streams[C]//2023 IEEE/ACM 31st International Symposium on Quality of Service(IWQoS).IEEE,2023:1-9.
[8]ZHONG Z,YAN S,LI Z,et al.Burstsketch:Finding bursts in data streams[C]//Proceedings of the 2021 International Confe-rence on Management of Data.2021:2375-2383.
[9]FAN Z,HU Z,WU Y,et al.Pisketch:finding persistent and infrequent flows[C]//Proceedings of the ACM SIGCOMM Workshop on Formal Foundations and Security of Programmable Network Infrastructures.2022:8-14.
[10]WANG J,ZHANG Y.Opportunity model for e-commerce rec-ommendation:right product;right time[C]//Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval.2013:303-312.
[11]JAIN V R,BAGREE R,KUMAR A,et al.wildCENSE:GPS based animal tracking system[C]//2008 International Confe-rence on Intelligent Sensors Networks and Information Proces-sing.IEEE,2008:617-622.
[12]ROYH A,ZENG J,BAGGAG P,et al.C.Snoeren“Inside the social network's(datacenter) network[C]//Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication.2015:23-137.
[13]TAN L,LYANG F,YI J K.Optimisation of Sketch algorithm based on AVX instruction set[J].Computer Science,2021,48(11A):585-587.
[14]AGRAWALR,SRIKANT R.Fast algorithms for mining asso-ciation rules[C]//Proceedings 20th International Conference.Very Large Data Bases VLDB,1994,1215:487-499.
[15]YUAN Q,SHANG J,CAO X,et al.Detecting multiple periods and periodic patterns in event time sequences[C]//Proceedings of the 2017 ACM on Conference on Information and Knowledge Management.2017:617-626.
[16]CHANG J H,PAR K N H.A novel weighting technique formining sequence data streams[C]//IT Convergence and Security 2012.Springer Netherlands,2013:929-936..
[17]CHEN Y L,CHIANG M C,KOM T.Discovering time-interval sequential patterns in sequence databases[J].Expert Systems with Applications,2003,25(3):343-354.
[18]AGRAWAL R,IMIELIŃSKI T,SWAMI A.Mining asso-ciation rules between sets of items in large databases[C]//Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data.1993:207-216.
[19]HAN J,PEI J,YIN Y.Mining frequent patterns without candidate generation[J].ACM Sigmod Record,2000,29(2):1-12.
[20]Anonymized Internet Traces 2016[OL].https://catalog.caida.org/dataset/passive_2016_pcap.
[21]Taobao user behavior data set[OL].ttps://tianchi.aliyun.com/dataset/649.
[1] WANG Hancheng, DAI Haipeng, CHEN Zhipeng, CHEN Shusen, CHEN Guihai. Large-scale Network Community Detection Algorithm Based on MapReduce [J]. Computer Science, 2024, 51(4): 11-18.
[2] MAO Chenyu, HUANG He, SUN Yu'e, DU Yang. Global Top-K Frequent Flow Measurement for Continuous Periods in Distributed Networks [J]. Computer Science, 2024, 51(4): 28-38.
[3] WU Yanni, ZHOU Zhengyan, CHEN Hanze, ZHANG Dong. RBFRadar:Detecting Remarkable Burst Flows with Programmable Data Plane [J]. Computer Science, 2024, 51(4): 48-55.
[4] XU Haiyang, LIU Hailong, YANG Chaoyun, WANG Shuo, LI Zhanhuai. MMOS:Memory Resource Sharing Methods to Support Overselling in Multi-tenant Databases [J]. Computer Science, 2024, 51(2): 27-35.
[5] ZOU Chunling, ZHU Zhengzhou. Fusion Model of Housekeeping Service Course Recommendation Based on Knowledge Graph [J]. Computer Science, 2024, 51(2): 47-54.
[6] SHEN Zhehui, WANG Kailai, KONG Xiangjie. Exploring Station Spatio-Temporal Mobility Pattern:A Short and Long-term Traffic Prediction Framework [J]. Computer Science, 2023, 50(7): 98-106.
[7] YANG Zhenkai, CAO Yibing, ZHAO Xinke, ZHENG Jingbiao. Temporal Hierarchical Data Management Based on Nested Intervals Scheme in Relational Database [J]. Computer Science, 2023, 50(6A): 220500290-5.
[8] ZHANG Jian, ZHANG Ye. College Students Employment Dynamic Prediction of Multi-feature Fusion Based on GRU-LSTM [J]. Computer Science, 2023, 50(6A): 220500056-6.
[9] HUANG Zhicheng, LIU Xianhui. Microservice Splitting Approach Based on Database Table [J]. Computer Science, 2023, 50(11A): 230200102-7.
[10] ZHAO Xuejian, ZHAO Ke. Bio-inspired Frequent Itemset Mining Strategy Based on Genetic Algorithm [J]. Computer Science, 2023, 50(11A): 220700200-8.
[11] WANG Yitan, WANG Yishu, YUAN Ye. Survey of Learned Index [J]. Computer Science, 2023, 50(1): 1-8.
[12] LI Rong-fan, ZHONG Ting, WU Jin, ZHOU Fan, KUANG Ping. Spatio-Temporal Attention-based Kriging for Land Deformation Data Interpolation [J]. Computer Science, 2022, 49(8): 33-39.
[13] WANG Run-an, ZOU Zhao-nian. Query Performance Prediction Based on Physical Operation-level Models [J]. Computer Science, 2022, 49(8): 49-55.
[14] YAO Xiao-ming, DING Shi-chang, ZHAO Tao, HUANG Hong, LUO Jar-der, FU Xiao-ming. Big Data-driven Based Socioeconomic Status Analysis:A Survey [J]. Computer Science, 2022, 49(4): 80-87.
[15] DOU Zhi, WANG Ning, WANG Shi-jie, WANG Zhi-hui, LI Hao-jie. Sketch Colorization Method with Drawing Prior [J]. Computer Science, 2022, 49(4): 195-202.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!