计算机科学 ›› 2020, Vol. 47 ›› Issue (10): 83-90.doi: 10.11896/jsjkx.190900014

• 数据库&大数据&数据科学 • 上一篇    下一篇

基于FLINK的滑动窗口内三角形计数算法研究

王旭1, 杨晓春2   

  1. 1 北京理工大学计算机学院 北京100081
    2 东北大学计算机科学与工程学院 沈阳110819
  • 收稿日期:2019-09-02 修回日期:2019-11-04 出版日期:2020-10-15 发布日期:2020-10-16
  • 通讯作者: 杨晓春(yangxc@cse.neu.edu.cn)
  • 作者简介:yoo.wangxu@gmail.com
  • 基金资助:
    国家自然科学基金(61572122,61532021)

Study of Triangle Counting Algorithm with Sliding Windows Based on FLINK

WANG Xu1, YANG Xiao-chun2   

  1. 1 School of Computer Science and Technology,Beijing Institute of Technology,Beijing 100081,China
    2 School of Computer and Engineering,Northeastern University,Shenyang 110819,China
  • Received:2019-09-02 Revised:2019-11-04 Online:2020-10-15 Published:2020-10-16
  • About author:WANG Xu,born in 1997,postgraduate,is a member of China Computer Federation.Her main research interests include data stream processing and so on.
    YANG Xiao-chun,born in 1973,professor,doctoral supervisor,is a member of China Computer Federation.Her main research interests include databasetheo-ry and systems,data quality proces-sing,text and data stream processing,and AI enabled data systems.
  • Supported by:
    National Natural Science Foundation of China (61572122,61532021)

摘要: 三角形计数旨在计算图中全局三角形和局部三角形的数量,是图数据挖掘中的一类重要工作。三角形的数量被广泛应用于角色识别、推荐系统、社区发现、垃圾邮件和欺诈检测等领域。在以流形式给出的图中,边具有时间性,同时现实生活中的图存在着大量的重复边。为充分利用图中的时间信息以挖掘网络知识,研究在多图流上计算滑动窗口内全局和局部三角形数量的问题,使用窗口机制同时研究多个窗口以利用隐含的时间关系获取更多信息。文中提出基于FLINK窗口操作的三角形计数算法和基于滑动窗口的三角形增量计数算法,以现有的边采样工作为基础,使用边集存储窗口历史数据实现一遍流计算,从而准确地计算面向多图流的滑动窗口内全局和局部三角形数量。基于FLINK窗口操作的三角形计数算法使用FLINK提供的窗口机制,基于滑动窗口的三角形增量计数算法,通过计算窗口滑入和滑出数据来实现窗口计数,避免了相邻两个窗口间重合边的大量重复计算,无缝地处理多个时间窗口,对于滑入和滑出数据中的重复数据,使用去重机制来进一步减小计算量。理论证明两种算法可以实现滑动窗口内三角形准确计数,并通过实验分析了窗口大小、滑动距离、数据分布和数据流速等因素对窗口处理时间的影响。与TRIEST算法相比,当窗口较小时,基于FLINK窗口操作的三角形计数算法和基于滑动窗口的三角形增量计数算法速度更快;当窗口较大时,保证了计算结果的准确性。

关键词: FLINK, 滑动窗口, 三角形计数, 图流挖掘, 准确流算法

Abstract: Triangle Counting,calculating global and local triangle counts,is an important work in data mining,whose number is widely used in important role identification,recommendation systems,community discovery,spam and fraud detection etc.In the graphs presented as a stream of edges,edges are temporal,and there are a large of duplicate edges in real-world graphs.For full use of the time information in the graph and mining the network knowledge,this work studied the problem of estimating global and local triangle counts on a multigraph stream with sliding windows,that simultaneously studied multiple windows by window mechanism to obtain more information in implicit time relationships.A triangle counting algorithm based on FLINK window ope-ration and a triangle increment counting algorithm based on sliding window are proposed.Like the existing edge sampling work,the edge set is used to store window history data for accurately calculating global and local triangle counts on a multigraph stream with sliding windows in one-pass.The triangle counting algorithm based on FLINK window operator uses the window mechanism provided by FLINK.While the triangle increment counting algorithm based on sliding window realizes window counting by calculating slide-in and slide-out data through the window,reducing a large number of repeated calculations of coincident edges between adjacent windows,seamlessly processing multiple time windows,and for duplicate data in slide-in and slide-out,uses a deduplication mechanism to further reduce the calculation.The theory has been proven that both the algorithms can accurately count the triangles in the sliding window,and the effects of window size,sliding distance,data distribution and data flow rate on window processing time were analyzed through experiments.Compared with the TRIEST algorithm,when the window is small,the triangle counting algorithm based on FLINK window operation and the triangle increment counting algorithm based on sliding window have faster speed.When the window is large,the accuracy of the calculation result is guaranteed.

Key words: Accurate stream algorithm, FLINK, Graph stream mining, Sliding window, Triangle counting

中图分类号: 

  • TP301.6
[1]WELSER H T,GLEAVE E,FISHER D,et al.Visualizing the signatures of social roles in online discussion groups[J].Journal of Social Structure,2007,8(2):1-32.
[2]TSOURAKAKIS C E,DRINEAS P,MICHELAKIS E,et al.Spectral counting of triangles via element-wise sparsifification and triangle-based link recommendation[J].Social Network Analysis & Mining,2011,1(2):75-81.
[3]GLEICH D F,SESHADHRI C.Vertex neighborhoods,low conductance cuts,and good seeds for local community methods[C]//18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.New York NY:ACM Press,2012:597-605.
[4]BECCHETTI L,BOLDI P,CASTILLO C,et al.Efficient semi-streaming algorithms for local triangle counting in massive graphs[C]//ACM Conference on Knowledge Discovery and Data Mining.New York NY:ACM Press,2008:16-24.
[5]KAVASSERY-PARAKKAT N,HANJANI K M,PAVAN A.Improved Triangle Counting in Graph Streams:Power of Multi-Sampling[C]//IEEE International Conference on Advances in Social Networks Analysis and Mining.Washington,DC:IEEE Computer Society Press,2018:28-31,33.
[6]JHA M,SESHADHRI C,PINAR A.Counting Triangles in Real-World Graph Streams:Dealing with Repeated Edges and Time Windows[C]//49th Asilomar Conference on Signals,Systems and Computers.Washington DC:IEEE Computer Society Press,2015:1507-1514.
[7]BERA S K,CHAKRABARTI A.Towards Tighter Space Boundsfor Counting Triangles and Other Substructures in Graph Streams[C]//34th Symposium on Theoretical Aspects of Computer Science.Dagstuhl,Germany,2017,11:1-14.
[8]LIM Y,JUNG M,KANG U.Memory-efficient and accuratesampling for counting local triangles in graph streams:from simple to multigraphs [J].ACM Transactions on Knowledge Discovery from Data,2018,12(1):1-28.
[9]JUNG M,LEE S,LIM Y,et al.FURL:Fixed-memory and Uncertainty Reducing Local Triangle Counting for Graph Streams[J].Data Mining and Knowledge Discovery,2019,33(5):1225-1253.
[10]SHIN K.WRS:Waiting Room Sampling for Accurate Triangle Counting in Real Graph Streams[C]//2017 IEEE International Conference on Data Mining.Washington,DC:IEEE Computer Society Press,2017:1087-1092.
[11]BULTEAU L,FROESE V,KUTZKOV K,et al.TriangleCounting in Dynamic Graph Streams[J].Algorithmica,2016,76(1):259-278.
[12]SHIN K,KIM J,HOOI B,et al.Think Before You Discard:Ac-
curate Triangle Counting in Graph Streams with Deletions[C]//Joint European Conference on Machine Learning and Knowledge Discovery in Databases,Lecture Notes in Computer Science.2018:141-157.
[13]DE STEFANI L,TEROLLI E,UPFAL E.Tiered Sampling:An Efficient Method for Approximate Counting Sparse Motifs in Massive Graph Streams[C]//2017 IEEE International Conference on Big Data.Washington,DC:IEEE Computer Society Press,2017:776-786.
[14]WANG P,QI Y,SUN Y,et al.Approximately counting triangles in large graph streams including edge duplicates with a fixed memory usage[J].Proceedings of the VLDB Endowment,2017,11(2):162-175.
[15]SHIN K,HAMMOUD M,LEE E,et al.Tri-Fly:Distributed Estimation of Global and Local Triangle Counts in Graph Streams[C]//Pacific-Asia Conference on Knowledge Discovery and Data Mining,Lecture Notes in Computer Science.2018:651-663.
[16]SHIN K,LEE E,OH J,et al.DiSLR:Distributed Sampling with Limited Redundancy For Triangle Counting in Graph Streams[J].arXiv:1802.04249v1,2018.
[17]PINGHUI W,PENG J,YIYAN Q,et al.A Streaming Algo-rithm of Approximating Global and Local Triangle Counts in Parellal[C]//35th IEEE International Conference on Data Engineering.Washington,DC:IEEE Computer Society Press,2019:758-769.
[18]LORENZO D S,ALESSANDRO E,MATECO R,et al.TRIEST:Counting Local and Global Triangles in Fully-dynamic Streams with Fixed Memory Size[J].ACM Transactions on Knowledge Discovery from Data,2017,11(4):1-50.
[19]JHA M.Counting Triangles in Graph Streams[J].Encyclopedia of Algorithms,2016:458-464.
[20]VITTER J S.Random sampling with a reservoir[J].ACMTransactions on Mathematical Software,1985,11(1):37-57.
[21]TANGWONGSAN K,PAVAN A,TIRTHAPURA S.ParallelTriangle Counting in Massive Streaming Graphs[C]//ACM international conference on Information and Knowledge Management.New York NY:ACM Press,2013:781-786.
[1] 龚建锋.
抗能量分析的带符号滑动窗口标量乘算法
Resisting Power Analysis Algorithm of Scalar Multiplication Based on Signed Sliding Window
计算机科学, 2021, 48(6A): 533-537. https://doi.org/10.11896/jsjkx.191200097
[2] 宋莺, 钟忺, 孙宝林, 桂超.
MANET中基于滑动窗口的网络编码协作算法
Sliding Window-based Network Coding Cooperative Algorithm in MANET
计算机科学, 2020, 47(11): 322-326. https://doi.org/10.11896/jsjkx.191000181
[3] 彭成, 贺婧, 池昊.
一种确定滑动窗口规模的边界距离算法
Boundary Distance Algorithm for Determining Sliding Window Size
计算机科学, 2019, 46(6A): 482-487.
[4] 郭威, 于建江, 汤克明, 徐涛.
动态数据流分析的在线超限学习算法综述
Survey of Online Sequential Extreme Learning Algorithms for Dynamic Data Stream Analysis
计算机科学, 2019, 46(4): 1-7. https://doi.org/10.11896/j.issn.1002-137X.2019.04.001
[5] 杜娟, 沈思昀.
基于改进多权值滑动窗口的立体匹配方法的实现及应用
Implementation and Application of Stereo Matching Method Based onImproved Multi-weight Sliding Window
计算机科学, 2019, 46(11A): 241-245.
[6] 毛义坪, 余磊, 官泽瑾.
基于分数阶微分的多聚焦图像融合
Multi-focus Image Fusion Based on Fractional Differential
计算机科学, 2019, 46(11A): 315-319.
[7] 李青,肖迎元,王晓晔,李玉坤.
无线传感器网络中基于聚簇结构的Skyline查询方法
Clustering Architecture-based Skyline Query Processing in Wireless Sensor Networks
计算机科学, 2017, 44(10): 177-181. https://doi.org/10.11896/j.issn.1002-137X.2017.10.033
[8] 张建辉,王会青,孙宏伟,郭芷榕,白莹莹.
基于二分迭代SAX的时序相似性度量算法
Similarity Measure Algorithm of Time Series Based on Binary-dividing SAX
计算机科学, 2017, 44(1): 247-252. https://doi.org/10.11896/j.issn.1002-137X.2017.01.046
[9] 李彦,周艺,刘雨声,梁智.
无线体域网中基于时隙滑动窗口的概率路由协议
Probability Routing Protocol Based on Slotted Sliding Window in Wireless Body Area Network
计算机科学, 2016, 43(Z11): 259-263. https://doi.org/10.11896/j.issn.1002-137X.2016.11A.060
[10] 王丽珍,周丽华,邓世昆.
一种基于滑动窗口模式匹配的加权预测方法
Weighted Prediction Method Based on Sliding Window and Pattern Matching
计算机科学, 2016, 43(Z11): 591-596. https://doi.org/10.11896/j.issn.1002-137X.2016.11A.134
[11] 钟忺,杨光,卢炎生.
基于双阈值滑动窗口子镜头分割和完全连通图的关键帧提取方法
Method of Key Frames Extraction Based on Double-threshold Values Sliding Window Sub-shot Segmentation and Fully Connected Graph
计算机科学, 2016, 43(6): 289-293. https://doi.org/10.11896/j.issn.1002-137X.2016.06.057
[12] 曲晓雅,刘真.
基于阈值滑动窗口机制的虚拟机迁移判决算法
Dynamic Threshold Window Algorithm for Virtual Machines Migration Decision
计算机科学, 2016, 43(4): 64-69. https://doi.org/10.11896/j.issn.1002-137X.2016.04.013
[13] 陈小东,孙力娟,韩崇,郭剑.
基于模糊聚类的数据流概念漂移检测算法
Detecting Concept Drift of Data Stream Based on Fuzzy Clustering
计算机科学, 2016, 43(4): 219-223. https://doi.org/10.11896/j.issn.1002-137X.2016.04.045
[14] 吕太之.
基于滑动窗口Hurst指数的心电分析研究
Research of Heart Rate Variable Analysis Based on Sliding Window Hurst
计算机科学, 2016, 43(2): 259-262. https://doi.org/10.11896/j.issn.1002-137X.2016.02.054
[15] 常东亚,严建峰,杨璐,刘晓升.
基于滑动窗口的主题模型
Sliding-window Based Topic Modeling
计算机科学, 2016, 43(12): 101-107. https://doi.org/10.11896/j.issn.1002-137X.2016.12.018
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!