计算机科学 ›› 2018, Vol. 45 ›› Issue (11): 249-255.doi: 10.11896/j.issn.1002-137X.2018.11.039

• 人工智能 • 上一篇    下一篇

SNS:一种快速无偏的分层图抽样算法

朱君鹏, 李晖, 陈梅, 戴震宇   

  1. (贵州大学计算机科学与技术学院 贵阳550025)
    (贵州省先进计算与医疗信息服务工程实验室 贵阳550025)
  • 收稿日期:2017-10-08 发布日期:2019-02-25
  • 作者简介:朱君鹏(1989-),男,硕士生,主要研究方向为大规模数据管理与分析,E-mail:jpzhu.gm@gmail.com;李 晖(1982-),男,博士,硕士生导师,主要研究方向为大规模数据管理与分析,E-mail:cse.HuiLi@gzu.edu.cn(通信作者);陈 梅(1964-),女,硕士生导师,主要研究方向为数据库及其应用系统。
  • 基金资助:
    本文受国家自然科学基金项目(61562010, 61462012,U1531246),贵州省重大应用基础研究项目(JZ20142001),贵州省数据分析云服务创新团队(黔科合人才团队字53),贵州大学研究生创新基金项目(研理工2017078)资助。

SNS:A Fast and Unbiased Stratified Graph Sampling Algorithm

ZHU Jun-peng, LI Hui, CHEN Mei, DAI Zhen-yu   

  1. (College of Computer Science and Technology,Guizhou University,Guiyang 550025,China)
    (Guizhou Engineering Laboratory of Advance Computing and Medical Information Service,Guiyang 550025,China)
  • Received:2017-10-08 Published:2019-02-25

摘要: 抽样作为一种有效的统计分析方法,常被用于大规模图数据分析领域以提升性能。现有的图抽样算法大多存在高度节点或低度节点过度入样的问题,较大程度地影响了算法的性能。复杂网络具有无标度特性,即节点的度服从幂律分布,节点个体之间存在较大差异。在基于点选择策略的抽样方法的基础上,通过结合节点的近似度分布策略,设计并实现了高效无偏的分层图抽样算法SNS。在3个真实的图数据集上的实验结果表明,SNS算法比其他图抽样算法保留了更多的拓扑属性,且执行效率比FFS更高。SNS算法在度的无偏性、抽样结果拓扑属性近似性方面的表现均优于现有算法。

关键词: 分层抽样, 图抽样, 向量聚类, 性能评估, 有偏抽样

Abstract: As an effective method of statistical analysis,sampling is commonly used in the field of analyzing the large-scale graph data to improve the performance.However,most of the existing graph sampling algorithms often have the problem of excessive sampling of high and low nodes,resulting in lower accuracy derived from the scale-free characteristic of complex networks.The scale-free characteristic means the degrees of different nodes follow a power law distribution,and the difference between nodes is huge.On the basis of the sampling method on node selection strategy,combining the approximate degree distribution strategy of nodes,this paper proposed and realized an efficient and unbiased stratified graph sampling algorithm named SNS.The experimental results show that SNS algorithm preserves more topological properties on three real data sets than other graph sampling algorithms,and consumes less time than FFS algorithm.Therefore,SNS algorithm is superior to the existing algorithms in terms of the unbiasedness of degree and the accuracy of sampling results.

Key words: Biased sampling, Graph sampling, Performance estimation, Stratified sampling, Vector clustering

中图分类号: 

  • TP393
[1]LESKOVEC J,FALOUTSOS C.Sampling from large graphs [C]∥ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.ACM,2006:631-636.
[2]AHMED N K,NEVILLE J,KOMPELLA R.Network Samp- ling:From Static to Streaming Graphs[J].Acm Transactions on Knowledge Discovery from Data,2013,8(2):1-56.
[3]WATTS D J,STROGATZ S H.Collective dynamics of ‘small-world’ networks[J].Nature,1998:393(6684):440-442.
[4]FALOUTSOS M,FALOUTSOS P,FALOUTSOS C.On power-law relationships of the Internet topology[J].Acm Sigcomm Computer Communication Review,1999:29(4):251-262.
[5]HAN J.Data Mining:Concepts and Techniques[M].San Francisco,CA,USA:Morgan Kaufmann Publishers Inc,2005.
[6]HAMERLY G.Making k -means even faster[C] ∥SIAM International Conference on Data Mining 2010.2010:130-140.
[7]JORGENSEN M.EM Algorithm[M]∥Encyclopedia of Envi- ronmetrics.John Wiley & Sons,Ltd,2006:267-292.
[8]PARK H S,JUN C H.A simple and fast algorithm for K-Medoids clustering[J].Expert Systems with Applications,2009,36(2):3336-3341.
[9]DASZYKOWSKI M,WALCZAK B.Density-Based Clustering Methods[M]∥Comprehensive Chemometrics.Elsevier,2010.
[10]CAMPELLO R J G B,MOULAVI D,SANDER J.Density- Based Clustering Based on Hierarchical Density Estimates[C]∥Pacific-Asia Conference on Knowledge Discovery & Data Mi-ning.2013:160-172.
[11]SNAP[OL].http://snap.stanford.edu/data/index.html.
[12]DASGUPTA A,KUMAR R,SIVAKUMAR D.Social sampling[C]∥ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.ACM,2012:235-243.
[13]GJOKA M,KURANT M,BUTTS C T,et al.Walking in Facebook:A Case Study of Unbiased Sampling of OSNs[C]∥INFOCOM,2010 Proceedings IEEE.IEEE,2010:1-9.
[14]PALMER C R,FALOUTSOS C.Density biased sampling:an improved method for data mining and clustering[J].Acm Sigmod Record,2000,29(2):82-92.
[15]CALLAGHAN M,CALLAGHAN M,CALLAGHAN M,et al.LinkBench:a database benchmark based on the Facebook social graph[C]∥ACM SIGMOD International Conference on Mana-gement of Data.ACM,2013:1185-1196.
[16]FLESCA S,GRECO S.Querying graph databases[C]∥International Conference on Extending Database Technology:Advances in Database Technology.Springer-Verlag,2000:510-524.
[17]MAASS S,MIN C,KASHYAP S,et al.Mosaic:Processing a Trillion-Edge Graph on a Single Machine[C]∥Twelfth Euro-pean Conference on Computer Systems.ACM,2017:527-543.
[18]ZHAO J,WANG P,LUI J C S,et al.Sampling Online Social Networks by Random Walk with Indirect Jumps[OL].http://www.researchgate.net/publication/319391447_Sampling_Online_Social_Networks_by_Random_Walk_with_Indirect_Jumps.
[19]WAGNER C,SINGER P,KARIMI F,et al.Sampling from Social Networks with Attributes[C]∥Conference www’17Proceedings of the 26th International Conference on World Wide Wep.2017:1181-1190.
[20]HASAN M A.Methods and Applications of Network Sampling[C]∥SIAM Conference on Data Mining.2016.
[21]YU L.Sampling and characterizing online social networks[C]∥IISA 2016.2016:245-249.
[22]REZVANIAN A,MEYBODI M R.Sampling algorithms for weighted networks[J].Social Network Analysis & Mining,2016,6(1):1-22.
[23]VOUDIGARI E,SALAMANOS N,PAPAGEORGIOU T,et al.Rank degree:An efficient algorithm for graph sampling[C]∥IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining.IEEE,2016:120-129.
[24]ZHAO J,LUI J C S,TOWSLEY D,et al.A tale of three graphs:Sampling design on hybrid social-affiliation networks[C]∥IEEE,International Conference on Data Engineering.IEEE,2015:939-950.
[25]CUI Y A,LI X,WANG Z X,et al.A Comparison on Methodologies of Sampling Online Social Media[J].Chinese Journal of Computers,2014,37(8):1859-1876.(in Chinese)
崔颖安,李雪,王志晓,等.在线社交媒体数据抽样方法的比较研究[J].计算机学报,2014,37(8):1859-1876.
[26]TANG J T,WANG T,WANG J.Shortest Path Approximate Algorithm for Complex Network Analysis[J].Journal of Software,2011,22(10):2279-2290.(in Chinese)
唐晋韬,王挺,王戟.适合复杂网络分析的最短路径近似算法[J].软件学报,2011,22(10):2279-2290.
[1] 鲁晨阳, 邓苏, 马武彬, 吴亚辉, 周浩浩.
基于分层抽样优化的面向异构客户端的联邦学习
Federated Learning Based on Stratified Sampling Optimization for Heterogeneous Clients
计算机科学, 2022, 49(9): 183-193. https://doi.org/10.11896/jsjkx.220500263
[2] 刘林云, 陈开颜, 李雄伟, 张阳, 谢方方.
基于卷积神经网络的旁路密码分析综述
Overview of Side Channel Analysis Based on Convolutional Neural Network
计算机科学, 2022, 49(5): 296-302. https://doi.org/10.11896/jsjkx.210300286
[3] 王一超, 廖秋承, 左思成, 谢锐, 林新华.
一种ARM处理器面向高性能计算的性能评估
Performance Evaluation of ARM-ISA SoC for High Performance Computing
计算机科学, 2019, 46(8): 95-99. https://doi.org/10.11896/j.issn.1002-137X.2019.08.015
[4] 李红军,崔西宁,牟明,韩伟.
一种面向分布式嵌入式计算机的性能评估模型
Research on Distributed Embedded Computer Performance Evaluation Model
计算机科学, 2017, 44(4): 153-156. https://doi.org/10.11896/j.issn.1002-137X.2017.04.033
[5] 林新华,秦强,李硕,文敏华,松岗聪.
使用Stencil评估Intel AVX2 Vgather指令
Evaluating Intel AVX2 Vgather Instructions with Stencils
计算机科学, 2017, 44(1): 20-24. https://doi.org/10.11896/j.issn.1002-137X.2017.01.004
[6] 王伟,王嘉郡,王明明,张文静,陈金广.
以网络性能为核心的移动自组网Flooding攻击防御技术
Defense Technology Based on Dynamic Space-Time Performance for Flooding Attacks in Mobile Ad Hoc Networks
计算机科学, 2017, 44(1): 159-166. https://doi.org/10.11896/j.issn.1002-137X.2017.01.031
[7] 张成伟,程文青,黑晓军.
基于Android平台的3G移动网络测量研究及性能分析
Measurement Study of 3G Mobile Networks Using Android Platform
计算机科学, 2015, 42(2): 24-28. https://doi.org/10.11896/j.issn.1002-137X.2015.02.005
[8] 蒋一波,王雨晨,王万良,张祯,陈琼.
一种基于机器学习的MANET网络入侵检测性能评估方法研究
Performance Analysis Method for Intrusion Detection in MANETs Based on Machine Learning Algorithms
计算机科学, 2013, 40(Z11): 170-174.
[9] 王会梅,鲜明,王国玉.
网络抗拒绝服务攻击性能的集对评估方法
Set Pair Analysis Method for Evaluating Denial of Service Attack Resistance Ability
计算机科学, 2012, 39(4): 53-55.
[10] 张丹,赵荣彩,单征,韩林,瞿进.
可重构系统中软硬任务划分方法研究
Research on Hardware/Software Task Partitioning for Reconfigurable System
计算机科学, 2012, 39(3): 276-278.
[11] 刘明骞,李兵兵,刘涵.
数字调制信号识别性能的评估方法
Performance Evaluation Algorithm of Digital Modulation Signals Recognition
计算机科学, 2011, 38(5): 64-66.
[12] 陈飞 商琳 骆斌 陈世福.
联系发现:一种新的数据挖掘方法综述

计算机科学, 2006, 33(11): 123-127.
[13] .
一种兼顾协议正确性验证和性能评估的Petri网方法

计算机科学, 2005, 32(12): 48-52.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!