Computer Science ›› 2019, Vol. 46 ›› Issue (8): 56-63. doi: 10.11896/j.issn.1002-137X.2019.08.009

• Big Data and Data Science •

Method for Generating Massive Data with Assignable Distribution

LI Bo-jia1, ZHANG Yang-sen1,2, CHEN Ruo-yu1,2

  1. Institute of Intelligent Information Processing, Beijing Information Science and Technology University, Beijing 100101, China
  2. Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing 100101, China
  • Received: 2018-07-20 Online: 2019-08-15 Published: 2019-08-15
  • Corresponding author: CHEN Ruo-yu (born 1982), male, Ph.D., lecturer, CCF member. Main research interests: natural language processing and artificial intelligence. E-mail: ruoyu-chen@foxmail.com
  • About the authors: LI Bo-jia (born 1992), male, master's student, CCF student member. Main research interests: big data and artificial intelligence. E-mail: 1012139091@qq.com; ZHANG Yang-sen (born 1962), male, Ph.D., professor, CCF member. Main research interests: natural language processing and artificial intelligence
  • Supported by:
    National Natural Science Foundation of China (61772081), Scientific Research Program of the Beijing Municipal Education Commission (KM201711232014)

Abstract: Owing to privacy protection and other factors, enterprises and governments have been slow to open their data. At the same time, limited network bandwidth makes it difficult for research institutions to download and use massive public datasets. Few existing data generation tools simultaneously meet the requirements of research work in terms of the distribution pattern of the generated data, the correlations within it, its accuracy, and the scalability of the system. To address the problem of massive data generation, this paper proposes a distributed data generation model. According to the data distribution patterns and correlations specified in the user's configuration, the model uses reservoir sampling or random sampling algorithms to perform sampling, correlation calculation, and splicing on a Web information knowledge base, generating data whose attributes conform to the user's configuration. Data generation experiments on the Apache Spark distributed computing engine show that the generated data meet the specified distribution and correlation requirements, and that, from a statistical point of view, the data generation speed has a linear relationship with the data size and the cluster size. These results indicate that the data generated by the proposed method have high accuracy and diversity of distribution, and that the corresponding system has good scalability.

Key words: Correlation calculation, Data distribution test, Data generation, Distributed computing, Reservoir sampling
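
The abstract names reservoir sampling as one of the sampling algorithms applied to the knowledge base. As a rough illustration only, the classic single-pass variant (often called Algorithm R) might look like the Python sketch below. The function name `reservoir_sample`, its parameters, and the stand-in input are assumptions made for this example; the paper's own implementation runs on Apache Spark and is not reproduced here.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Pick k items uniformly at random from a stream of unknown length.

    Hypothetical helper for illustration (classic Algorithm R), not the
    authors' code. If the stream has fewer than k items, all are returned.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Item i (0-based) is kept with probability k / (i + 1):
            # draw j uniformly from 0..i and overwrite slot j if j < k.
            j = rng.randint(0, i)  # both ends inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


# Stand-in for a knowledge-base column: sample 5 of 1000 record ids.
sample = reservoir_sample(range(1000), k=5, seed=42)
print(sample)
```

Because each item is examined once and only `k` items are held in memory, this style of sampling suits inputs too large to load at once, which is presumably why it appears alongside plain random sampling in the abstract.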

CLC Number: 

  • TP391