Computer Science ›› 2019, Vol. 46 ›› Issue (8): 56-63. doi: 10.11896/j.issn.1002-137X.2019.08.009

• Big Data & Data Science •

Method for Generating Massive Data with Assignable Distribution

LI Bo-jia1, ZHANG Yang-sen1,2, CHEN Ruo-yu1,2   

  1. Institute of Intelligent Information Processing, Beijing Information Science and Technology University, Beijing 100101, China
  2. Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing 100101, China
  • Received: 2018-07-20  Online: 2019-08-15  Published: 2019-08-15

Abstract: Because of factors such as privacy protection, corporate and government data are slow to be released. At the same time, limited network bandwidth makes it difficult for research institutions to download and use massive public datasets. Few existing data generation tools simultaneously meet the needs of research work in terms of data distribution pattern, correlation, accuracy, and system scalability. To address the problem of massive data generation, this paper proposes a distributed data generation model. According to the data distribution pattern and correlation specified in the user's configuration, the model samples a Web data knowledge base using reservoir sampling or random sampling, computes the correlations, and splices the results to generate data whose attributes conform to the user's configuration. Data generation tests on the distributed computing engine Apache Spark show that the generated data meets the specified distribution and correlation requirements, and that, statistically, the data generation speed is linear in both data size and cluster size. These results indicate that the data generated by the proposed method has high accuracy and diversity of distribution, and that the proposed data generation system has good scalability.
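
The abstract names reservoir sampling as one of the sampling algorithms but does not give its details. The following is a minimal single-machine sketch of the classic technique (often called Algorithm R), not the paper's distributed implementation. The function name `reservoir_sample` and its parameters are hypothetical and chosen only for illustration.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Draw up to k items uniformly at random from a stream of unknown length.

    Illustrative sketch only: the name and signature are assumptions,
    not taken from the paper.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item replaces a reservoir
            # entry only if the slot falls inside the reservoir, which
            # happens with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir
```

The sketch makes a single pass and keeps only `k` items in memory, which is why the technique suits a knowledge base too large to load at once. For example, `reservoir_sample(range(1_000_000), 5)` returns five distinct values without materializing the range.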

Key words: Data generation, Reservoir sampling, Distributed computing, Correlation calculation, Data distribution test

CLC Number: 

  • TP391