Computer Science ›› 2019, Vol. 46 ›› Issue (8): 56-63. doi: 10.11896/j.issn.1002-137X.2019.08.009
李博嘉1, 张仰森1,2, 陈若愚1,2
LI Bo-jia1, ZHANG Yang-sen1,2, CHEN Ruo-yu1,2
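Abstract: Owing to privacy protection and other factors, enterprises and governments have been slow to open their data; meanwhile, limited network bandwidth makes it difficult for research institutions to download and use massive public datasets. Few existing data generation tools simultaneously meet research requirements for the distribution shape, correlations, and accuracy of the generated data and for the scalability of the system. To address the problem of massive data generation, this paper proposes a distributed data generation model. According to the data distributions and correlations specified in the user configuration, the model applies reservoir sampling or random sampling to a Web information knowledge base, computes the required correlations, and concatenates the sampled values to generate data whose attributes conform to the user configuration. Data generation experiments on the Apache Spark distributed computing engine show that the generated data satisfies the specified distribution and correlation requirements, and that data generation speed is linearly related to data size and cluster size. These results indicate that the generated data has high accuracy and distributional diversity, and that the corresponding system scales well.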
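The abstract names reservoir sampling as one of the sampling strategies applied to the knowledge base but does not give an implementation. As a rough illustration only, the following single-machine sketch shows the classic reservoir sampling scheme (Algorithm R), which draws k items uniformly from a stream whose length is not known in advance. The function name `reservoir_sample` and its parameters are hypothetical and are not taken from the paper; the distributed Spark version described by the authors may differ.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Pick up to k items uniformly at random from a stream of unknown length.

    Hypothetical helper for illustration (Algorithm R); not the paper's code.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = random.Random(seed)  # local generator so a seed gives repeatable output
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Example: sample 5 records from a stand-in "knowledge base" of 1000 entries.
    print(reservoir_sample((f"record-{n}" for n in range(1000)), 5, seed=42))
```

Because each item is examined once and only k items are held in memory, this kind of sampler suits sources that are too large to load in full, which is presumably why it is attractive for sampling a large knowledge base.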