计算机科学 ›› 2019, Vol. 46 ›› Issue (12): 13-19.doi: 10.11896/jsjkx.190500155
所属专题: 数据库技术
齐文1, 鲍玉斌2, 宋杰3
QI Wen1, BAO Yu-bin2, SONG Jie3
摘要: 大数据时代的到来给传统的数据查询带来了性能挑战,即使查询算法有着O(n)的线性复杂度,但当n极大时其时间开销也难以满足用户需求。在很多实际应用中,人们并不需要精确的查询结果,但要求在给定时间内完成查询,因此可适当牺牲查询精度以满足性能约束。采样查询通过约简查询范围来提高查询性能,现有的采样方法多针对特定的算法和特定的应用场景,缺乏大数据环境下一般性的采样查询方法以及保证性能和精度的研究。文中研究大数据环境下列存储的采样查询处理,从数据划分和数据采样两方面改进大数据的查询效率。提出了基于加速比和势分布的采样方法,其支持各类采样算法,实现了分布式环境下采样查询的随机性保证、性能保证和近似性评价,并兼容了精确查询。该方法可以快速应用到已有大量数据的列存储中,具备良好的扩展性和可维护性。以Top-K为查询用例的实验结果证明,在不同数据量、不同数据分布和不同采样算法下,实际采样率与给定采样率的误差低于2%,查询准确度 (Accuracy) 稳定,方差在0.10和0.12之间,因此提出的基于段势的数据划分的采样效率高于平均划分和线性划分。
中图分类号:
[1]SHEN D R,YU G,WANG X T,et al.Survey on NoSQL for management of big data[J].Journal of Software,2013,24(8):1786-1803.(in Chinese) 申德荣,于戈,王习特,等.支持大数据管理的NoSQL系统研究综述[J].软件学报,2013(8):1786-1803.[2]LUO C,JIANG Z,HOU W C,et al.A sampling approach for skyline query cardinality estimation[J].Knowledge and Information Systems,2012,32(2):281-301.[3]GIBBON P B.Approximate Query Processing:Taming the TeraBytes[C]//International Conference on Vldb.DBLP,2001.[4]GRAHAM C,GAROFALAKIS M,HAAS P T,et al.Synopses for Massive Data:Samples,Histograms,Wavelets,Sketches[J].Foundations and Trends in Databases,2011,4(1/2/3):1-294.[5]CHAUDHURI S,DING B,KANDULA S.Approximate Query Processing:No Silver Bullet[C]//the 2017 ACM International Conference.ACM,2017.[6]LIU L,HU G.A Parameter-Free Linear Sampling Method[J].IEEE Access,2019,7:17935-17940.[7]ZHAO J,SUN J,ZHAI Y,et al.A Novel Clustering-Based Sampling Approach for Minimum Sample Set in Big Data Environment[J].International Journal of Pattern Recognition and Artificial Intelligence,2018,32(2):4.[8]HAMIDI H,MOUSAVI R.Analysis and Evaluation of a Framework for Sampling Database in Recommenders[J].Journal of Global Information Management,2018,26(1),41-57.[9]WU W,NAUGHTON J F,SINGH H.Sampling-based query re-optimization[C]//Proceedings of the 2016 International Conference on Management of Data.ACM,2016:1721-1736.[10]LI J,LIN J.Research on the Influence of Sampling Methods for the Accuracy of Web Services QoS Prediction[J].IEEE Access,2019,7:39990-39999.[11]DECASTRO-GARCÍA N,MUÑOZ CASTAÑEDA Á L,ES- CUDERO GARCÍA D,et al.Effect of the Sampling of a Dataset in the Hyperparameter Optimization Phase over the Efficiency of a Machine Learning Algorithm[J].Complexity,2019,2019:1-16.[12]LIU W,SU J.Online digital library sampling based on query related graph[J].The Electronic Library,2018,36(6):1082-1098.[13]STOEHR N,MEYER J,MARKL V,et al.Heatflip:Temporal-Spatial Sampling for Progressive Heat Maps on Social Media Data[C]//2018 IEEE International Conference on Big Data (Big Data).2018:3723-3732.[14]ZHANG J,NIU B.A clustering-based sampling method for building query response time models[J].Computer Systems Science and Engineering,2017,32(4):319-331.[15]HE Y ,HUANG J Z,LONG H,et al.I-Sampling:A New Block-Based Sampling Method for Large-Scale Dataset[C]//EEE International Congress on Big Data (BigData Congress).2017:360-367. |
[1] | 何强, 尹震宇, 黄敏, 王兴伟, 王源田, 崔硕, 赵勇. 基于大数据的进化网络影响力分析研究综述 Survey of Influence Analysis of Evolutionary Network Based on Big Data 计算机科学, 2022, 49(8): 1-11. https://doi.org/10.11896/jsjkx.210700240 |
[2] | 陈晶, 吴玲玲. 多源异构环境下的车联网大数据混合属性特征检测方法 Mixed Attribute Feature Detection Method of Internet of Vehicles Big Datain Multi-source Heterogeneous Environment 计算机科学, 2022, 49(8): 108-112. https://doi.org/10.11896/jsjkx.220300273 |
[3] | 王美珊, 姚兰, 高福祥, 徐军灿. 面向医疗集值数据的差分隐私保护技术研究 Study on Differential Privacy Protection for Medical Set-Valued Data 计算机科学, 2022, 49(4): 362-368. https://doi.org/10.11896/jsjkx.210300032 |
[4] | 孙轩, 王焕骁. 政务大数据安全防护能力建设:基于技术和管理视角的探讨 Capability Building for Government Big Data Safety Protection:Discussions from Technologicaland Management Perspectives 计算机科学, 2022, 49(4): 67-73. https://doi.org/10.11896/jsjkx.211000010 |
[5] | 王俊, 王修来, 庞威, 赵鸿飞. 面向科技前瞻预测的大数据治理研究 Research on Big Data Governance for Science and Technology Forecast 计算机科学, 2021, 48(9): 36-42. https://doi.org/10.11896/jsjkx.210500207 |
[6] | 余乐章, 夏天宇, 荆一楠, 何震瀛, 王晓阳. 面向大数据分析的智能交互向导系统 Smart Interactive Guide System for Big Data Analytics 计算机科学, 2021, 48(9): 110-117. https://doi.org/10.11896/jsjkx.200900083 |
[7] | 王立梅, 朱旭光, 汪德嘉, 张勇, 邢春晓. 基于深度学习的民事案件判决结果分类方法研究 Study on Judicial Data Classification Method Based on Natural Language Processing Technologies 计算机科学, 2021, 48(8): 80-85. https://doi.org/10.11896/jsjkx.210300130 |
[8] | 王雪岑, 张昱, 刘迎婕, 于戈. 基于表示学习的在线学习交互质量评价方法 Evaluation of Quality of Interaction in Online Learning Based on Representation Learning 计算机科学, 2021, 48(2): 207-211. https://doi.org/10.11896/jsjkx.201000042 |
[9] | 滕建, 滕飞, 李天瑞. 基于3D卷积和LSTM编码解码的出行需求预测 Travel Demand Forecasting Based on 3D Convolution and LSTM Encoder-Decoder 计算机科学, 2021, 48(12): 195-203. https://doi.org/10.11896/jsjkx.210400022 |
[10] | 张育龙, 王强, 陈明康, 孙静涛. 图像去雨算法在云物联网应用中的研究综述 Survey of Intelligent Rain Removal Algorithms for Cloud-IoT Systems 计算机科学, 2021, 48(12): 231-242. https://doi.org/10.11896/jsjkx.201000055 |
[11] | 曹萌, 于洋, 梁英, 史红周. 基于区块链的大数据交易关键技术与发展趋势 Key Technologies and Development Trends of Big Data Trade Based on Blockchain 计算机科学, 2021, 48(11A): 184-190. https://doi.org/10.11896/jsjkx.210100163 |
[12] | 刘亚臣, 黄雪莹. 卫星监测时空大数据蠕变特征提取及预警算法 Research on Creep Feature Extraction and Early Warning Algorithm Based on Satellite MonitoringSpatial-Temporal Big Data 计算机科学, 2021, 48(11A): 258-264. https://doi.org/10.11896/jsjkx.201000071 |
[13] | 张光君, 张翔. 应用“大数据+区块链”优化立法评估制度的机理与路径 Mechanism and Path of Optimizing Institution of Legislative Evaluation by Applying “Big Data+Blockchain” 计算机科学, 2021, 48(10): 324-333. https://doi.org/10.11896/jsjkx.201200105 |
[14] | 叶雅珍, 刘国华, 朱扬勇. 数据产品流通的两阶段授权模式 Two-step Authorization Pattern of Data Product Circulation 计算机科学, 2021, 48(1): 119-124. https://doi.org/10.11896/jsjkx.191100217 |
[15] | 赵会群, 吴凯锋. 一种大数据估价算法 Big Data Valuation Algorithm 计算机科学, 2020, 47(9): 110-116. https://doi.org/10.11896/jsjkx.191000156 |
|