计算机科学 ›› 2019, Vol. 46 ›› Issue (12): 13-19.doi: 10.11896/jsjkx.190500155

所属专题: 数据库技术

• 大数据与数据科学 • 上一篇    下一篇

基于列存储的大数据采样查询处理

齐文1, 鲍玉斌2, 宋杰3   

  1. (辽东学院信息工程学院 辽宁 丹东118000)1;
    (东北大学计算机科学与工程学院 沈阳110819)2;
    (东北大学软件学院 沈阳110819)3
  • 收稿日期:2019-05-28 出版日期:2019-12-15 发布日期:2019-12-17
  • 通讯作者: 齐文(1974-),男,硕士,副教授,CCF会员,主要研究方向为大数据查询与分析,E-mail:ddqiwen@163.com。
  • 作者简介:鲍玉斌(1968-),男,博士,教授,CCF高级会员,主要研究方向为数据仓库;宋杰(1980-),男,博士,教授,CCF高级会员,主要研究方向为大数据存储与管理、高能效计算。
  • 基金资助:
    本文受国家自然科学基金(61672143,61433008)资助。

Column-oriented Store Based Sampling Query Process on Big Data

QI Wen1, BAO Yu-bin2, SONG Jie3   

  1. ( School of Information and Engineering,Eastern Liaoning University,Dandong,Liaoning 118000,China)1;
    ( School of Computer Science and Engineering,Northeastern University,Shenyang 110819,China)2;
    ( Software College,Northeastern University,Shenyang 110819,China)3
  • Received:2019-05-28 Online:2019-12-15 Published:2019-12-17

摘要: 大数据时代的到来给传统的数据查询带来了性能挑战,即使查询算法有着O(n)的线性复杂度,但当n极大时其时间开销也难以满足用户需求。在很多实际应用中,人们并不需要精确的查询结果,但要求在给定时间内完成查询,因此可适当牺牲查询精度以满足性能约束。采样查询通过约简查询范围来提高查询性能,现有的采样方法多针对特定的算法和特定的应用场景,缺乏大数据环境下一般性的采样查询方法以及保证性能和精度的研究。文中研究大数据环境下列存储的采样查询处理,从数据划分和数据采样两方面改进大数据的查询效率。提出了基于加速比和势分布的采样方法,其支持各类采样算法,实现了分布式环境下采样查询的随机性保证、性能保证和近似性评价,并兼容了精确查询。该方法可以快速应用到已有大量数据的列存储中,具备良好的扩展性和可维护性。以Top-K为查询用例的实验结果证明,在不同数据量、不同数据分布和不同采样算法下,实际采样率与给定采样率的误差低于2%,查询准确度 (Accuracy) 稳定,方差在0.10和0.12之间,因此提出的基于段势的数据划分的采样效率高于平均划分和线性划分。

关键词: 采样查询, 大数据, 加速比, 列存储, 数据划分

Abstract: The era of big data bring performance challenges to traditional data query,even if the query algorithm is O(n) linear complexity,but when the n is extremely large,its time cost is also unbearable.In many practical applications,exact query results may be unnecessary but the queries should be accomplished at a given time,so appropriately losing the query accuracy is acceptable to meet performance constraints.Sampling queries can improve query perfor-mance by reducing query ranges.Existing researches are often studied for specific algorithms and specific application scenarios,and there is a lack of research on general sampling and query methods in the big data environment,as well as research on performance and accuracy guarantee.This paper studied the sampling and query processing in the big data environment,which improves the query efficiency of big data from data partition and data reduction.This paper proposed a sampling method based on speedup and potential distribution,which supports all kinds of sampling algorithms,and achieves randomicity guarantee,performance assurance and approximation evaluation of sampling queries in distri-buted environment,and is compatible with precise queries.This method can be applied to the column store for the big data with good expansibility and maintainability.The experimental results show that as the Top-K query case,the proposed method has better loading performance,while the sampling errors are less than 2%,and the variances of query accuracy are between 0.1 and 0.12 under various sampling rates,data volumes and sampling algorithms.The sampling efficiency of proposed partition is also higher than that of linear partition based or uniform partition based sampling.

Key words: Accumulation ratio, Big data, Column-oriented store, Data partitioning, Sampling query

中图分类号: 

  • TP391
[1]SHEN D R,YU G,WANG X T,et al.Survey on NoSQL for management of big data[J].Journal of Software,2013,24(8):1786-1803.(in Chinese)
申德荣,于戈,王习特,等.支持大数据管理的NoSQL系统研究综述[J].软件学报,2013(8):1786-1803.
[2]LUO C,JIANG Z,HOU W C,et al.A sampling approach for skyline query cardinality estimation[J].Knowledge and Information Systems,2012,32(2):281-301.
[3]GIBBON P B.Approximate Query Processing:Taming the TeraBytes[C]//International Conference on Vldb.DBLP,2001.
[4]GRAHAM C,GAROFALAKIS M,HAAS P T,et al.Synopses for Massive Data:Samples,Histograms,Wavelets,Sketches[J].Foundations and Trends in Databases,2011,4(1/2/3):1-294.
[5]CHAUDHURI S,DING B,KANDULA S.Approximate Query Processing:No Silver Bullet[C]//the 2017 ACM International Conference.ACM,2017.
[6]LIU L,HU G.A Parameter-Free Linear Sampling Method[J].IEEE Access,2019,7:17935-17940.
[7]ZHAO J,SUN J,ZHAI Y,et al.A Novel Clustering-Based Sampling Approach for Minimum Sample Set in Big Data Environment[J].International Journal of Pattern Recognition and Artificial Intelligence,2018,32(2):4.
[8]HAMIDI H,MOUSAVI R.Analysis and Evaluation of a Framework for Sampling Database in Recommenders[J].Journal of Global Information Management,2018,26(1),41-57.
[9]WU W,NAUGHTON J F,SINGH H.Sampling-based query re-optimization[C]//Proceedings of the 2016 International Conference on Management of Data.ACM,2016:1721-1736.
[10]LI J,LIN J.Research on the Influence of Sampling Methods for the Accuracy of Web Services QoS Prediction[J].IEEE Access,2019,7:39990-39999.
[11]DECASTRO-GARCÍA N,MUÑOZ CASTAÑEDA Á L,ES- CUDERO GARCÍA D,et al.Effect of the Sampling of a Dataset in the Hyperparameter Optimization Phase over the Efficiency of a Machine Learning Algorithm[J].Complexity,2019,2019:1-16.
[12]LIU W,SU J.Online digital library sampling based on query related graph[J].The Electronic Library,2018,36(6):1082-1098.
[13]STOEHR N,MEYER J,MARKL V,et al.Heatflip:Temporal-Spatial Sampling for Progressive Heat Maps on Social Media Data[C]//2018 IEEE International Conference on Big Data (Big Data).2018:3723-3732.
[14]ZHANG J,NIU B.A clustering-based sampling method for building query response time models[J].Computer Systems Science and Engineering,2017,32(4):319-331.
[15]HE Y ,HUANG J Z,LONG H,et al.I-Sampling:A New Block-Based Sampling Method for Large-Scale Dataset[C]//EEE International Congress on Big Data (BigData Congress).2017:360-367.
[1] 何强, 尹震宇, 黄敏, 王兴伟, 王源田, 崔硕, 赵勇.
基于大数据的进化网络影响力分析研究综述
Survey of Influence Analysis of Evolutionary Network Based on Big Data
计算机科学, 2022, 49(8): 1-11. https://doi.org/10.11896/jsjkx.210700240
[2] 陈晶, 吴玲玲.
多源异构环境下的车联网大数据混合属性特征检测方法
Mixed Attribute Feature Detection Method of Internet of Vehicles Big Datain Multi-source Heterogeneous Environment
计算机科学, 2022, 49(8): 108-112. https://doi.org/10.11896/jsjkx.220300273
[3] 王美珊, 姚兰, 高福祥, 徐军灿.
面向医疗集值数据的差分隐私保护技术研究
Study on Differential Privacy Protection for Medical Set-Valued Data
计算机科学, 2022, 49(4): 362-368. https://doi.org/10.11896/jsjkx.210300032
[4] 孙轩, 王焕骁.
政务大数据安全防护能力建设:基于技术和管理视角的探讨
Capability Building for Government Big Data Safety Protection:Discussions from Technologicaland Management Perspectives
计算机科学, 2022, 49(4): 67-73. https://doi.org/10.11896/jsjkx.211000010
[5] 王俊, 王修来, 庞威, 赵鸿飞.
面向科技前瞻预测的大数据治理研究
Research on Big Data Governance for Science and Technology Forecast
计算机科学, 2021, 48(9): 36-42. https://doi.org/10.11896/jsjkx.210500207
[6] 余乐章, 夏天宇, 荆一楠, 何震瀛, 王晓阳.
面向大数据分析的智能交互向导系统
Smart Interactive Guide System for Big Data Analytics
计算机科学, 2021, 48(9): 110-117. https://doi.org/10.11896/jsjkx.200900083
[7] 王立梅, 朱旭光, 汪德嘉, 张勇, 邢春晓.
基于深度学习的民事案件判决结果分类方法研究
Study on Judicial Data Classification Method Based on Natural Language Processing Technologies
计算机科学, 2021, 48(8): 80-85. https://doi.org/10.11896/jsjkx.210300130
[8] 王雪岑, 张昱, 刘迎婕, 于戈.
基于表示学习的在线学习交互质量评价方法
Evaluation of Quality of Interaction in Online Learning Based on Representation Learning
计算机科学, 2021, 48(2): 207-211. https://doi.org/10.11896/jsjkx.201000042
[9] 滕建, 滕飞, 李天瑞.
基于3D卷积和LSTM编码解码的出行需求预测
Travel Demand Forecasting Based on 3D Convolution and LSTM Encoder-Decoder
计算机科学, 2021, 48(12): 195-203. https://doi.org/10.11896/jsjkx.210400022
[10] 张育龙, 王强, 陈明康, 孙静涛.
图像去雨算法在云物联网应用中的研究综述
Survey of Intelligent Rain Removal Algorithms for Cloud-IoT Systems
计算机科学, 2021, 48(12): 231-242. https://doi.org/10.11896/jsjkx.201000055
[11] 曹萌, 于洋, 梁英, 史红周.
基于区块链的大数据交易关键技术与发展趋势
Key Technologies and Development Trends of Big Data Trade Based on Blockchain
计算机科学, 2021, 48(11A): 184-190. https://doi.org/10.11896/jsjkx.210100163
[12] 刘亚臣, 黄雪莹.
卫星监测时空大数据蠕变特征提取及预警算法
Research on Creep Feature Extraction and Early Warning Algorithm Based on Satellite MonitoringSpatial-Temporal Big Data
计算机科学, 2021, 48(11A): 258-264. https://doi.org/10.11896/jsjkx.201000071
[13] 张光君, 张翔.
应用“大数据+区块链”优化立法评估制度的机理与路径
Mechanism and Path of Optimizing Institution of Legislative Evaluation by Applying “Big Data+Blockchain”
计算机科学, 2021, 48(10): 324-333. https://doi.org/10.11896/jsjkx.201200105
[14] 叶雅珍, 刘国华, 朱扬勇.
数据产品流通的两阶段授权模式
Two-step Authorization Pattern of Data Product Circulation
计算机科学, 2021, 48(1): 119-124. https://doi.org/10.11896/jsjkx.191100217
[15] 赵会群, 吴凯锋.
一种大数据估价算法
Big Data Valuation Algorithm
计算机科学, 2020, 47(9): 110-116. https://doi.org/10.11896/jsjkx.191000156
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!