计算机科学 ›› 2019, Vol. 46 ›› Issue (12): 13-19.doi: 10.11896/jsjkx.190500155

• 大数据与数据科学 • 上一篇    下一篇

基于列存储的大数据采样查询处理

齐文1, 鲍玉斌2, 宋杰3   

  1. (辽东学院信息工程学院 辽宁 丹东118000)1;
    (东北大学计算机科学与工程学院 沈阳110819)2;
    (东北大学软件学院 沈阳110819)3
  • 收稿日期:2019-05-28 出版日期:2019-12-15 发布日期:2019-12-17
  • 通讯作者: 齐文(1974-),男,硕士,副教授,CCF会员,主要研究方向为大数据查询与分析,E-mail:ddqiwen@163.com。
  • 作者简介:鲍玉斌(1968-),男,博士,教授,CCF高级会员,主要研究方向为数据仓库;宋杰(1980-),男,博士,教授,CCF高级会员,主要研究方向为大数据存储与管理、高能效计算。
  • 基金资助:
    本文受国家自然科学基金(61672143,61433008)资助。

Column-oriented Store Based Sampling Query Process on Big Data

QI Wen1, BAO Yu-bin2, SONG Jie3   

  1. ( School of Information and Engineering,Eastern Liaoning University,Dandong,Liaoning 118000,China)1;
    ( School of Computer Science and Engineering,Northeastern University,Shenyang 110819,China)2;
    ( Software College,Northeastern University,Shenyang 110819,China)3
  • Received:2019-05-28 Online:2019-12-15 Published:2019-12-17

摘要: 大数据时代的到来给传统的数据查询带来了性能挑战,即使查询算法有着O(n)的线性复杂度,但当n极大时其时间开销也难以满足用户需求。在很多实际应用中,人们并不需要精确的查询结果,但要求在给定时间内完成查询,因此可适当牺牲查询精度以满足性能约束。采样查询通过约简查询范围来提高查询性能,现有的采样方法多针对特定的算法和特定的应用场景,缺乏大数据环境下一般性的采样查询方法以及保证性能和精度的研究。文中研究大数据环境下列存储的采样查询处理,从数据划分和数据采样两方面改进大数据的查询效率。提出了基于加速比和势分布的采样方法,其支持各类采样算法,实现了分布式环境下采样查询的随机性保证、性能保证和近似性评价,并兼容了精确查询。该方法可以快速应用到已有大量数据的列存储中,具备良好的扩展性和可维护性。以Top-K为查询用例的实验结果证明,在不同数据量、不同数据分布和不同采样算法下,实际采样率与给定采样率的误差低于2%,查询准确度 (Accuracy) 稳定,方差在0.10和0.12之间,因此提出的基于段势的数据划分的采样效率高于平均划分和线性划分。

关键词: 大数据, 列存储, 采样查询, 数据划分, 加速比

Abstract: The era of big data bring performance challenges to traditional data query,even if the query algorithm is O(n) linear complexity,but when the n is extremely large,its time cost is also unbearable.In many practical applications,exact query results may be unnecessary but the queries should be accomplished at a given time,so appropriately losing the query accuracy is acceptable to meet performance constraints.Sampling queries can improve query perfor-mance by reducing query ranges.Existing researches are often studied for specific algorithms and specific application scenarios,and there is a lack of research on general sampling and query methods in the big data environment,as well as research on performance and accuracy guarantee.This paper studied the sampling and query processing in the big data environment,which improves the query efficiency of big data from data partition and data reduction.This paper proposed a sampling method based on speedup and potential distribution,which supports all kinds of sampling algorithms,and achieves randomicity guarantee,performance assurance and approximation evaluation of sampling queries in distri-buted environment,and is compatible with precise queries.This method can be applied to the column store for the big data with good expansibility and maintainability.The experimental results show that as the Top-K query case,the proposed method has better loading performance,while the sampling errors are less than 2%,and the variances of query accuracy are between 0.1 and 0.12 under various sampling rates,data volumes and sampling algorithms.The sampling efficiency of proposed partition is also higher than that of linear partition based or uniform partition based sampling.

Key words: Big data, Column-oriented store, Sampling query, Data partitioning, Accumulation ratio

中图分类号: 

  • TP391
[1] SHEN D R,YU G,WANG X T,et al.Survey on NoSQL for management of big data[J].Journal of Software,2013,24(8):1786-1803.(in Chinese) 申德荣,于戈,王习特,等.支持大数据管理的NoSQL系统研究综述[J].软件学报,2013(8):1786-1803.
[2] LUO C,JIANG Z,HOU W C,et al.A sampling approach for skyline query cardinality estimation[J].Knowledge and Information Systems,2012,32(2):281-301.
[3] GIBBON P B.Approximate Query Processing:Taming the TeraBytes[C]//International Conference on Vldb.DBLP,2001.
[4] GRAHAM C,GAROFALAKIS M,HAAS P T,et al.Synopses for Massive Data:Samples,Histograms,Wavelets,Sketches[J].Foundations and Trends in Databases,2011,4(1/2/3):1-294.
[5] CHAUDHURI S,DING B,KANDULA S.Approximate Query Processing:No Silver Bullet[C]//the 2017 ACM International Conference.ACM,2017.
[6] LIU L,HU G.A Parameter-Free Linear Sampling Method[J].IEEE Access,2019,7:17935-17940.
[7] ZHAO J,SUN J,ZHAI Y,et al.A Novel Clustering-Based Sampling Approach for Minimum Sample Set in Big Data Environment[J].International Journal of Pattern Recognition and Artificial Intelligence,2018,32(2):4.
[8] HAMIDI H,MOUSAVI R.Analysis and Evaluation of a Framework for Sampling Database in Recommenders[J].Journal of Global Information Management,2018,26(1),41-57.
[9] WU W,NAUGHTON J F,SINGH H.Sampling-based query re-optimization[C]//Proceedings of the 2016 International Conference on Management of Data.ACM,2016:1721-1736.
[10] LI J,LIN J.Research on the Influence of Sampling Methods for the Accuracy of Web Services QoS Prediction[J].IEEE Access,2019,7:39990-39999.
[11] DECASTRO-GARCÍA N,MUÑOZ CASTAÑEDA Á L,ES- CUDERO GARCÍA D,et al.Effect of the Sampling of a Dataset in the Hyperparameter Optimization Phase over the Efficiency of a Machine Learning Algorithm[J].Complexity,2019,2019:1-16.
[12] LIU W,SU J.Online digital library sampling based on query related graph[J].The Electronic Library,2018,36(6):1082-1098.
[13] STOEHR N,MEYER J,MARKL V,et al.Heatflip:Temporal-Spatial Sampling for Progressive Heat Maps on Social Media Data[C]//2018 IEEE International Conference on Big Data (Big Data).2018:3723-3732.
[14] ZHANG J,NIU B.A clustering-based sampling method for building query response time models[J].Computer Systems Science and Engineering,2017,32(4):319-331.
[15] HE Y ,HUANG J Z,LONG H,et al.I-Sampling:A New Block-Based Sampling Method for Large-Scale Dataset[C]//EEE International Congress on Big Data (BigData Congress).2017:360-367.
[1] 叶雅珍, 刘国华, 朱扬勇. 数据产品流通的两阶段授权模式[J]. 计算机科学, 2021, 48(1): 119-124.
[2] 赵会群, 吴凯锋. 一种大数据估价算法[J]. 计算机科学, 2020, 47(9): 110-116.
[3] 马梦宇, 吴烨, 陈荦, 伍江江, 李军, 景宁. 显示导向型的大规模地理矢量实时可视化技术[J]. 计算机科学, 2020, 47(9): 117-122.
[4] 朝乐门. 数据科学导论的课程设计及教学改革[J]. 计算机科学, 2020, 47(7): 1-7.
[5] 顾荣杰, 吴治平, 石焕. 基于TFR 模型的公安云平台数据分级分类安全访问控制模型研究[J]. 计算机科学, 2020, 47(6A): 400-403.
[6] 李泳. 基于BigQuant 大数据平台的股票投资策略开发[J]. 计算机科学, 2020, 47(6A): 612-615.
[7] 葛雨明, 韩庆文, 王妙琼, 曾令秋, 李璐. 汽车大数据应用模式与挑战分析[J]. 计算机科学, 2020, 47(6): 59-65.
[8] 刘纪芹, 史开泉. 大数据分解-融合及其智能获取[J]. 计算机科学, 2020, 47(6): 66-73.
[9] 曾伟良, 吴淼森, 孙为军, 谢胜利. 自动驾驶出租车调度系统研究综述[J]. 计算机科学, 2020, 47(5): 181-189.
[10] 邓定胜. 一种改进的DBSCAN算法在Spark平台上的应用[J]. 计算机科学, 2020, 47(11A): 425-429.
[11] 禹鑫燚, 施甜峰, 唐权瑞, 殷慧武, 欧林林. 面向预测性维护的工业设备管理系统[J]. 计算机科学, 2020, 47(11A): 667-672.
[12] 郝秀梅, 史开泉. 大数据智能检索与大数据区块元智能分离[J]. 计算机科学, 2020, 47(11): 113-121.
[13] 魏霖静, 宁璐璐, 郭斌, 侯振兴, 甘诗润. 基于混合蛙跳算法的K-mediods聚类挖掘与并行优化[J]. 计算机科学, 2020, 47(10): 126-129.
[14] 汪洋, 李鹏, 季一木, 樊卫北, 张玉杰, 王汝传, 陈国良. 高性能计算与天文大数据研究综述[J]. 计算机科学, 2020, 47(1): 1-6.
[15] 孔繁钰, 周愉峰, 陈纲. 基于时空特征挖掘的交通流量预测方法[J]. 计算机科学, 2019, 46(7): 322-326.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 雷丽晖,王静. 可能性测度下的LTL模型检测并行化研究[J]. 计算机科学, 2018, 45(4): 71 -75 .
[2] 孙启,金燕,何琨,徐凌轩. 用于求解混合车辆路径问题的混合进化算法[J]. 计算机科学, 2018, 45(4): 76 -82 .
[3] 张佳男,肖鸣宇. 带权混合支配问题的近似算法研究[J]. 计算机科学, 2018, 45(4): 83 -88 .
[4] 伍建辉,黄中祥,李武,吴健辉,彭鑫,张生. 城市道路建设时序决策的鲁棒优化[J]. 计算机科学, 2018, 45(4): 89 -93 .
[5] 史雯隽,武继刚,罗裕春. 针对移动云计算任务迁移的快速高效调度算法[J]. 计算机科学, 2018, 45(4): 94 -99 .
[6] 周燕萍,业巧林. 基于L1-范数距离的最小二乘对支持向量机[J]. 计算机科学, 2018, 45(4): 100 -105 .
[7] 刘博艺,唐湘滟,程杰仁. 基于多生长时期模板匹配的玉米螟识别方法[J]. 计算机科学, 2018, 45(4): 106 -111 .
[8] 耿海军,施新刚,王之梁,尹霞,尹少平. 基于有向无环图的互联网域内节能路由算法[J]. 计算机科学, 2018, 45(4): 112 -116 .
[9] 崔琼,李建华,王宏,南明莉. 基于节点修复的网络化指挥信息系统弹性分析模型[J]. 计算机科学, 2018, 45(4): 117 -121 .
[10] 王振朝,侯欢欢,连蕊. 抑制CMT中乱序程度的路径优化方案[J]. 计算机科学, 2018, 45(4): 122 -125 .