Computer Science, 2019, Vol. 46, Issue (12): 13-19. doi: 10.11896/jsjkx.190500155

Special Issue: Database Technology

• Big Data & Data Science •

Column-oriented Store Based Sampling Query Process on Big Data

QI Wen1, BAO Yu-bin2, SONG Jie3   

  1. School of Information and Engineering, Eastern Liaoning University, Dandong, Liaoning 118000, China
  2. School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China
  3. Software College, Northeastern University, Shenyang 110819, China
  • Received: 2019-05-28  Online: 2019-12-15  Published: 2019-12-17

Abstract: The era of big data brings performance challenges to traditional data queries: even when a query algorithm has linear O(n) complexity, its time cost becomes unbearable when n is extremely large. In many practical applications, exact query results are unnecessary, but queries must finish within a given time, so trading some query accuracy to meet performance constraints is acceptable. Sampling queries improve query performance by reducing the range of data that is queried. Existing studies often target specific algorithms and specific application scenarios; research on general sampling and query methods for the big data environment, and on guarantees of performance and accuracy, is lacking. This paper studies sampling and query processing in the big data environment, improving the query efficiency of big data through data partitioning and data reduction. It proposes a sampling method based on speedup and potential distribution that supports all kinds of sampling algorithms, provides a randomness guarantee, performance assurance and approximation evaluation for sampling queries in a distributed environment, and remains compatible with precise queries. The method can be applied to column stores for big data, with good extensibility and maintainability. Experimental results show that, taking the Top-K query as a case, the proposed method has better loading performance, while the sampling errors are below 2% and the variances of query accuracy are between 0.1 and 0.12 under various sampling rates, data volumes and sampling algorithms. The sampling efficiency of the proposed partitioning is also higher than that of sampling based on linear partitioning or uniform partitioning.
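
The general idea of a sampling query can be illustrated with a minimal Python sketch. It is not the partition-based method proposed in the paper: it assumes a single in-memory column, plain uniform random sampling, and a Top-K query defined as "the k most frequent values". All names (`exact_top_k`, `sampled_top_k`, `demo_column`) and the synthetic data are hypothetical and introduced only for illustration.

```python
import random
from collections import Counter


def exact_top_k(column, k):
    """Exact Top-K most frequent values; scans every row of the column."""
    return [value for value, _ in Counter(column).most_common(k)]


def sampled_top_k(column, k, rate, seed=0):
    """Approximate Top-K computed on a uniform random sample of the column.

    `rate` is the sampling rate in (0, 1]; only about rate * n rows are
    counted, which is where the performance gain comes from.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    sample_size = max(1, int(len(column) * rate))
    sample = rng.sample(column, sample_size)  # sampling without replacement
    return [value for value, _ in Counter(sample).most_common(k)]


def overlap(exact, approx):
    """Fraction of the exact Top-K values that the approximation recovered."""
    return len(set(exact) & set(approx)) / len(exact)


def demo_column():
    """Synthetic skewed column (an assumption, not data from the paper)."""
    column = ["a"] * 50_000 + ["b"] * 30_000 + ["c"] * 15_000
    column += [f"v{i}" for i in range(50) for _ in range(100)]
    return column


if __name__ == "__main__":
    col = demo_column()
    exact = exact_top_k(col, 3)
    approx = sampled_top_k(col, 3, rate=0.05)
    print(exact, approx, overlap(exact, approx))
```

On strongly skewed data such as this, a 5% sample is very likely to recover the same Top-K values as the full scan. On flatter distributions the approximation degrades, which is why the abstract emphasizes accuracy evaluation alongside performance.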

Key words: Accumulation ratio, Big data, Column-oriented store, Data partitioning, Sampling query

CLC Number: TP391