计算机科学 ›› 2017, Vol. 44 ›› Issue (5): 172-177.doi: 10.11896/j.issn.1002-137X.2017.05.031

• 软件与数据库技术 • 上一篇    下一篇

分布式流数据加载和查询技术优化

易佳,薛晨,王树鹏   

  1. 中国科学院信息工程研究所 北京100093,国家计算机网络与信息安全管理中心 北京100029,中国科学院信息工程研究所 北京100093
  • 出版日期:2018-11-13 发布日期:2018-11-13
  • 基金资助:
    本文受国家自然科学基金(61271275,61202067)资助

Optimization on Distributed Stream Data Loading and Querying

YI Jia, XUE Chen and WANG Shu-peng   

  • Online:2018-11-13 Published:2018-11-13

摘要: 分布式流查询是一种基于数据流的实时查询计算方法,近年来得到了广泛的关注和快速发展。综述了分布式流处理框架在实时关系型查询上取得的研究成果;对涉及分布式数据加载、分布式流计算框架、分布式流查询的产品进行了分析和比较;提出了基于Spark Streaming和Apache Kafka构建的分布式流查询模型,以并发加载多个文件源的形式,设计内存文件系统实现数据的快速加载,相较于基于Apache Flume的加载技术提速1倍以上。在Spark Streaming的基础上,实现了基于Spark SQL的分布式流查询接口,并提出了自行编码解析SQL语句的方法,实现了分布式查询。测试结果表明,在查询语句复杂的情况下,自行编码解析SQL的查询效率具有明显的优势。

关键词: 大数据,流处理系统,分布式流查询,查询优化,Kafka快速加载

Abstract: Distributed stream query is a kind of real-time query computation method based on data stream,which has been widely concerned and developed rapidly in recent years.This paper summarized the research results of the distributed stream processing framework in real-time relational query.There is an in-depth comparison of some products,including the distributed data loading framework,distributed stream computing framework and distributed stream query systems.The paper proposed a distributed stream query model based on Spark Streaming and Apache Kafka,and designed a fast data loading technology based on virtual memory file system,which gets the data loading speed one time faster compare to Apache Flume.On the basis of Spark Streaming,a distributed stream query interface based on Spark SQL was realized,and a method for parsing SQL queries was proposed to implement distributed query in data stream.The experiment results demonstrate that,in the case of complex SQL queries,the method of analyzing SQL by writing code by oneself has obvious advantages.

Key words: Big data,Stream processing system,Distributed stream query,Query optimization,Kafka fast loading

[1] CHEN H M,RICK K,SERGE H.Agile Big Data Analytics Development:An Architecture-Centric Approach[C]∥2016 49th Hawaii International Conference on System Sciences (HICSS).IEEE,2016.
[2] DEAN,JEFFREY,SANJAY G.MapReduce:simplified data pro-cessing on large clusters[J].Communications of the ACM,2008,51(1):107-113.
[3] TOSHNIWAL A,TANEJA S,SHUKLA A,et al.Stormtwit-ter[C]∥ACM SIGMOD International Conference on Management of Data.ACM,2014:147-156.
[4] KREPS,JAY,NEHA N,et al.Kafka:A distributed messaging system for log processing[C]∥Proceedings of the NetDB.2011.
[5] VAVILAPALLI,VINOD K,et al.Apache hadoop yarn:Yetanother resource negotiator[C]∥Proceedings of the 4th An-nual Symposium on Cloud Computing.ACM,2013.
[6] http://samza.apache.org.
[7] WANG C K,MENG X F.Relational Query Techniques for Distributed Data Stream:A Survey [J].Chinese Journal of Compu-ters,2016,39(1):80-96.(in Chinese) 王春凯,孟小峰.分布式数据流关系查询技术研究[J].计算机学报,2016,39(1):80-96.
[8] RocketMQ.https://github.com/alibaba/rocketmq.
[9] RabbitMQ.https://www.rabbitmq.com.
[10] ZAHARIA,MATEI,et al.Discretized streams:Fault-tolerant st-reaming computation at scale[C]∥Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles.ACM,2013.
[11] ARMBRUST,MICHAEL,et al.Spark sql:Relational data processing in spark[C]∥Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data.ACM,2015.
[12] StreamingSQL.https://github.com/Intel-bigdata/spark-streamingsql.
[13] Squall.https://github.com/epfldata/squall.
[14] Flink.http://flink.apache.org.
[15] ZAHARIA,MATEI,et al.Resilient distributed datasets:A fau-lt-tolerant abstraction for in-memory cluster computing[C]∥Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation.USENIX Association,2012.
[16] HOFFMAN S.Apache Flume:Distributed Log Collection forHadoop[M].Packt Publishing Ltd,2013.
[17] Kafka_flume.http://www.cloudera.com/documentation/kafka/lat-est/topics/kafka_flume.html.
[18] Kafkacat.https://github.com/edenhill/kafkacat.
[19] KafkaProducer.https://kafka.apache.org/090/javadoc/in- dex.html?org/apache/kafka/clients/producer/KafkaProducer.html.
[20] SNYDER,PETER.tmpfs:A virtual memory file system[C]∥Proceedings of the Autumn 1990 EUUG Conference.1990.
[21] GRAEFE G,MCKENNA W J.The Volcano optimizer generator:Extensibility and efficient search[C]∥International Confe-rence on Data Engineering.IEEE Xplore,1993:209-218.

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 雷丽晖,王静. 可能性测度下的LTL模型检测并行化研究[J]. 计算机科学, 2018, 45(4): 71 -75, 88 .
[2] 夏庆勋,庄毅. 一种基于局部性原理的远程验证机制[J]. 计算机科学, 2018, 45(4): 148 -151, 162 .
[3] 厉柏伸,李领治,孙涌,朱艳琴. 基于伪梯度提升决策树的内网防御算法[J]. 计算机科学, 2018, 45(4): 157 -162 .
[4] 王欢,张云峰,张艳. 一种基于CFDs规则的修复序列快速判定方法[J]. 计算机科学, 2018, 45(3): 311 -316 .
[5] 孙启,金燕,何琨,徐凌轩. 用于求解混合车辆路径问题的混合进化算法[J]. 计算机科学, 2018, 45(4): 76 -82 .
[6] 张佳男,肖鸣宇. 带权混合支配问题的近似算法研究[J]. 计算机科学, 2018, 45(4): 83 -88 .
[7] 伍建辉,黄中祥,李武,吴健辉,彭鑫,张生. 城市道路建设时序决策的鲁棒优化[J]. 计算机科学, 2018, 45(4): 89 -93 .
[8] 刘琴. 计算机取证过程中基于约束的数据质量问题研究[J]. 计算机科学, 2018, 45(4): 169 -172 .
[9] 钟菲,杨斌. 基于主成分分析网络的车牌检测方法[J]. 计算机科学, 2018, 45(3): 268 -273 .
[10] 史雯隽,武继刚,罗裕春. 针对移动云计算任务迁移的快速高效调度算法[J]. 计算机科学, 2018, 45(4): 94 -99, 116 .