Computer Science ›› 2017, Vol. 44 ›› Issue (5): 172-177. doi: 10.11896/j.issn.1002-137X.2017.05.031
易佳,薛晨,王树鹏
YI Jia, XUE Chen and WANG Shu-peng
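Abstract: Distributed stream querying is a real-time query computation method based on data streams, and it has attracted wide attention and developed rapidly in recent years. This paper surveys the research results that distributed stream processing frameworks have achieved in real-time relational querying, and analyzes and compares products for distributed data loading, distributed stream computing, and distributed stream querying. It then proposes a distributed stream query model built on Spark Streaming and Apache Kafka. The model loads multiple file sources concurrently and uses a purpose-designed in-memory file system for fast data loading, which more than doubles the loading speed relative to loading based on Apache Flume. On top of Spark Streaming, a distributed stream query interface based on Spark SQL is implemented, and a method that parses SQL statements with hand-written code is proposed to carry out distributed queries. Test results show that when the query statements are complex, the hand-coded SQL parsing approach has a clear advantage in query efficiency.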
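The concurrent loading of multiple file sources mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `load_file` and `load_concurrently` are hypothetical, a standard-library `queue.Queue` stands in for a Kafka topic so that the example runs without a broker, and staging the input files on a memory-backed file system (such as a tmpfs mount) is assumed to happen outside this code.

```python
import queue
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def load_file(path, sink):
    """Read one file source line by line and push each record to the sink."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = line.rstrip("\n")
            if record:
                # Stand-in for a producer send to a message queue topic.
                sink.put((str(path), record))
                count += 1
    return count


def load_concurrently(paths, sink, workers=4):
    """Load several file sources in parallel; return the total records sent."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda p: load_file(p, sink), paths))


def demo():
    """Create three small sample files and load them concurrently."""
    sink = queue.Queue()  # hypothetical stand-in for a Kafka topic
    with tempfile.TemporaryDirectory() as d:
        paths = []
        for i in range(3):
            p = Path(d) / f"source_{i}.log"
            lines = "\n".join(f"src{i}-rec{j}" for j in range(5)) + "\n"
            p.write_text(lines, encoding="utf-8")
            paths.append(p)
        total = load_concurrently(paths, sink)
    return total, sink.qsize()
```

Calling `demo()` returns the number of records read and the number of records that reached the sink. In a real deployment the `sink.put` call would be replaced by a send through a Kafka producer client, and the worker count would be tuned to the number of file sources and available I/O bandwidth.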