Computer Science, 2017, Vol. 44, Issue (5): 172-177. doi: 10.11896/j.issn.1002-137X.2017.05.031


Optimization on Distributed Stream Data Loading and Querying

YI Jia, XUE Chen and WANG Shu-peng   

  • Online: 2018-11-13   Published: 2018-11-13

Abstract: Distributed stream query is a real-time query computation method based on data streams that has attracted wide attention and developed rapidly in recent years. This paper summarizes research results on distributed stream processing frameworks for real-time relational queries and compares several products in depth, including distributed data loading frameworks, distributed stream computing frameworks and distributed stream query systems. It proposes a distributed stream query model based on Spark Streaming and Apache Kafka, and designs a fast data loading technique based on a virtual memory file system that loads data one time faster than Apache Flume (that is, at roughly twice the speed). On top of Spark Streaming, a distributed stream query interface based on Spark SQL is implemented, and a method for parsing SQL queries is proposed to support distributed queries over data streams. The experimental results demonstrate that, for complex SQL queries, the custom hand-written SQL parsing method has clear advantages.

Key words: Big data, Stream processing system, Distributed stream query, Query optimization, Kafka fast loading
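
The query model summarized in the abstract (a stream cut into small batches, with a relational query evaluated on each batch) can be illustrated with a minimal single-process sketch. This is not the authors' implementation and uses neither Spark Streaming nor Kafka: a plain Python list stands in for the message stream, the standard-library sqlite3 module stands in for Spark SQL, and the names `micro_batches`, `query_stream`, the `events` table and its `host`/`bytes` columns are all hypothetical.

```python
import sqlite3
from typing import Iterable, Iterator, List, Tuple

# Hypothetical record layout: (host, bytes). Not taken from the paper.
Record = Tuple[str, int]


def micro_batches(records: Iterable[Record], batch_size: int) -> Iterator[List[Record]]:
    """Cut a record stream into fixed-size micro-batches."""
    batch: List[Record] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly shorter, batch
        yield batch


def query_stream(records: Iterable[Record], sql: str, batch_size: int = 3) -> List[list]:
    """Run the same SQL statement against every micro-batch in turn."""
    # Assumption: an in-memory SQLite table plays the role of the
    # per-batch relational view that a stream SQL engine would expose.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (host TEXT, bytes INTEGER)")
    results = []
    for batch in micro_batches(records, batch_size):
        conn.execute("DELETE FROM events")  # each batch is queried in isolation
        conn.executemany("INSERT INTO events VALUES (?, ?)", batch)
        results.append(conn.execute(sql).fetchall())
    conn.close()
    return results


if __name__ == "__main__":
    stream = [("a", 10), ("b", 5), ("a", 7), ("b", 1), ("a", 2)]
    sql = "SELECT host, SUM(bytes) FROM events GROUP BY host ORDER BY host"
    for batch_result in query_stream(stream, sql):
        print(batch_result)
```

Each micro-batch yields its own result set, so a five-record stream with a batch size of three produces two result sets. A real deployment would distribute both the batches and the query plan across a cluster, which this sketch does not attempt.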

