计算机科学 ›› 2016, Vol. 43 ›› Issue (8): 142-147.doi: 10.11896/j.issn.1002-137X.2016.08.030

• 软件与数据库技术 • 上一篇    下一篇

流数据Top-K关键字查询算法

郑诗敏,秦小麟,刘亮,周倩   

  1. 南京航空航天大学计算机科学与技术学院 南京210016,南京航空航天大学计算机科学与技术学院 南京210016,南京航空航天大学计算机科学与技术学院 南京210016,南京航空航天大学计算机科学与技术学院 南京210016
  • 出版日期:2018-12-01 发布日期:2018-12-01
  • 基金资助:
    本文受国家自然科学基金项目(61373015,61300052),江苏高校优势学科建设工程资助

Algorithm for Top-K Keyword Query in Data Streams

ZHENG Shi-min, QIN Xiao-lin, LIU Liang and ZHOU Qian   

  • Online:2018-12-01 Published:2018-12-01

摘要: 基于Spark Streaming计算框架的分布式Top-K关键字查询是统计流数据中所有关键字的热点研究问题。多数研究通过限定存储空间来实现Top-K关键字查询,并假设关键字集合已知。针对这个问题,提出一种可应用于关键字集合未知情况的分布式Top-K关键字查询算法,根据监测到的关键字动态地调整存储空间,通过更新策略的优化提升其精度。实验结果表明,该算法的性能在关键字集合未知的情况下比现有算法更优。

关键词: Top-K关键字查询,流数据,云计算,Spark Streaming

Abstract: Distributed Top-K keyword query based on the framework of Spark Streaming is a hot research issue.It is used to count all keywords in data streams.Most studies of Top-K keyword query limit storage space and assume that the keywords set is known.To solve this problem,we presented a distributed Top-K keyword query algorithm which can be used in cases where the keywords set is unknown.This algorithm dynamically adjusts the size of storage space according to monitored keywords and optimizes the updated strategy to improve precision.Experimental results show that the proposed algorithm under the condition of unknown keywords set has better performance.

Key words: Top-K keyword query,Data streams,Cloud computing,Spark streaming

[1] Chen L,Cong G,Cao X,et al.Temporal spatial-keyword top-k publish/subscribe[C]∥2015 IEEE 31st International Confe-rence on Data Engineering.IEEE,2015:255-266
[2] Zheng K,Su H,Zheng B,et al.Interactive top-k spatial keyword queries[C]∥2015 IEEE 31st International Conference on Data Engineering.IEEE,2015:423-434
[3] Charikar M,Chen K,Farach-Colton M.Finding Frequent Items in Data Streams[J].Theoretical Computer Science,2004,2(1):1530-1541
[4] Metwally A,Agrawal D,Abbadi A E.Efficient Computation of Frequent and Top-k Elements in Data Streams[C]∥International Conference on Database Theory.Springer-Verlag,2005:398-412
[5] Zaharia M,Das T,Li H,et al.Discretized streams:fault-tolerant streaming computation at scale[C]∥Twenty-Fourth ACM Symposium on Operating Systems Principles.2013:423-438
[6] Dean J,Ghemawat S.MapReduce:simplified data processing on large clusters[J].Commun.ACM (CACM), 2008,51(1):107-113
[7] Ci Xiang,Ma You-zhong,Meng Xiao-feng.Method for top-Kquery on big data in cloud[J].Journal of Software,2014,5(4):813-825(in Chinese) 慈祥,马友忠,孟小峰.一种云环境下的大数据Top-K查询方法[J].软件学报,2014,5(4):813-825
[8] Song Jie,Hao Wen-ning,Chen Gang,et al.Research of Distributed ETL Architecture Based on MapReduce[J].Computer Science,2013,40(6):152-154(in Chinese) 宋杰,郝文宁,陈刚,等.基于MapReduce的分布式ETL体系结构研究[J].计算机科学,2013,40(6):152-154
[9] Manku G,Motwani R.Approximate frequency counts over data streams[C]∥Proceedings of the 28th International Conference on Very Large Data Bases.2002:346-357
[10] Demaine E D,Lopez-Ortiz A,Munro J I.Frequency estimation of internet packet streams with limited space[C]∥Proceedings of the 10th Annual European Symposium on Algorithms.2002:348-360
[11] Cormode G,Muthukrishnan S.What’s hot and what’s not:tracking most frequent items dynamically[J].TODS,2005,30(1):249-278
[12] Estan C,Varghese G.New directions in traffic measurement and accounting:Focusing on the elephants,ignoring the mice[J].ACM Trans.Comput.Syst.,2003,21(3):270-313
[13] Manerikar N,Palpanas T.Frequent items in streaming data:An experimental evaluation of the state-of-the-art[J].DKE,2009,68(4):415-430
[14] Agarwal P K,Cormode G,Huang Z,et al.Mergeable summaries[C]∥PODS.2012:23-34
[15] Frequent Itemset Mining Dataset Repository,University of Helsinki,2008.http://fimi.cs.helsinki.fi/data
[16] Kim J M,Park Y T.Scalable OWL-Horst ontology reasoning using SPARK[C]∥International Conference on Big Data and Smart Computing.IEEE,2015:79-86
[17] Armbrust M,Xin R S,Lian C,et al.Spark sql:Relational data processing in spark[C]∥Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data.ACM,2015:1383-1394
[18] Lin C Y,Tsai C H,Lee C P,et al.Large-scale logistic regression and linear support vector machines using Spark[C]∥2014 IEEE International Conference on Big Data.IEEE,2014:519-528
[19] Akgün B.Streaming Linear Regression on Spark MLlib and MOA[C]∥Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015.ACM,2015:1244-1247

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!