计算机科学 ›› 2017, Vol. 44 ›› Issue (3): 10-15.doi: 10.11896/j.issn.1002-137X.2017.03.003

• 云计算 • 上一篇    下一篇

云环境下的突发关键字查询算法

郑诗敏,秦小麟,刘亮,周倩   

  1. 南京航空航天大学计算机科学与技术学院 南京211106,南京航空航天大学计算机科学与技术学院 南京211106,南京航空航天大学计算机科学与技术学院 南京211106,南京航空航天大学计算机科学与技术学院 南京211106
  • 出版日期:2018-11-13 发布日期:2018-11-13
  • 基金资助:
    本文受国家自然科学基金项目(61373015,61300052),江苏高校优势学科建设工程资助

Algorithm for Bursty Term Query in Cloud Computing

ZHENG Shi-min, QIN Xiao-lin, LIU Liang and ZHOU Qian   

  • Online:2018-11-13 Published:2018-11-13

摘要: 基于Spark Streaming计算框架下的分布式突发关键字查询是监测流数据中关键字突发时间的热点研究问题。多数研究方法存储统计所有的关键字,并未考虑热点关键字。在数据呈爆炸式增长的背景下,获取热点关键字的突发时间更具有价值。针对这个问题,提出一种分布式突发关键字查询算法,该算法采用动态的更新策略,通过设置检查点的方法提取热点关键字,并在线性的时间内查询突发的时间范围。实验结果表明,该算法的性能比现有算法更优。

关键词: 云计算,突发关键字查询,流数据,Spark Streaming

Abstract: Distributed bursty term query under the framework of Spark Streaming is a hot research issue.It aims to detect bursty terms in data streams.Most studies of bursty term query count and save all terms without consideration of hot terms.Under the background of exploding in the data scale,it makes more sense to get bursty time of hot terms.To solve this problem,we presented a distributed bursty term query algorithm.The algorithm uses dynamic update strategy and a checkpoint mechanism to extract hot terms.Also it finds the bursty time range in linear time.Experimental results show that the proposed algorithm has better performance.

Key words: Cloud computing,Bursty term query,Data streams,Spark Streaming

[1] XIE R Q,ZHU F D,MA H,et al.A Real-time Online Observatory for Bursty and Viral Events[J].PVLDB,2014,7(13):1637-1640.
[2] YAMAMOTO Y,IWANUMA K,FUKUDA S.Resource-oriented approximation for frequent itemset mining from bursty data streams[M].SIGMOD,2014:205-216.
[3] LAPPAS T,VIEIRA M R,GUNOPULOS D,et al.On The Spatiotemporal Burstiness of Terms[J].PVLDB,2012,5(9):836-847.
[4] LAPPAS T,ARAI B,PLATAKIS M,et al.On burstiness-aware search for document sequences[C]∥KDD.2009:477-486.
[5] ZAHARIA V,DAS T,LI H Y,et al.Discretized streams:fault-tolerant streaming computation at scale[C]∥SOSP.2013:423-438.
[6] DEAN J,GHEMAWAT S.MapReduce:simplified data proces-sing on large clusters[J].Commun.ACM (CACM),2008,51(1):107-113.
[7] XIANG L H,CHEN Y W,ZHANG Y L.High Order MatrixMultiplication by Mapreduce Algorithm Based on Hadoop Platform[J].Computer Science,2013,40(S1):96-98.(in Chinese) 向林泓,陈芋文,张昱琳.基于Hadoop平台的高阶矩阵相乘MapReduce算法研究[J].计算机科学,2013,40(S1):96-98.
[8] YAN Y L,DONG Y H,HE X M,et al.FSMBUS:A Frequent Subgraph Mining Algorithm in Single Large-Scale Graph Using Spark[J].Journal of Computer Research and Development,2015,52(8):1768-1783.(in Chinese) 严玉良,董一鸿,何贤芒,等.FSMBUS:一种基于Spark的大规模频繁子图挖掘算法[J].计算机研究与发展,2015,52(8):1768-1783.
[9] MANKU G S,MOTWANI R.Approximate Frequency Counts over Data Streams[J].Proceedings of the VLDB Endowment,2003,5(1):1699.
[10] DEMAINE E D,LPEZ-ORTIZ A,MUNRO J I.Frequency Estimation of Internet Packet Streams with Limited Space[C]∥ESA.2002:348-360.
[11] KARP R M,SHENKER S,PAPADIMTRIOU C H.A simple algorithm for finding frequent elements in streams and bags[J].ACM Trans.Database Syst.(TODS),2003,28:51-55.
[12] MISRA J,GRIES D.Finding Repeated Elements[J].Science of Computer Programming,1982,2(2):143-152.
[13] METWALLY A,AGRAWAL D,ABBADI A E.Efficient Computation of Frequent and Top-k Elements in Data Streams[C]∥ ICDT.2005:398-412.
[14] CHARIKAR M,CHEN K,FARACH-COLTON M.Finding Fre-quent Items in Data Streams[J].Theoretical Computer Science,2002,2(1):3-15.
[15] CORMODE G,MUTHUKRISHNAN S.What’s hot and what’snot:tracking most frequent items dynamically[J].ACM Trans.Database Syst.(TODS),2005,30(1):249-278.
[16] CORMODE G,MUTHUKRISHNAN S.An improved datastream summary:the count-min sketch and its applications[J].J.Algorithms (JAL),2005,55(1):58-75.
[17] KLEINBERG J.Bursty and hierarchical structure in streams[C]∥Proceeding of the Eighth ACM SIGKDD International Confe-rence on Knowledge Discovery and Data Mining.2002:91-101.
[18] PUI G, FUNG C,YU J X,et al.Parameter Free Bursty Events Detection in Text Streams[C]∥VLDB.2005:181-192.
[19] HE Q,CHANG K Y,LIM E P.Analyzing feature trajectories for event detection[C]∥SIGIR.2007:207-214.
[20] VLACHOS M,MEEK C,VAGENA Z Z,et al.Identifying Similarities,Periodicities and Bursts for Online Search Queries[C]∥SIGMOD.2004:131-142.
[21] LAPPAS T,ARAI B,PLATAKIS M,et al.On burstiness-aware search for document sequences[C]∥ACM SIGKD International Conference on Knowledge Discovery and Data Mining.Paris,France,2009:477-486.
[22] DOBKIN D P,EPPSTEIN D.Computing the Discrepancy[C]∥Symposium on Computational Geometry.1993:47-52.
[23] ARLITT M,JIN T.World Cup Web Site Access Logs.http://ita.ee.lbl.gov/html/contrib/WorldCup.html.

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 雷丽晖,王静. 可能性测度下的LTL模型检测并行化研究[J]. 计算机科学, 2018, 45(4): 71 -75 .
[2] 孙启,金燕,何琨,徐凌轩. 用于求解混合车辆路径问题的混合进化算法[J]. 计算机科学, 2018, 45(4): 76 -82 .
[3] 张佳男,肖鸣宇. 带权混合支配问题的近似算法研究[J]. 计算机科学, 2018, 45(4): 83 -88 .
[4] 伍建辉,黄中祥,李武,吴健辉,彭鑫,张生. 城市道路建设时序决策的鲁棒优化[J]. 计算机科学, 2018, 45(4): 89 -93 .
[5] 史雯隽,武继刚,罗裕春. 针对移动云计算任务迁移的快速高效调度算法[J]. 计算机科学, 2018, 45(4): 94 -99 .
[6] 周燕萍,业巧林. 基于L1-范数距离的最小二乘对支持向量机[J]. 计算机科学, 2018, 45(4): 100 -105 .
[7] 刘博艺,唐湘滟,程杰仁. 基于多生长时期模板匹配的玉米螟识别方法[J]. 计算机科学, 2018, 45(4): 106 -111 .
[8] 耿海军,施新刚,王之梁,尹霞,尹少平. 基于有向无环图的互联网域内节能路由算法[J]. 计算机科学, 2018, 45(4): 112 -116 .
[9] 崔琼,李建华,王宏,南明莉. 基于节点修复的网络化指挥信息系统弹性分析模型[J]. 计算机科学, 2018, 45(4): 117 -121 .
[10] 王振朝,侯欢欢,连蕊. 抑制CMT中乱序程度的路径优化方案[J]. 计算机科学, 2018, 45(4): 122 -125 .