计算机科学 ›› 2020, Vol. 47 ›› Issue (9): 67-73.doi: 10.11896/jsjkx.190800143

• 数据库&大数据&数据科学 • 上一篇    下一篇

不确定时态数据Top-k查询

韦建华, 许建秋   

  1. 南京航空航天大学计算机科学与技术学院 南京210016
  • 收稿日期:2019-08-29 发布日期:2020-09-10
  • 通讯作者: 许建秋(jianqiu@nuaa.edu.cn)
  • 作者简介:jianhua@nuaa.edu.cn
  • 基金资助:
    国家自然科学基金(61972198)

Efficient Top-k Query Processing on Uncertain Temporal Data

WEI Jian-hua, XU Jian-qiu   

  1. Department of Computer Science and Technology,Nanjing University of Aeronautics and Astronautics,Nanjing 210016,China
  • Received:2019-08-29 Published:2020-09-10
  • About author:WEI Jian-hua,born in 1994,postgradua-te.Her main research interests include temporal data and uncertainty.
    XU Jian-qiu,born in 1982,Ph.D,professor,is a member of China Computer Federation.His main research interests include mobile objects,and spatio-temporal data.
  • Supported by:
    Natural Science Fundation of China (61972198).

摘要: 时态数据在医疗、经济和电子商务等领域有着广泛的应用。由于时间的测量技术不精确等因素,时态数据具有不确定性。文中针对该数据进行研究,处理Top-k查询,即返回与查询点相交的k个权值最大的数据,该权值是根据数据权值和相交概率按一定规则组合计算所得。为有效解决该查询问题,提出了一个基于关系模型和辅助结构的2D R-tree结构,其中关系模型用于管理所有区间数据的R-tree,辅助结构用于管理R-tree中每个节点内部数据权值的大小关系。基于该结构,提出了按权值的降序访问数据的查询算法。从根节点开始遍历R-tree,对于与查询点相交的节点,根据辅助结构中存储的信息找到数据权值最大的项,将它确定为下一个访问对象。实验使用数据规模在30万到1000万的合成数据集,以及包括大约320万条的航班信息的真实数据集。在可扩展数据库SECONDO系统下,将所提方法与无索引方法、R-tree和区间树方法在性能上进行比较,并以平均I/O访问次数和CPU时间作为性能的评判指标。实验结果表明,在1000万条的数据规模下,所提方法优于对比方法2~3个数量级。通过将实验返回的k个结果的概率与权值和实际相交数据的概率和权值作比较可以发现,实验返回的k个结果的概率与权值均靠近实际相交数据的概率和权值的最大值,因此所提算法可行且有效。

关键词: top-k, 不确定性, 时态数据

Abstract: Temporal data is widely used in many applications such as medical,economic and e-commerce.The uncertainty is mainly caused by factors such as inaccurate measurement techniques.This paper studies top-k queries over uncertain temporal data.Such a query returns Top-k intervals with the largest scores which are calculated by a function combining the original weight of the data and the probability of intersection with the query data.To answer the query efficiently,this paper proposes a 2D R-tree based on the relational model and auxiliary structures.The relational model is used to manage all intervals,and the auxiliary structure is used to manage the order of the weights of each node in the R-tree.Based on the proposed index structure,a query algorithm for accessing data in descending order by weights is proposed.It traverses the R-tree from the root node.For each node that intersects with the query point,the item with the largest weight in it can be found according to the information stored in the auxiliary structure,and it is determined as the next accessed object.This paper uses synthetic datasets with data sizes ranging from 300000 to 10 million,and a real dataset of a flight information with size of 3.2 million.In the extensible database system SECONDO,the proposed method is compared with the unindexed method,R-tree,and interval tree,and the average I/O access times and CPU time are used as the indicators of the experimental results.The experimental results show that the proposed approach outperforms baseline methods by 2 to 3 orders of magnitude using 10 million intervals.Comparing the probabilities and weights of the k results with the results of all intersecting data,it is found that the probabilities and weights of the k results are close to the maximum value of the actual intersecting data,so the proposed algorithm is feasible and effective.

Key words: Interval data, Top-k, Uncertainty

中图分类号: 

  • TP391
[1] LI R,ZHANG X,ZHOU X,et al.INK:A Cloud-Based System for Efficient Top-k Interval Keyword Search[C]//Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management.New York,2014:2003-2005.
[2] SNODGRASS R T,AHN I.A Taxonomy of Time in Databases.[J].ACM,1985,14(4):236-246.
[3] UYSAL M.Efficiently Processing Queries on Interval-and-Va-lue Tuples in Relational Databases[C]//International Confe-rence on Very Large Data Bases.DBLP,2005:385-396.
[4] ARGE L,VITTER J S.Optimal external memory interval ma-nagement [J].SIAMJ.2003,32(6):1488-1508.
[5] HAMPEL M.Joining Interval Data in Relational Databases[C]//ACM SIGMOD International Conference on Management of Data.DBLP,2004:683-694.
[6] LAYER R M,SKADRON K,ROBINS G,et al.Binary Interval Search:a scalable algorithm for counting interval intersections[J].Bioinformatics,2013,29(1):1-7.
[7] AGARWAL P K,ARGE L,YI K.An optimal dynamic interval stabbing-max data structure[C]//Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms.Vancouver,SODA,2005:803-812.
[8] SFAKIANAKIS G,PATLAKAS I,NTARMOS N,et al.Interval indexing and querying on key-value cloud stores[C]//Proceedings of IEEE 29th International Conference on Data Engineering.Brisbane:IEEE Press,2013:805-816.
[9] XU J Q,LU H.Efficiently answer top-k queries on typed intervals[J].Information Systems,2017,71:164-181.
[10] LI S,WANG H.Dispersing points on intervals [J].Discrete Applied Mathematics,2018,239:106-118.
[11] COWLEY W,PLEXOUSAKIS D.An Interval Algebra for Indeterminate Time[C]//Seventeenth National Conference on Artificial Intelligence.AAAI,2000:470-475.
[12] PLESNIEWICZ G S,VU NGUEN T M,DMITRY M.Queryanswering over fact bases for fuzzy interval ontologies[C]//Proceedings of the 2nd International Workshop on Soft Computing Applications and Knowledge Discovery.Moscow,2016:93-100.
[13] WU W Y,LI Y,NI Z W,et al.Probabilistic interval-valued hesitant fuzzy information aggregation operators and their application to multi-attribute decision making[J].Algorithms,2018,11(8):120-128.
[14] ZHANG S C.Interval-Gap-Based Representation of Temporal Knowledge[J].Journal of Software,1994,5(6):49-54.
[15] LIN J Y,PENG H,XIE J M.Uncertain temporal information representation and the extensions of temporal operation[J].Computer Science,2005,32(8):161-163.
[16] KATERINA P,MARTIN T,MICHAEL B.Supporting Set Op-erations in Temporal-Probabilistic Databases[C]//34th IEEE.
International Conference on Data Engineering(ICDE).2018:1180-1191.
[17] BLANKENAGEL G.External segment trees[J].Algorithmica,1994,12(6):498-532.
[18] IMAI H,ASANO T.Dynamic orthogonal segment intersection search[J].Journal of Algorithms,1987,8(1):1-18.
[19] MCCREIGHT E M.Priority search trees[J].SIAM J,1985,14(2):257-276.
[20] GUY D T MOL R D,BRONSELAER A.Indexing Possibilistic Numerical Data:The Interval B+-tree Approach[M]//Information Processing and Management of Uncertainty in Knowledge-Based Systems.Springer International Publishing,2016:305-316.
[21] MEI W,MENG X,PENG S,et al.A hybrid index for temporal big data[J].China Modern Medicine,2017,72:264-272.
[22] KRIEGEL H P,PÖTKE,MARCO,et al.Managing IntervalsEfficiently in Object-Relational Databases[C]//International Conference on Very Large Data Bases.Morgan Kaufmann Publishers Inc,2000:407-418.
[23] BLIUJUTE R,JENSEN C S,SALTENIS S,et al.R-tree Based Indexing of Now-Relative Bitemporal Data[C]//International Conference on Very Large Data Bases.Morgan Kaufmann Publishers Inc.,1998:345-356.
[24] ZENG X J.Temporal index trees for uncertain temporal information [J].Microcomputer Information,2011(5):236-237.
[25] GÜTING R H,BEHR T,DÜNTGEN C.SECONDO:A plat-form for moving objects database rsearch and for publishing and integrating research implementations[J].IEEE Data Eng,2010,33(2):56-63.
[1] 林潮伟, 林兵, 陈星.
边缘环境下基于模糊理论的科学工作流调度研究
Study on Scientific Workflow Scheduling Based on Fuzzy Theory Under Edge Environment
计算机科学, 2022, 49(2): 312-320. https://doi.org/10.11896/jsjkx.201000102
[2] 穆聪聪, 王一舒, 袁野, 乔百友, 马玉亮.
时序图中Top-k稠密子图查询算法研究
Top-k Densest Subgraphs Search in Temporal Graphs
计算机科学, 2021, 48(10): 152-159. https://doi.org/10.11896/jsjkx.201100005
[3] 杨洁,王国胤,李帅.
基于边界域的邻域知识距离度量模型
Neighborhood Knowledge Distance Measure Model Based on Boundary Regions
计算机科学, 2020, 47(3): 61-66. https://doi.org/10.11896/jsjkx.190500174
[4] Renata WONG.
量子计算与不确定性原理
Uncertainty Principle as Related to Quantum Computation
计算机科学, 2020, 47(1): 40-50. https://doi.org/10.11896/jsjkx.190400432
[5] 杨洁, 王国胤, 张清华, 冯林.
层次粒结构下粗糙模糊集的不确定性度量
Uncertainty Measure of Rough Fuzzy Sets in Hierarchical Granular Structure
计算机科学, 2019, 46(1): 45-50. https://doi.org/10.11896/j.issn.1002-137X.2019.01.007
[6] 郑宏亮, 侯雪辉, 宋笑迎, 庞阔, 邹丽.
基于犹豫模糊可信度的知识推理
Approach for Knowledge Reasoning Based on Hesitate Fuzzy Credibility
计算机科学, 2019, 46(1): 131-137. https://doi.org/10.11896/j.issn.1002-137X.2019.01.020
[7] 许华杰, 吴青华, 胡小明.
基于轨迹多特性的隐私保护算法
Privacy Protection Algorithm Based on Multi-characteristics of Trajectory
计算机科学, 2019, 46(1): 190-195. https://doi.org/10.11896/j.issn.1002-137X.2019.01.029
[8] 夏扬波, 杨文忠, 张振宇, 王庆鹏, 石研.
一种移动无线传感器网络的节点位置预测方法
Node Position Prediction Method for Mobile Wireless Sensor Networks
计算机科学, 2018, 45(8): 113-118. https://doi.org/10.11896/j.issn.1002-137X.2018.08.020
[9] 汪琳娜, 杨新, 杨习贝.
监督邻域粗糙集
Supervised Neighborhood Rough Set
计算机科学, 2018, 45(8): 186-190. https://doi.org/10.11896/j.issn.1002-137X.2018.08.033
[10] 毛莺池,陈杨.
不确定性车辆路口的轨迹预测
Uncertain Vehicle Intersection Trajectory Prediction
计算机科学, 2018, 45(3): 235-240. https://doi.org/10.11896/j.issn.1002-137X.2018.03.037
[11] 戴华, 保静静, 朱向洋, 易训, 杨庚.
面向云环境的一致性可验证单关键词检索方法
Integrity-verifying Single Keyword Search Method in Clouds
计算机科学, 2018, 45(12): 92-97. https://doi.org/10.11896/j.issn.1002-137X.2018.12.014
[12] 魏方圆,黄德才.
基于区间数的多维不确定性数据UID-DBSCAN聚类算法
UID-DBSCAN Clustering Algorithm of Multi-dimensional Uncertain Data Based on Interval Number
计算机科学, 2017, 44(Z11): 442-447. https://doi.org/10.11896/j.issn.1002-137X.2017.11A.094
[13] 赵凡,魏玲.
D型概率决策形式背景下的规则获取
Rule Acquisition of D-type Probabilistic Decision Formal Context
计算机科学, 2017, 44(8): 274-279. https://doi.org/10.11896/j.issn.1002-137X.2017.08.047
[14] 徐俊洁,陈荣.
基于故障关联的多故障概率诊断方法
Probabilistic Diagnosis Approach to Diagnosing Multiple-fault Programs with Fault Correlation
计算机科学, 2017, 44(4): 124-130. https://doi.org/10.11896/j.issn.1002-137X.2017.04.027
[15] 刘德浩,王倩.
一种具有相依关系的二维云推理方法及其在预测中的应用
Uncertainty Reasoning Based on Related Planar Cloud and Application in Prediction
计算机科学, 2016, 43(Z6): 99-102. https://doi.org/10.11896/j.issn.1002-137X.2016.6A.024
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!