Computer Science ›› 2023, Vol. 50 ›› Issue (9): 145-151. doi: 10.11896/jsjkx.230300065

• Database & Big Data & Data Science •

Study on Supervised Learning Model for Optimal Histogram Solution

CHEN Yunliang, LIU Hao, ZHU Guishui, HUANG Xiaohui, CHEN Xiaodao, WANG Lizhe

  1. School of Computer Science, China University of Geosciences (Wuhan), Wuhan 430070, China
  • Received: 2023-03-08 Revised: 2023-06-18 Online: 2023-09-15 Published: 2023-09-01
  • Corresponding author: HUANG Xiaohui (xhhuang@cug.edu.cn)
  • About author: (chenyunliang@cug.edu.cn)
  • Supported by:
    National Natural Science Foundation of China (62076224), National Science Fund for Distinguished Young Scholars of China (41925007), Joint Funds of the National Natural Science Foundation of China (U21A2013) and Open Research Project of the Hubei Key Laboratory of Intelligent Geo-Information Processing (KLIGIP-2022B16)

Study on Supervised Learning Model for Optimal Histogram Solution

CHEN Yunliang, LIU Hao, ZHU Guishui, HUANG Xiaohui, CHEN Xiaodao, WANG Lizhe   

  1. School of Computer Science,China University of Geosciences,Wuhan 430070,China
  • Received:2023-03-08 Revised:2023-06-18 Online:2023-09-15 Published:2023-09-01
  • About author:CHEN Yunliang,born in 1979,professor,Ph.D supervisor,is a member of China Computer Federation.His main research interests include cloud computing and data mining.
    HUANG Xiaohui,born in 1994,lecturer,is a member of China Computer Federation.His main research interests include geoscience data management,processing and analysis.
  • Supported by:
    National Natural Science Foundation of China(62076224), National Science Fund for Distinguished Young Scholars of China(41925007), Joint Funds of the National Natural Science Foundation of China(U21A2013) and Open Research Project of The Hubei Key Laboratory of Intelligent Geo-Information Processing(KLIGIP-2022B16).

Abstract: The optimal histogram is an important class of histogram techniques, but the dynamic programming binning algorithm currently used to compute it suffers from excessively high time complexity. A supervised learning model based on ProbSparse self-attention is therefore proposed to learn the dynamic programming binning algorithm and serve as its substitute. The model consists of three parts: 1) an embedding layer and a positional encoding layer map the numerical input sequence to a corresponding vector sequence; 2) a ProbSparse self-attention layer captures the dependencies within the input sequence; 3) a feed-forward neural network layer maps these dependencies to the index information of the bucket boundaries. Experimental results show that the model achieves an accuracy above 83.47% on six datasets, and that its time cost in the prediction stage is no more than one third of that of the dynamic programming binning algorithm.
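The ProbSparse self-attention layer named above comes from Informer [25]. As a rough NumPy sketch of its core idea only, assuming the query sparsity measurement is computed exactly rather than from sampled keys as in the full algorithm (the function name and shapes are illustrative, not from the paper):

```python
import numpy as np

def probsparse_attention(Q, K, V, u):
    """Sketch: only the u queries with the largest sparsity measurement
    M(q) = max_j s_qj - mean_j s_qj attend via softmax; the remaining
    "lazy" queries fall back to the mean of V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (n_q, n_k) scaled dot products
    M = scores.max(axis=1) - scores.mean(axis=1)    # sparsity measurement per query
    top = np.argsort(M)[-u:]                        # indices of the u "active" queries
    out = np.tile(V.mean(axis=0), (Q.shape[0], 1))  # default output: mean of values
    s = scores[top]
    w = np.exp(s - s.max(axis=1, keepdims=True))    # numerically stable softmax
    w /= w.sum(axis=1, keepdims=True)
    out[top] = w @ V                                # full attention for active queries
    return out
```

Restricting the softmax to the u dominant queries is what lets this layer scale to long input sequences while still capturing the strongest pairwise dependencies.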

Keywords: Optimal histogram, Dynamic programming binning algorithm, Supervised learning model

Abstract: The dynamic programming binning algorithm is currently used to compute the optimal histogram, but its time complexity is too high. This paper proposes a supervised learning model based on ProbSparse self-attention that learns the dynamic programming binning algorithm and can serve as an alternative to it. The proposed model consists of three parts: 1) an embedding layer and a positional encoding layer map the numerical input sequence to a corresponding vector sequence; 2) a ProbSparse self-attention layer captures the dependencies within the input sequence; 3) a feed-forward neural network layer maps these dependencies to the index information of the bucket boundaries. Experimental results on six datasets show that the accuracy of the proposed model exceeds 83.47%, while its time cost in the prediction stage is no more than one third of that of the dynamic programming binning algorithm.
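The dynamic programming binning algorithm itself is not spelled out on this page. A minimal sketch of one common formulation, assuming (as in V-optimal histograms, cf. [5-6]) that a bucket's cost is its within-bucket sum of squared errors; the function name and return convention are illustrative, not from the paper:

```python
def optimal_histogram(values, k):
    """Partition `values` into k contiguous buckets minimizing the total
    within-bucket sum of squared errors (SSE); returns the interior
    boundary indices and the optimal cost."""
    n = len(values)
    ps = [0.0] * (n + 1)   # prefix sums of values
    ps2 = [0.0] * (n + 1)  # prefix sums of squared values
    for i, v in enumerate(values):
        ps[i + 1] = ps[i] + v
        ps2[i + 1] = ps2[i] + v * v

    def sse(i, j):  # O(1) cost of bucket values[i:j]
        s, m = ps[j] - ps[i], j - i
        return (ps2[j] - ps2[i]) - s * s / m

    INF = float("inf")
    cost = [[INF] * (n + 1) for _ in range(k + 1)]  # cost[b][j]: first j values, b buckets
    cut = [[0] * (n + 1) for _ in range(k + 1)]     # cut[b][j]: start of the b-th bucket
    cost[0][0] = 0.0
    for b in range(1, k + 1):
        for j in range(b, n + 1):
            for i in range(b - 1, j):
                c = cost[b - 1][i] + sse(i, j)
                if c < cost[b][j]:
                    cost[b][j], cut[b][j] = c, i
    bounds, j = [], n                     # walk the cut table backwards
    for b in range(k, 0, -1):
        j = cut[b][j]
        bounds.append(j)
    return sorted(bounds[:-1]), cost[k][n]  # drop the leading 0 boundary
```

The triple loop makes this O(n²k), which is the overhead the learned model is meant to sidestep: a trained network predicts the boundary indices in a single forward pass.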

Key words: Optimal histogram, Dynamic programming binning algorithm, Supervised learning model

CLC Number:

  • TP391
[1]IOANNIDIS Y.The history of histograms(abridged) [C]//Proceedings 2003 VLDB Conference.Elsevier,2003:19-30.
[2]IOANNIDIS Y E,POOSALA V.Balancing histogram optimality and practicality for query result size estimation [J].ACM Sigmod Record,1995,24(2):233-244.
[3]NAVAS-PALENCIA G.Optimal binning:mathematical programming formulation [J].arXiv:2001.08025,2020.
[4]IOANNIDIS Y E.Universality of serial histograms [C]//Proceedings of the VLDB.1993:256-267.
[5]JAGADISH H V,KOUDAS N,MUTHUKRISHNAN S,et al.Optimal histograms with quality guarantees [C]//Proceedings of the VLDB.1998:24-27.
[6]BELLMAN R.On the approximation of curves by line segments using dynamic programming [J].Communications of the ACM,1961,4(6):284.
[7]GUHA S,KOUDAS N,SHIM K.Approximation and streaming algorithms for histogram construction problems [J].ACM Transactions on Database Systems(TODS),2006,31(1):396-438.
[8]KAUSHIK R,SUCIU D.Consistent histograms in the presence of distinct value counts[J].Proceedings of the VLDB Endowment,2009,2(1):850-861.
[9]GREENWALD M,KHANNA S.Space-efficient online computation of quantile summaries[J].ACM SIGMOD Record,2001,30(2):58-66.
[10]FANG M,SHIVAKUMAR N,GARCIA-MOLINA H,et al.Computing Iceberg Queries Efficiently[C]//Proceedings of the 1998 VLDB Conference.New York:Citeseer,1998.
[11]MIRONCHYK P,TCHISTIAKOV V.Monotone optimal binning algorithm for credit risk modeling [J/OL].https://www.researchgate.net/profile/Viktor-Tchistiakov/publication/322520135_Monotone_optimal_binning_algorithm_for_credit_risk_modeling/links/5a5dd1a8458515c03edf9a97/Monotone-optimal-binning-algorithm-for-credit-risk-modeling.pdf.
[12]SHEVERTALOV M,STEHLE E,MANCORIDIS S.A genetic algorithm for solving the binning problem in networked applications detection [C]// Proceedings of the 2007 IEEE Congress on Evolutionary Computation.IEEE,2007:713-720.
[13]POOSALA V,HAAS P J,IOANNIDIS Y E,et al.Improved histograms for selectivity estimation of range predicates [J].ACM Sigmod Record,1996,25(2):294-305.
[14]IOANNIDIS Y E,KANG Y.Randomized algorithms for optimizing large join queries [J].ACM Sigmod Record,1990,19(2):312-321.
[15]GUHA S,KOUDAS N,SHIM K.Data-streams and histograms [C]//Proceedings of the Thirty-third Annual ACM Symposium on Theory of Computing.2001:471-475.
[16]GUHA S,INDYK P,MUTHUKRISHNAN S,et al.Histogramming data streams with fast per-item processing [C]//Proceedings of the International Colloquium on Automata,Languages,and Programming.Springer,2002:681-692.
[17]GUHA S,KOUDAS N.Approximating a data stream for querying and estimation:Algorithms and performance evaluation[C]//Proceedings 18th International Conference on Data Engineering.IEEE,2002:567-576.
[18]GUHA S,SHIM K,WOO J.REHIST:Relative error histogram construction algorithms [C]//Proceedings of the VLDB.2004:300-311.
[19]NOMER H A,ALNOWIBET K A,ELSAYED A,et al.Neural knapsack:A neural network based solver for the knapsack problem[J].IEEE Access,2020,8:224200-224210.
[20]ZAREMBA W,SUTSKEVER I.Learning to execute [J].arXiv:1410.4615,2014.
[21]GRAVES A,WAYNE G,DANIHELKA I.Neural Turing machines[J].arXiv:1410.5401,2014.
[22]CHEN Y,LI V O,CHO K,et al.A stable and effective learning strategy for trainable greedy decoding [J].arXiv:1804.07915,2018.
[23]KOEHN P.Pharaoh:a beam search decoder for phrase-based statistical machine translation models[C]//Proceedings of the Conference of the Association for Machine Translation in the Americas.Springer,2004:115-124.
[24]GRAVES A,WAYNE G,REYNOLDS M,et al.Hybrid computing using a neural network with dynamic external memory [J].Nature,2016,538(7626):471-476.
[25]ZHOU H,ZHANG S,PENG J,et al.Informer:Beyond efficient transformer for long sequence time-series forecasting [C]//Proceedings of the AAAI Conference on Artificial Intelligence.2021:11106-11115.
[26]VASWANI A,SHAZEER N,PARMAR N,et al.Attention is all you need [J].arXiv:1706.03762,2017.
[27]QIU J,MA H,LEVY O,et al.Blockwise self-attention for long document understanding [J].arXiv:1911.02972,2019.
[28]LI S,JIN X,XUAN Y,et al.Enhancing the locality and breaking the memory bottleneck of transformer on time series forecasting [J].arXiv:1907.00235,2019.
[29]DAI Z,YANG Z,YANG Y,et al.Transformer-XL:Attentive language models beyond a fixed-length context [J].arXiv:1901.02860,2019.
[30]ELEKES Á,ENGLHARDT A,SCHÄLER M,et al.Towardmeaningful notions of similarity in NLP embedding models [J].International Journal on Digital Libraries,2020,21(2):109-128.
[31]KE G,HE D,LIU T Y.Rethinking positional encoding in language pre-training [J].arXiv:2006.15595,2020.