Computer Science, 2015, Vol. 42, Issue (6): 193-203. doi: 10.11896/j.issn.1002-137X.2015.06.042

• Software and Database Technology •

HY-COCA: A New Method for Correlation Detection under Two-Dimensional Hybrid Data Distributions

CAO Wei, WANG Qiu-yue, QIN Xiong-pai, WANG Shan

  1. School of Information, Renmin University of China, Beijing 100872
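  • Publication date: 2018-11-14 Release date: 2018-11-14
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61202331, 3) and the Open Research Fund of the State Key Laboratory of Software Engineering (SKLSE2012-09-33).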

HY-COCA: A Hybrid-data-distribution-aware Way to Detect Correlation over Bi-dimensional Data Space

CAO Wei, WANG Qiu-yue, QIN Xiong-pai and WANG Shan   

  • Online:2018-11-14 Published:2018-11-14

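Abstract (Chinese version, translated): A hybrid data distribution is one in which different regions of the distribution follow different specific distributions. For example, between the two attributes sales amount and region, the two are approximately independent in the lower range of sales amounts, while in the higher range they exhibit an approximate functional dependency. Existing research on detecting data correlation focuses on giving a single overall measure of two-dimensional correlation and cannot detect the specific correlations of sub-regions. In statistical analysis, sub-regions with such specific correlations carry richer statistical meaning and deserve attention. This paper studies and proposes HY-COCA, a new method for detecting data correlation in the presence of such hybrid data distributions. Building on the entropy correlation coefficient, the method narrows the search space of sub-regions and reduces complexity compared with the Naive method. HY-COCA also addresses how to discriminate correlation differences between sub-regions and how to present the results. Experiments on generated data and on benchmark data verify the effectiveness of the method.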

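Keywords (Chinese version, translated): data distribution, hybrid data distribution, correlation, data distribution regions, correlation difference score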

Abstract: A hybrid data distribution between two attributes is one in which different data sub-regions exhibit different correlations. For example, in a distribution of sale amounts across cities, the two attributes are nearly independent at lower sale amounts but show a soft functional dependency at higher sale amounts. Previous research on automatic detection of association focused on deriving a single overall measure of association over a two-dimensional distribution and could not address the hybrid data distribution problem. In statistical analysis, sub-regions with such particular data associations are worth attention. This paper proposes HY-COCA, a new method that detects data associations both globally and locally and finds the sub-regions with special data associations. Experiments on both synthetic and benchmark data verify the effectiveness of HY-COCA.

Key words: Data distribution, Hybrid data distribution, Data association, Sub-regions in data distribution, Differentiating score of association
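
The sales-amount example from the abstract can be illustrated with a small sketch. It is not the HY-COCA algorithm and does not reproduce the paper's definition of the entropy correlation coefficient, which is not given on this page. As an assumption, it uses a normalized mutual information, 2·I(X;Y) / (H(X) + H(Y)), as a stand-in score, and it builds a hypothetical synthetic data set (`make_hybrid`) whose low-amount region is independent of the city and whose high-amount region is a function of it.

```python
import math
import random
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a frequency table."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def ecc(pairs):
    """Normalized mutual information 2*I(X;Y) / (H(X) + H(Y)), in [0, 1].

    Assumption: a stand-in for an entropy-based correlation score,
    not the definition used in the paper.
    """
    hx = entropy(Counter(x for x, _ in pairs))
    hy = entropy(Counter(y for _, y in pairs))
    hxy = entropy(Counter(pairs))
    if hx + hy == 0:
        return 0.0
    return 2 * (hx + hy - hxy) / (hx + hy)


def make_hybrid(n_per_region=5000, seed=0):
    """Hypothetical hybrid data: (amount_bucket, city) pairs in two regions."""
    rng = random.Random(seed)
    cities = list(range(8))
    # Low amounts (buckets 0..7): city is drawn independently of the amount.
    low = [(rng.randrange(0, 8), rng.choice(cities)) for _ in range(n_per_region)]
    # High amounts (buckets 8..15): city is fully determined by the amount.
    high = [(a, a % 8) for a in (rng.randrange(8, 16) for _ in range(n_per_region))]
    return low, high


low, high = make_hybrid()
print("low region :", round(ecc(low), 3))
print("high region:", round(ecc(high), 3))
print("overall    :", round(ecc(low + high), 3))
```

Under these assumptions the low region scores close to 0, the high region scores 1, and the overall score falls in between, which is the situation the abstract describes: a single global measure hides the two very different sub-regions.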

