计算机科学 ›› 2019, Vol. 46 ›› Issue (10): 195-201.doi: 10.11896/jsjkx.180901748

• 软件与数据库技术 • 上一篇    下一篇

基于冲突依赖消除的网络表格外键检测算法

王佳敏, 王宁   

  1. (北京交通大学计算机与信息技术学院 北京100044)
  • 收稿日期:2018-09-16 修回日期:2019-03-06 出版日期:2019-10-15 发布日期:2019-10-21
  • 通讯作者: 王宁(1967-),女,博士,教授,CCF会员,主要研究方向为Web数据集成、大数据管理与分析、数据挖掘,E-mail:nwang@bjtu.edu.cn。
  • 作者简介:王佳敏(1994-),女,硕士生,CCF会员,主要研究方向为Web数据集成、数据挖掘,E-mail:16120423@bjtu.edu.cn。
  • 基金资助:
    本文受国家重点研发计划(2018YFC0809800),中央高校基本科研业务费专项资金(2017YJS065)资助。

Foreign Key Detection Algorithm for Web Tables Based on Conflict Dependency Elimination

WANG Jia-min, WANG Ning   

  1. (School of Computer and Information Technology,Beijing Jiaotong University,Beijing 100044,China)
  • Received:2018-09-16 Revised:2019-03-06 Online:2019-10-15 Published:2019-10-21

摘要: 作为数据库中最重要的约束之一,外键关系对数据分析与集成有着重要的意义。大量的网络表格缺乏显式指定的外键,但外键关系对于理解和利用网络表格至关重要。目前的研究工作主要集中于对属性间包含依赖的查找,一些传统关系表格上的外键关系检测方法无法解决网络表格的异构性而产生的大量冲突外键。综合考虑网络表格间的冲突依赖,提出了一种基于冲突依赖消除的网络表格外键检测算法。首先提出冲突依赖的概念,据此对候选外键关系建立包含依赖图;然后构建包含依赖图的层结构,并给出候选外键关系的强度定义;最后在逐层消除冲突依赖的基础上,筛选出真正的外键关系。为验证算法的有效性,实验数据集分别选择了具有完整模式规范的WIKI数据集,以及缺少模式信息的DWTC数据集和WDC数据集。基于以上数据集,将提出的算法与其他两种外键检测方法进行精确率、召回率以及F值的对比。实验结果表明,提出的算法在WIKI数据集和DWTC数据集上的精确率、召回率和F值均高于现有算法;在最新的大型网络表格数据集WDC中,所提算法的精确率、召回率和F值高达0.89,0.88和0.89,且大大优于其他算法。因此,与现有的方法相比,所提算法更适用于网络表格,同时具备更高的精确率、召回率以及F值。

关键词: 冲突依赖, 外键, 网络表格, 约束

Abstract: As one of the most important constraints in databases,foreign key relationships between tables are crucial for data integration and analysis.For large amount of web tables,foreign keys are not specified in most cases.Therefore,detection of foreign key becomes a significant step in understanding and utilizing the web tables.Current researches mainly focus on the search for inclusion dependencies between attributes,and foreign key detection methods on traditional relational tables cannot solve the problem about large number of conflicting foreign keys due to the heterogeneity of web tables.Considering the conflict dependency between web tables,a foreign key detection algorithm for web tables based on conflict dependency elimination was proposed.Firstly,the concept of conflict dependency is proposed,and an inclusion dependency graph is established for the candidate foreign key relationship.Subsequently,the layer structure of the inclusion dependency graph is constructed,and the definition for the strength of candidates is given.Finally,based on the layer-by-layer elimination of conflict dependency,true foreign key relationships are obtained.To verify the effectiveness of the algorithm,this paper selected 3 true web table datasets as experimental datasets,which are the WIKI dataset with complete schema specification,the DWTC and WDC datasets without schema information.The proposed algorithm has been compared with the other two foreign key detection methods on above datasets in terms of accuracy,recall and F-measure.The experimental results show that the accuracy,recall andF-measure of the proposed algorithm on WIKI and DWTC are more superior than existing algorithms.The proposed algorithm outperforms other algorithms especially on WDC,the largest and the most up-to-data web table corpus,it’s accuracy,recall and F-measure to 0.89,0.88 and 0.89.Therefore,the proposed foreign key detection algorithm is more suitable for web tables.Compared with the existing methods,it has higher accuracy,recall and F-measure.

Key words: Conflict dependency, Constraint, Foreign key, Web table

中图分类号: 

  • TP391
[1]VENETIS P,HALEVY A,MADHAVAN J,et al.Recovering semantics of tables on the web[J].In Proceedings of the VLDB Endowment,2011,4(9):528-538.
[2]Marco D,ADELFIO,HANAN S.Schema extraction for tabular data on the web[J].In Proceedings of the VLDB Endowment,2013,6(6):421-432.
[3]WANG J J,WANG H X,WANG Z Y,et al.Understanding tables on the web[C]//Proceedings of 31st International Confe-rence ER.2012:141-155.
[4]DENG D,JIANG Y,LI G L,et al.Scalable column concept determination for web tables using large konwledge bases[J].In Proceedings of the VLDB Endowment,2013,6(3):1606-1617.
[5]ZHANG M H,MARIOS H,BENG C O,et al.Automatic discovery of attributes in relational databases[C]//ACM SIGMOD Conference.2011:109-120.
[6]VARISH M,FININ T,JOSHI A.Generating linked data by inferring the semantics of tables[C]//Proc.1st Int.Workshop on Searching and Integrating New Web Data Sources--Very Large Data Search.2011:17-22.
[7]SARMA A D,FANG L,GUPTA N,et al.Finding related tables[C]// Proceedings of the 2012 ACM SIGMOD Int.Conf.on Management of Data.New York:ACM,2012:817-828.
[8]CASANOVA M A,FAGIN R,PAPADIMITRIOU C H.Inclusion dependencies and their interaction with functional depen-dencies[C]//PODS,Los Angeles,United States.1982:171-176.
[9]MARCHI F D,PETIT J M.Zigzag:a new algorithm for mining large inclusion dependencies in databases[C]//ICDM,Melbourne,Florida,USA,2003:27-34.
[10]BAUCKMANN J,LESER U,NAUMANN F,et al.Efficiently detecting inclusion dependencies[C]// International Conference on Data Engineering.Istanbul,Twrkey,2007:1448-1450.
[11]FABIAN T,THORSTEN P,FELIX N.Detecting inclusion dependencies on very many tables[J].ACM Transaction on Database Systems,2017,42(3):1-29.
[12]MARCHI F D,LOPES S,PETIT J M.Unary and n-ary inclusion dependency discovery in relational databases[J].Journal of Intelligent Information Systems,2009,32(1):53-73.
[13]PAPENBROCK T,KRUSE S,NAUMANN F.Divide & con-quer-based inclusion dependency discovery[J].Proceedings of the VLDB Endowment,2015,8(7):774-785.
[14]ROSTIN A,ALBRECHT O,BAUCKMANN J,et al.A ma-chince learning approach to foreign key discovery[C]//12th International Workshop on the Web and Databases.2009.
[15]CHEN Z M,NARASAYYA V,CHAUDHURI S.Fast foreign-key detection in microsoft sql server powerpivot for excel[J].Proceedings of the VLDB Endowment,2014,7(13):1417-1428.
[16]ZHANG M H,HADJIELEFTHERIOU M,OOI B C,et al.On multi-column foreign key discovery[J].Proceedings of the VLDB Endowment,2010,3(1):805-814.
[17]PELEG S,WERMAN M,ROM H.A unified approach to the change of resolution:space and gray-level[J].TPAMI,1989,11(7):739-742.
[18]Web Data Commons-Web Table Corpora[EB/OL].[2018-08-01].http://www.webdatacommons.org/webtables/index.html.
[19]DWTC-Dresden Web Table Corpus[EB/OL].[2018-08-01].https://wwwdb.inf.tu-dresden.de/misc/dwtc/.
[20]WikiTables:Public Site[EB/OL].[2018-08-01].http://dow-ney-n1.cs.northwestern.edu/public.
[1] 傅彦铭, 朱杰夫, 蒋侃, 黄保华, 孟庆文, 周兴.
移动众包中基于多约束工人择优的激励机制研究
Incentive Mechanism Based on Multi-constrained Worker Selection in Mobile Crowdsourcing
计算机科学, 2022, 49(9): 275-282. https://doi.org/10.11896/jsjkx.210700129
[2] 黄璞, 沈阳阳, 杜旭然, 杨章静.
基于局部约束特征线表示的人脸识别
Face Recognition Based on Locality Constrained Feature Line Representation
计算机科学, 2022, 49(6A): 429-433. https://doi.org/10.11896/jsjkx.210300169
[3] 杨辉, 陶力宏, 朱建勇, 聂飞平.
基于锚点的快速无监督图嵌入
Fast Unsupervised Graph Embedding Based on Anchors
计算机科学, 2022, 49(4): 116-123. https://doi.org/10.11896/jsjkx.210200098
[4] 官铮, 邓扬琳, 聂仁灿.
光谱重建约束非负矩阵分解的高光谱与全色图像融合
Non-negative Matrix Factorization Based on Spectral Reconstruction Constraint for Hyperspectral and Panchromatic Image Fusion
计算机科学, 2021, 48(9): 153-159. https://doi.org/10.11896/jsjkx.200900054
[5] 张帆, 宫傲宇, 邓磊, 刘芳, 林艳, 张一晋.
面向实际信道观测环境的时限约束无线下行调度策略
Wireless Downlink Scheduling with Deadline Constraint for Realistic Channel Observation Environment
计算机科学, 2021, 48(9): 264-270. https://doi.org/10.11896/jsjkx.210100143
[6] 徐艺菲, 熊淑华, 孙伟恒, 何小海, 陈洪刚.
基于非局部低秩和自适应量化约束先验的HEVC后处理算法
HEVC Post-processing Algorithm Based on Non-local Low-rank and Adaptive Quantization Constraint Prior
计算机科学, 2021, 48(5): 155-162. https://doi.org/10.11896/jsjkx.200800079
[7] 李笠, 李广鹏, 常亮, 古天龙.
约束进化算法及其应用研究综述
Survey of Constrained Evolutionary Algorithms and Their Applications
计算机科学, 2021, 48(4): 1-13. https://doi.org/10.11896/jsjkx.200600151
[8] 周秋艳, 肖满生, 张龙信, 张晓丽, 杨文理.
多约束条件下生产排程智能优化技术
Intelligent Optimization Technology of Production Scheduling Under Multiple Constraints
计算机科学, 2021, 48(3): 239-245. https://doi.org/10.11896/jsjkx.200300105
[9] 郑建云, 庞建民, 周鑫, 王军.
基于约束推导式的增强型二进制漏洞挖掘
Enhanced Binary Vulnerability Mining Based on Constraint Derivation
计算机科学, 2021, 48(3): 320-326. https://doi.org/10.11896/jsjkx.200700047
[10] 曹波, 陈锋, 成静, 李华, 李永乐.
基于全向路口模型的非结构化道路重复节点路径规划
Route Planning of Unstructured Road Including Repeat Node Based on Bidirectional Search
计算机科学, 2021, 48(11A): 77-80. https://doi.org/10.11896/jsjkx.201200193
[11] 王文博, 罗恒利.
基于图卷积神经网络的完全图人脸聚类
Complete Graph Face Clustering Based on Graph Convolution Network
计算机科学, 2021, 48(11A): 275-277. https://doi.org/10.11896/jsjkx.201200102
[12] 叶亚男, 迟静, 于志平, 战玉丽, 张彩明.
基于改进CycleGan模型和区域分割的表情动画合成
Expression Animation Synthesis Based on Improved CycleGan Model and Region Segmentation
计算机科学, 2020, 47(9): 142-149. https://doi.org/10.11896/jsjkx.190900203
[13] 杨帆, 王俊斌, 白亮.
基于安全性的成对约束扩充算法
Extended Algorithm of Pairwise Constraints Based on Security
计算机科学, 2020, 47(9): 324-329. https://doi.org/10.11896/jsjkx.200700092
[14] 张龙信, 周立前, 文鸿, 肖满生, 邓晓军.
基于异构云计算的成本约束下的工作流能量高效调度算法
Energy Efficient Scheduling Algorithm of Workflows with Cost Constraint in Heterogeneous Cloud Computing Systems
计算机科学, 2020, 47(8): 112-118. https://doi.org/10.11896/jsjkx.200300038
[15] 高方远, 王秀美.
一种基于块对角表示和近邻约束的子空间聚类方法
Subspace Clustering Method Based on Block Diagonal Representation and Neighbor Constraint
计算机科学, 2020, 47(7): 66-70. https://doi.org/10.11896/jsjkx.190600155
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!