计算机科学 ›› 2019, Vol. 46 ›› Issue (10): 195-201.doi: 10.11896/jsjkx.180901748
王佳敏, 王宁
WANG Jia-min, WANG Ning
摘要: 作为数据库中最重要的约束之一,外键关系对数据分析与集成有着重要的意义。大量的网络表格缺乏显式指定的外键,但外键关系对于理解和利用网络表格至关重要。目前的研究工作主要集中于对属性间包含依赖的查找,一些传统关系表格上的外键关系检测方法无法解决网络表格的异构性而产生的大量冲突外键。综合考虑网络表格间的冲突依赖,提出了一种基于冲突依赖消除的网络表格外键检测算法。首先提出冲突依赖的概念,据此对候选外键关系建立包含依赖图;然后构建包含依赖图的层结构,并给出候选外键关系的强度定义;最后在逐层消除冲突依赖的基础上,筛选出真正的外键关系。为验证算法的有效性,实验数据集分别选择了具有完整模式规范的WIKI数据集,以及缺少模式信息的DWTC数据集和WDC数据集。基于以上数据集,将提出的算法与其他两种外键检测方法进行精确率、召回率以及F值的对比。实验结果表明,提出的算法在WIKI数据集和DWTC数据集上的精确率、召回率和F值均高于现有算法;在最新的大型网络表格数据集WDC中,所提算法的精确率、召回率和F值高达0.89,0.88和0.89,且大大优于其他算法。因此,与现有的方法相比,所提算法更适用于网络表格,同时具备更高的精确率、召回率以及F值。
中图分类号:
[1]VENETIS P,HALEVY A,MADHAVAN J,et al.Recovering semantics of tables on the web[J].In Proceedings of the VLDB Endowment,2011,4(9):528-538. [2]Marco D,ADELFIO,HANAN S.Schema extraction for tabular data on the web[J].In Proceedings of the VLDB Endowment,2013,6(6):421-432. [3]WANG J J,WANG H X,WANG Z Y,et al.Understanding tables on the web[C]//Proceedings of 31st International Confe-rence ER.2012:141-155. [4]DENG D,JIANG Y,LI G L,et al.Scalable column concept determination for web tables using large konwledge bases[J].In Proceedings of the VLDB Endowment,2013,6(3):1606-1617. [5]ZHANG M H,MARIOS H,BENG C O,et al.Automatic discovery of attributes in relational databases[C]//ACM SIGMOD Conference.2011:109-120. [6]VARISH M,FININ T,JOSHI A.Generating linked data by inferring the semantics of tables[C]//Proc.1st Int.Workshop on Searching and Integrating New Web Data Sources--Very Large Data Search.2011:17-22. [7]SARMA A D,FANG L,GUPTA N,et al.Finding related tables[C]// Proceedings of the 2012 ACM SIGMOD Int.Conf.on Management of Data.New York:ACM,2012:817-828. [8]CASANOVA M A,FAGIN R,PAPADIMITRIOU C H.Inclusion dependencies and their interaction with functional depen-dencies[C]//PODS,Los Angeles,United States.1982:171-176. [9]MARCHI F D,PETIT J M.Zigzag:a new algorithm for mining large inclusion dependencies in databases[C]//ICDM,Melbourne,Florida,USA,2003:27-34. [10]BAUCKMANN J,LESER U,NAUMANN F,et al.Efficiently detecting inclusion dependencies[C]// International Conference on Data Engineering.Istanbul,Twrkey,2007:1448-1450. [11]FABIAN T,THORSTEN P,FELIX N.Detecting inclusion dependencies on very many tables[J].ACM Transaction on Database Systems,2017,42(3):1-29. [12]MARCHI F D,LOPES S,PETIT J M.Unary and n-ary inclusion dependency discovery in relational databases[J].Journal of Intelligent Information Systems,2009,32(1):53-73. [13]PAPENBROCK T,KRUSE S,NAUMANN F.Divide & con-quer-based inclusion dependency discovery[J].Proceedings of the VLDB Endowment,2015,8(7):774-785. [14]ROSTIN A,ALBRECHT O,BAUCKMANN J,et al.A ma-chince learning approach to foreign key discovery[C]//12th International Workshop on the Web and Databases.2009. [15]CHEN Z M,NARASAYYA V,CHAUDHURI S.Fast foreign-key detection in microsoft sql server powerpivot for excel[J].Proceedings of the VLDB Endowment,2014,7(13):1417-1428. [16]ZHANG M H,HADJIELEFTHERIOU M,OOI B C,et al.On multi-column foreign key discovery[J].Proceedings of the VLDB Endowment,2010,3(1):805-814. [17]PELEG S,WERMAN M,ROM H.A unified approach to the change of resolution:space and gray-level[J].TPAMI,1989,11(7):739-742. [18]Web Data Commons-Web Table Corpora[EB/OL].[2018-08-01].http://www.webdatacommons.org/webtables/index.html. [19]DWTC-Dresden Web Table Corpus[EB/OL].[2018-08-01].https://wwwdb.inf.tu-dresden.de/misc/dwtc/. [20]WikiTables:Public Site[EB/OL].[2018-08-01].http://dow-ney-n1.cs.northwestern.edu/public. |
[1] | 傅彦铭, 朱杰夫, 蒋侃, 黄保华, 孟庆文, 周兴. 移动众包中基于多约束工人择优的激励机制研究 Incentive Mechanism Based on Multi-constrained Worker Selection in Mobile Crowdsourcing 计算机科学, 2022, 49(9): 275-282. https://doi.org/10.11896/jsjkx.210700129 |
[2] | 黄璞, 沈阳阳, 杜旭然, 杨章静. 基于局部约束特征线表示的人脸识别 Face Recognition Based on Locality Constrained Feature Line Representation 计算机科学, 2022, 49(6A): 429-433. https://doi.org/10.11896/jsjkx.210300169 |
[3] | 杨辉, 陶力宏, 朱建勇, 聂飞平. 基于锚点的快速无监督图嵌入 Fast Unsupervised Graph Embedding Based on Anchors 计算机科学, 2022, 49(4): 116-123. https://doi.org/10.11896/jsjkx.210200098 |
[4] | 官铮, 邓扬琳, 聂仁灿. 光谱重建约束非负矩阵分解的高光谱与全色图像融合 Non-negative Matrix Factorization Based on Spectral Reconstruction Constraint for Hyperspectral and Panchromatic Image Fusion 计算机科学, 2021, 48(9): 153-159. https://doi.org/10.11896/jsjkx.200900054 |
[5] | 张帆, 宫傲宇, 邓磊, 刘芳, 林艳, 张一晋. 面向实际信道观测环境的时限约束无线下行调度策略 Wireless Downlink Scheduling with Deadline Constraint for Realistic Channel Observation Environment 计算机科学, 2021, 48(9): 264-270. https://doi.org/10.11896/jsjkx.210100143 |
[6] | 徐艺菲, 熊淑华, 孙伟恒, 何小海, 陈洪刚. 基于非局部低秩和自适应量化约束先验的HEVC后处理算法 HEVC Post-processing Algorithm Based on Non-local Low-rank and Adaptive Quantization Constraint Prior 计算机科学, 2021, 48(5): 155-162. https://doi.org/10.11896/jsjkx.200800079 |
[7] | 李笠, 李广鹏, 常亮, 古天龙. 约束进化算法及其应用研究综述 Survey of Constrained Evolutionary Algorithms and Their Applications 计算机科学, 2021, 48(4): 1-13. https://doi.org/10.11896/jsjkx.200600151 |
[8] | 周秋艳, 肖满生, 张龙信, 张晓丽, 杨文理. 多约束条件下生产排程智能优化技术 Intelligent Optimization Technology of Production Scheduling Under Multiple Constraints 计算机科学, 2021, 48(3): 239-245. https://doi.org/10.11896/jsjkx.200300105 |
[9] | 郑建云, 庞建民, 周鑫, 王军. 基于约束推导式的增强型二进制漏洞挖掘 Enhanced Binary Vulnerability Mining Based on Constraint Derivation 计算机科学, 2021, 48(3): 320-326. https://doi.org/10.11896/jsjkx.200700047 |
[10] | 曹波, 陈锋, 成静, 李华, 李永乐. 基于全向路口模型的非结构化道路重复节点路径规划 Route Planning of Unstructured Road Including Repeat Node Based on Bidirectional Search 计算机科学, 2021, 48(11A): 77-80. https://doi.org/10.11896/jsjkx.201200193 |
[11] | 王文博, 罗恒利. 基于图卷积神经网络的完全图人脸聚类 Complete Graph Face Clustering Based on Graph Convolution Network 计算机科学, 2021, 48(11A): 275-277. https://doi.org/10.11896/jsjkx.201200102 |
[12] | 叶亚男, 迟静, 于志平, 战玉丽, 张彩明. 基于改进CycleGan模型和区域分割的表情动画合成 Expression Animation Synthesis Based on Improved CycleGan Model and Region Segmentation 计算机科学, 2020, 47(9): 142-149. https://doi.org/10.11896/jsjkx.190900203 |
[13] | 杨帆, 王俊斌, 白亮. 基于安全性的成对约束扩充算法 Extended Algorithm of Pairwise Constraints Based on Security 计算机科学, 2020, 47(9): 324-329. https://doi.org/10.11896/jsjkx.200700092 |
[14] | 张龙信, 周立前, 文鸿, 肖满生, 邓晓军. 基于异构云计算的成本约束下的工作流能量高效调度算法 Energy Efficient Scheduling Algorithm of Workflows with Cost Constraint in Heterogeneous Cloud Computing Systems 计算机科学, 2020, 47(8): 112-118. https://doi.org/10.11896/jsjkx.200300038 |
[15] | 高方远, 王秀美. 一种基于块对角表示和近邻约束的子空间聚类方法 Subspace Clustering Method Based on Block Diagonal Representation and Neighbor Constraint 计算机科学, 2020, 47(7): 66-70. https://doi.org/10.11896/jsjkx.190600155 |
|