Computer Science, 2019, Vol. 46, Issue (4): 8-13. doi: 10.11896/j.issn.1002-137X.2019.04.002

• Big Data & Data Science •

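Quality Control Agent Based on Probability Inference

XU Yao-li, LI Zhan-huai

  1. School of Computer Science and Engineering, Northwestern Polytechnical University, Xi’an 710072, China
    Key Laboratory of Big Data Storage and Management, Northwestern Polytechnical University, Ministry of Industry and Information Technology, Xi’an 710129, China
  • Received: 2018-12-04 Online: 2019-04-15 Published: 2019-04-23
  • Corresponding author: LI Zhan-huai (born 1961), male, Ph.D. His main research interests are data management and data quality. E-mail: lizhh@nwpu.edu.cn
  • About the author: XU Yao-li (born 1987), female, Ph.D. candidate, CCF student member. Her main research interests are data repairing, entity resolution, and inconsistency reconciliation.
  • Funding:
    This work was supported by the National Key Research and Development Program of the Ministry of Science and Technology of China (2016YFB1000703), the Key Program of the National Natural Science Foundation of China (61732014, 61332006), the General Program of the National Natural Science Foundation of China (61472321, 61672432), the Young Scientists Fund of the National Natural Science Foundation of China (61502390), the Natural Science Basic Research Plan of Shaanxi Province (2018JM6086), and the Fundamental Research Funds for the Central Universities at Northwestern Polytechnical University (3102017jg02002).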

Abstract: Entity resolution (ER) is a fundamental problem in data integration and data cleaning. Inconsistency reconciliation (IR) further improves resolution quality by reconciling the record pairs on which existing ER approaches disagree. However, existing IR approaches share a limitation: the reconciliation result carries no quality guarantee. To address this problem, this paper proposes, for the first time, a quality control agent based on probability inference, denoted QCAgent. QCAgent requires neither a training data set nor manually labeled pairs, and it outputs the reconciliation result with the highest recall subject to a given precision threshold. Its core idea is as follows. First, an outlier detection model estimates the matching probability of each inconsistent pair; precision and recall are then estimated from these probabilities and serve as feedback from the environment. Second, a binary search algorithm selects, as the next action of QCAgent, the flipping solution that satisfies the precision requirement with the highest recall. Third, the outlier detection model is retrained on the updated consistent pairs, and the precision and recall of the flipped reconciliation result are estimated again. This loop repeats, and the iteration terminates when the newly estimated precision meets the constraint. Experimental results on real data show that QCAgent can effectively solve the quality control problem for reconciliation results.

Key words: Agent, Entity resolution, Inconsistency reconciliation, Precision, Quality control
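
The selection step described in the abstract (pick the result with the highest recall that still meets a precision threshold) can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes that each pair already has a matching probability (here supplied by hand instead of by an outlier detection model), that candidate solutions are simply "label the top-k most probable pairs as matches", and that precision and recall are estimated as expected values of those probabilities. The function name `max_recall_cutoff` and its parameters are hypothetical.

```python
from itertools import accumulate


def max_recall_cutoff(probs, min_precision):
    """Return (k, est_precision, est_recall) for the largest top-k cutoff
    whose estimated precision is at least min_precision.

    Assumption (not from the paper): pairs are ranked by matching
    probability and the top-k pairs are labeled as matches.
    """
    ranked = sorted(probs, reverse=True)
    # prefix[k] = expected number of true matches among the top-k pairs
    prefix = [0.0] + list(accumulate(ranked))
    total = prefix[-1]  # expected number of true matches overall

    # The mean of a descending sequence's prefix never increases with k,
    # so estimated precision is monotone and binary search applies.
    lo, hi = 0, len(ranked)
    while lo < hi:
        mid = (lo + hi + 1) // 2  # mid >= 1, so the division is safe
        if prefix[mid] / mid >= min_precision:
            lo = mid       # cutoff mid is feasible, try a larger one
        else:
            hi = mid - 1   # cutoff mid violates the precision constraint

    # Convention: an empty answer set has precision 1.0 and recall 0.0.
    est_precision = prefix[lo] / lo if lo else 1.0
    est_recall = prefix[lo] / total if total > 0 else 0.0
    return lo, est_precision, est_recall


if __name__ == "__main__":
    # Hypothetical matching probabilities for five inconsistent pairs.
    k, p, r = max_recall_cutoff([0.9, 0.8, 0.95, 0.3, 0.1], min_precision=0.85)
    print(k, round(p, 3), round(r, 3))
```

Because estimated recall grows with k while estimated precision shrinks, the largest feasible k is also the cutoff with the highest estimated recall. In the iterative scheme the abstract describes, these probabilities would be re-estimated after each retraining round before the search runs again.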

CLC number:

  • TP391