Computer Science, 2019, Vol. 46, Issue (4): 8-13. doi: 10.11896/j.issn.1002-137X.2019.04.002

• Big Data & Data Science •

Quality Control Agent Based on Probability Inference

XU Yao-li, LI Zhan-huai   

  1. School of Computer Science and Engineering, Northwestern Polytechnical University, Xi’an 710072, China
    Key Laboratory of Big Data Storage and Management, Northwestern Polytechnical University, Ministry of Industry and Information Technology, Xi’an 710129, China
  • Received: 2018-12-04 Online: 2019-04-15 Published: 2019-04-23

Abstract: Entity resolution (ER) is a fundamental problem in data integration and cleaning. Inconsistency reconciliation (IR) further improves resolution performance by reconciling the pairs on which existing ER approaches disagree. However, previous IR approaches offer no quality guarantee for the reconciliation result. To address this limitation, this paper proposes, for the first time, a quality control agent based on probability inference, denoted QCAgent. QCAgent requires no manually labeled pairs and automatically outputs the reconciliation result with the highest recall subject to a given precision threshold. Its core idea is as follows. First, an outlier detection model estimates the matching probability of each inconsistent pair, and the precision and recall estimated from these probabilities serve as the environmental feedback. Next, a binary search algorithm selects a flipping solution as QCAgent's next action, chosen so that the flipped reconciliation result satisfies the precision requirement with the highest recall. The outlier detection model is then retrained on the new consistent pairs, and the precision and recall of the flipped reconciliation result are re-estimated. The iteration terminates when the latest estimated precision meets the constraint. Experimental results on real data show that QCAgent effectively solves the quality control problem for reconciliation results.
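As an illustration of the precision-constrained selection step described in the abstract, the following is a minimal sketch, not the paper's actual procedure. It assumes that matching probabilities for the candidate pairs are already available (in the paper they come from an outlier detection model, which is omitted here), that estimated precision is the mean probability of the pairs labelled as matches, and that estimated recall is the expected number of true matches among them divided by the expected total. The function names `estimate_quality` and `largest_k_meeting_precision` are hypothetical. Because the mean of the top-k probabilities never increases as k grows, a binary search can find the largest k that still meets the precision threshold, which is also the k with the highest estimated recall.

```python
from itertools import accumulate


def estimate_quality(probs, k):
    """Estimate precision and recall if the k most probable pairs are labelled matches.

    Assumption: each probability is read as the chance that the pair is a true
    match, so sums of probabilities stand in for expected match counts.
    """
    ranked = sorted(probs, reverse=True)
    expected_hits = sum(ranked[:k])      # expected true matches among the top k
    expected_total = sum(ranked)         # expected true matches overall
    precision = expected_hits / k if k else 1.0
    recall = expected_hits / expected_total if expected_total else 0.0
    return precision, recall


def largest_k_meeting_precision(probs, min_precision):
    """Binary-search the largest k whose estimated precision is >= min_precision.

    The mean of the top-k probabilities is non-increasing in k, so the
    predicate "precision >= min_precision" is true up to some k and false after.
    """
    ranked = sorted(probs, reverse=True)
    prefix = [0.0] + list(accumulate(ranked))  # prefix[k] = sum of top-k probabilities
    lo, hi = 0, len(ranked)
    while lo < hi:
        mid = (lo + hi + 1) // 2               # upper midpoint, always >= 1
        if prefix[mid] / mid >= min_precision:
            lo = mid                           # mid is feasible, try a larger k
        else:
            hi = mid - 1                       # mid is infeasible, shrink
    return lo


if __name__ == "__main__":
    # Hypothetical matching probabilities for five inconsistent pairs.
    probs = [0.9, 0.8, 0.6, 0.3, 0.1]
    k = largest_k_meeting_precision(probs, min_precision=0.75)
    print(k, estimate_quality(probs, k))
```

In an iterative setting such as the one the abstract outlines, a step like this would be repeated after the probability model is retrained, stopping once the newly estimated precision satisfies the threshold.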

Key words: Quality control, Entity resolution, Inconsistency reconciliation, Agent, Precision

CLC Number: TP391