计算机科学 ›› 2020, Vol. 47 ›› Issue (12): 327-331.doi: 10.11896/jsjkx.191100176

• 信息安全 • 上一篇    下一篇

基于编辑距离的多实体可信确认算法

孙国梓, 吕建伟, 李华康   

  1. 南京邮电大学计算机学院 南京 210003
  • 收稿日期:2019-11-25 修回日期:2019-12-23 发布日期:2020-12-17
  • 通讯作者: 孙国梓(sun@njupt.edu.cn)
  • 基金资助:
    国家自然科学基金(615022471150130261502243);中国博士后科学基金(2016M6004342016M591840);江苏省博士后科研基金(1601128B);江西省经济犯罪侦查与防控技术协同创新中心开放基金资助课题(JXJZXTCX-015);数字工程与先进计算重点实验室开放课题(2017A10)

MeTCa:Multi-entity Trusted Confirmation Algorithm Based on Edit Distance

SUN Guo-zi, LYU Jian-wei, LI Hua-kang   

  1. School of Computer Science and Technology Nanjing University of Posts and Telecommunications Nanjing 210003,China
  • Received:2019-11-25 Revised:2019-12-23 Published:2020-12-17
  • About author:SUN Guo-zi,born in 1972Ph.Dprofessoris a senior member of China Computer Federation.His main research interests include cyberspace securitydigi-tal forensicsand blockchain.
  • Supported by:
    National Natural Science Foundation of China(61502247,11501302,61502243),China Postdoctoral Science Foundation(2016M600434,2016M591840),Jiangsu Postdoctoral Research Foundation (1601128B),Economic Crime Investigation and Prevention and Control in Jiangxi Province Supported by the Open Fund of the Collaborative Innovation Center of Technology(JXJZXTCX-015) and Open Project of the Key Laboratory of Digital Engineering and Advanced Computing(2017A10).

摘要: 随着自媒体的蓬勃发展任何人都可以在网上随意发布和转发信息而这些信息可能是真实的也可能是道听途说或被故意篡改的.互联网上数据的严重冗余和弱可信问题导致现有数据的可用性很差.Bi-LSTM-CRF(Bi-Long Short Term Memory with Conditional Random Field Layer)网络虽然能够解决数据中命名实体识别的准确率问题但不能满足识别出的实体是可信的这一要求.文中提出一种基于编辑距离的多实体可信确认算法并通过人物命名实体识别实例对该算法进行验证.首先通过分布式爬虫抓取同一个邮箱地址在多个搜索引擎上的TopN网页记录然后使用经过双语语料训练后的Bi-LSTM-CRF模型抽取每个页面内的人物命名实体最后通过实体多参数融合确定邮箱所对应的人物命名实体.实验结果表明多实体可信确认算法能够将邮箱地址与邮箱真实主人的匹配准确率MRR(Mean Reciprocal Rank)提高到91.32%相比只使用词频的算法其MRR提升了23.08%.实验数据充分说明多实体可信确认算法能很好地从弱可信数据中获得强可信度的实体降低海量数据中的低质特性从而有效地增强实体数据源的可信度.

关键词: 编辑距离, 多实体可信确认算法, 弱可信数据, 双向长短时记忆循环-条件随机场网络

Abstract: With the development of We-mediaevery individual can publish and forward information on the internet at will.The information may have real recordsbut it may also be hearsay or even contents being intentionally tampered with.The data on the Internet has serious redundancy and weak credibility problemsthus resulting in low availability of existing network media data.Although the Bi-LSTM-CRF network can solve the problem of the accuracy of named entity recognition in datait cannot meet the requirement that the identified entity is credible.In this papera multi-parameter fusion credible confirmation algorithm based on multi-source weakly trusted data is proposedwhich is verified by identifying instances of person named entities.This paper uses distributed spiders to crawl Top N pages with the same mailbox address on multiple search engines.AfterwardsBi-LSTM-CRF algorithm trained by bilingual corpus is adopted to extract person named entities from each page.Finallythe person named entities corresponding to the mailbox are determined by multi-parameter entity fusion trusted confirmation algorithm.The experimental results show that the multi-parameter fusion credible confirmation algorithm can improve the accuracy of MRR (MRR) of the matching between the mailbox address and the real owner of the mailbox to 91.32%which is 23.08% higher than the traditional algorithm using only the term frequency model.The experimental data reasonably demonstrates that the multi-parameter fusion credible confirmation algorithm can obtain strong credibility entities from weakly trusted data and reduce the low-quality characteristics of massive datathus effectively enhancing the credibility of entity data sources.

Key words: Bi-LSTM-CRF, Edit distance, Multi-parameter fusion trusted confirmation algorithm, Weak trusted data

中图分类号: 

  • TP311
[1] GUO J,HUANG C S.Research progress of information overload in foreign network environment[J].Information Science,2018,323(7):172-178.
[2] GRIDACH M.Character-level neural network for biomedicalnamed entity recognition[J].Journal of biomedical informatics,2018,70(6):85-91.
[3] CLICHE M.BB_twtr at SemEval-2017 task 4:twitter sentiment analysis with CNNs and LSTMs[J].arXiv:2017,1704.06125.
[4] LAMPLE G,BALLESTEROS M,SUBRAMANIANS,et al.Neural architectures for named entity recognition[J].arXiv:2016,1603.01360.
[5] BIKEL D M,SCHWARTZ R,WEISCHEDEL R M.An algo-rithm that learns what's in a name[J].Machine Learning,1999,34(1/2/3):211-231.
[6] LAFFERTY J,MCCALLUM A,PEREIRA F C N.Conditional random fields:Probabilistic models for segmenting and labeling sequence data[J].Machine Learning,2001,7:301-311.
[7] ZHOU G D,SU J.Named entity recognition using an HMM-based chunk tagger[C]//Proceedings of the 40th Annual Meeting on Association for Computational Linguistics.Association for Computational Linguistics,2002:473-480.
[8] MCCALLUM A,LI W.Early results for named entity recognition with conditional random fields,feature induction and web-enhanced lexicons[C]//Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003-Volume 4.Association for Computational Linguistics,2003:188-191.
[9] HOCHREITER S,SCHMIDHUBER J.Long short-term memory[J].Neural Computation,1997,9(8):1735-1780.
[10] COLLOBERT R,WESTON J,BOTTOU L,et al.Natural language processing (almost) from scratch[J].Journal of Machine Learning Research,2011,12(8):2493-2537.
[11] HUANG Z,XU W,YU K.Bidirectional LSTM-CRF models for sequence tagging[J].arXiv:1508.01991,2015.
[12] QIU Y F,TIAN Z P,JI W Y,et al.An Efficient Method for Detecting Similar Repetitive Records[J].Chinese Journal of Computers,2001(1):69-77.
[13] TAN M C,CAO J J.A method for calculating string similarity with multiple editing distances[J].Application Research of Computers,2010,27(12):4523-4525.
[14] YE X.Approximate longest common substring matching and optimization algorithm for editing distance constraints[D].Northeastern University,2014.
[15] ZHANG C Z,MA S T,JIE C Y,et al.Study on Parallel Web Page Recognition Based on Beneficial URL Matching Mode Credibility[J].Journal of Chinese Information Processing,2018,32(3):91-100.
[16] YE X M,MAO X Q,XIA J C,et al.Improvement of text classification TF-IDF algorithm[J].Computer Engineering and Applications,2019,55(2):104-109.
[17] ZHENG K,OUYANG L Y,LIN Q,et al.Research on LCS Algorithm and Edit Distance Algorithm[J].Information Communication,2015(5):22-23.
[18] LING W,LU T,MARUJO L,et al.Finding function in form:compositional character models for open vocabulary word representation[J].Computer Science,2015,11:1899-1907.
[19] GRAVES A,SCHMIDHUBER J.Framewise phoneme classification with bidirectional LSTM networks[C]//IEEE International Joint Conference on Neural Networks.IEEE,2005,40(7):1482-1488.
[1] 窦家维.
保护隐私的汉明距离与编辑距离计算及应用
Privacy-preserving Hamming and Edit Distance Computation and Applications
计算机科学, 2022, 49(9): 355-360. https://doi.org/10.11896/jsjkx.220100241
[2] 项英倬, 谭菊仙, 韩杰思, 石浩.
图匹配技术研究
Survey of Graph Matching Algorithms
计算机科学, 2018, 45(6): 27-31. https://doi.org/10.11896/j.issn.1002-137X.2018.06.004
[3] 徐周波,张鵾,宁黎华,古天龙.
图编辑距离概述
Summary of Graph Edit Distance
计算机科学, 2018, 45(4): 11-18. https://doi.org/10.11896/j.issn.1002-137X.2018.04.002
[4] 王伟,陈志高,孟宪凯,李伟.
基于熵的音频指纹检索技术研究与实现
Research and Implementation of Identifying Music through Performances Using Entropy Based Audio-fingerprint
计算机科学, 2017, 44(Z6): 551-556. https://doi.org/10.11896/j.issn.1002-137X.2017.6A.123
[5] 张润梁,牛之贤.
基于基本操作序列的编辑距离顺序验证
Sequential Verification Algorithm to Compute Edit Distance Based on Edit Operation Sequence
计算机科学, 2016, 43(Z6): 51-54. https://doi.org/10.11896/j.issn.1002-137X.2016.6A.011
[6] 杨艳林,叶枫,吕鑫,余霖,刘璇.
一种基于DTW聚类的水文时间序列相似性挖掘方法
DTW Clustering-based Similarity Mining Method for Hydrological Time Series
计算机科学, 2016, 43(2): 245-249. https://doi.org/10.11896/j.issn.1002-137X.2016.02.051
[7] 李景玉,张仰森,陈若愚.
面向用户查询意图的句子相似度分层计算
User Query Intention Oriented Hierarchical Sentence Similarity Computation
计算机科学, 2015, 42(1): 227-231. https://doi.org/10.11896/j.issn.1002-137X.2015.01.050
[8] 向林泓,张炬,孙启龙,赵学良.
基于Relative-IDF的医药数据相似度算法研究
Medical Data Similarity Algorithm Analysis Based on Relative-IDF
计算机科学, 2014, 41(Z6): 417-420.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!