Computer Science ›› 2020, Vol. 47 ›› Issue (12): 327-331. doi: 10.11896/jsjkx.191100176

MeTCa: Multi-entity Trusted Confirmation Algorithm Based on Edit Distance

SUN Guo-zi, LYU Jian-wei, LI Hua-kang   

  1. School of Computer Science and Technology, Nanjing University of Posts and Telecommunications, Nanjing 210003, China
  • Received: 2019-11-25; Revised: 2019-12-23; Published: 2020-12-17
  • About author: SUN Guo-zi, born in 1972, Ph.D, professor, is a senior member of China Computer Federation. His main research interests include cyberspace security, digital forensics, and blockchain.
  • Supported by:
    National Natural Science Foundation of China (61502247, 11501302, 61502243), China Postdoctoral Science Foundation (2016M600434, 2016M591840), Jiangsu Postdoctoral Research Foundation (1601128B), Open Fund of the Jiangxi Province Collaborative Innovation Center for Economic Crime Investigation and Prevention and Control Technology (JXJZXTCX-015), and Open Project of the Key Laboratory of Digital Engineering and Advanced Computing (2017A10).

Abstract: With the development of We-media, any individual can publish and forward information on the Internet at will. Such information may be a faithful record, but it may also be hearsay or content that has been intentionally tampered with. Data on the Internet is therefore highly redundant and weakly credible, which lowers the availability of existing network media data. Although the Bi-LSTM-CRF network improves the accuracy of named entity recognition, it cannot guarantee that the identified entities are credible. This paper proposes a multi-parameter fusion credible confirmation algorithm based on multi-source weakly trusted data and verifies it by identifying instances of person named entities. Distributed spiders crawl the top N pages returned for the same mailbox address by multiple search engines. A Bi-LSTM-CRF model trained on a bilingual corpus then extracts person named entities from each page. Finally, the multi-parameter entity fusion trusted confirmation algorithm determines which person named entity corresponds to the mailbox. The experimental results show that the algorithm raises the mean reciprocal rank (MRR) of matching a mailbox address to its real owner to 91.32%, which is 23.08% higher than the traditional algorithm that uses only the term frequency model. These results indicate that the multi-parameter fusion credible confirmation algorithm can obtain strongly credible entities from weakly trusted data and mitigate the low quality of massive data, thus effectively enhancing the credibility of entity data sources.
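
The MRR figure reported above can be made concrete with a small sketch. The following Python is only an illustration of the standard mean reciprocal rank definition, not the authors' evaluation code; the function name `mean_reciprocal_rank` and the input layout (one ranked candidate list per mailbox, plus the true owner) are assumptions introduced here.

```python
from typing import Hashable, Sequence


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[Hashable]],
    truths: Sequence[Hashable],
) -> float:
    """Average of 1/rank of the true answer over all queries.

    rankings[i] is the ranked candidate list for query i (best first);
    truths[i] is the correct answer for query i. A query whose correct
    answer is missing from its ranking contributes 0.
    (Hypothetical helper for illustration only.)
    """
    if len(rankings) != len(truths):
        raise ValueError("rankings and truths must have the same length")
    if not rankings:
        return 0.0

    total = 0.0
    for ranked, truth in zip(rankings, truths):
        for position, candidate in enumerate(ranked, start=1):
            if candidate == truth:
                total += 1.0 / position
                break
    return total / len(rankings)
```

Under this definition, a mailbox whose real owner is ranked first contributes 1, one whose owner is ranked second contributes 0.5, and one whose owner is never proposed contributes 0.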

Key words: Bi-LSTM-CRF, Edit distance, Multi-parameter fusion trusted confirmation algorithm, Weak trusted data
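
For readers unfamiliar with the edit distance named in the title and key words, a minimal sketch of the classic Levenshtein distance follows. It is a textbook formulation, not the paper's fusion algorithm; the helper `similarity` and its length normalization are assumptions added here for illustration, and the abstract does not state how the distance is combined with other parameters.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical normalization of edit distance into [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest
```

A score of this kind could, for example, compare a candidate person name with the local part of a mailbox address, although whether the paper does so is not stated in the abstract.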

CLC Number: 

  • TP311