计算机科学 ›› 2020, Vol. 47 ›› Issue (12): 327-331.doi: 10.11896/jsjkx.191100176
孙国梓, 吕建伟, 李华康
SUN Guo-zi, LYU Jian-wei, LI Hua-kang
摘要: 随着自媒体的蓬勃发展任何人都可以在网上随意发布和转发信息而这些信息可能是真实的也可能是道听途说或被故意篡改的.互联网上数据的严重冗余和弱可信问题导致现有数据的可用性很差.Bi-LSTM-CRF(Bi-Long Short Term Memory with Conditional Random Field Layer)网络虽然能够解决数据中命名实体识别的准确率问题但不能满足识别出的实体是可信的这一要求.文中提出一种基于编辑距离的多实体可信确认算法并通过人物命名实体识别实例对该算法进行验证.首先通过分布式爬虫抓取同一个邮箱地址在多个搜索引擎上的TopN网页记录然后使用经过双语语料训练后的Bi-LSTM-CRF模型抽取每个页面内的人物命名实体最后通过实体多参数融合确定邮箱所对应的人物命名实体.实验结果表明多实体可信确认算法能够将邮箱地址与邮箱真实主人的匹配准确率MRR(Mean Reciprocal Rank)提高到91.32%相比只使用词频的算法其MRR提升了23.08%.实验数据充分说明多实体可信确认算法能很好地从弱可信数据中获得强可信度的实体降低海量数据中的低质特性从而有效地增强实体数据源的可信度.
中图分类号:
[1] GUO J,HUANG C S.Research progress of information overload in foreign network environment[J].Information Science,2018,323(7):172-178. [2] GRIDACH M.Character-level neural network for biomedicalnamed entity recognition[J].Journal of biomedical informatics,2018,70(6):85-91. [3] CLICHE M.BB_twtr at SemEval-2017 task 4:twitter sentiment analysis with CNNs and LSTMs[J].arXiv:2017,1704.06125. [4] LAMPLE G,BALLESTEROS M,SUBRAMANIANS,et al.Neural architectures for named entity recognition[J].arXiv:2016,1603.01360. [5] BIKEL D M,SCHWARTZ R,WEISCHEDEL R M.An algo-rithm that learns what's in a name[J].Machine Learning,1999,34(1/2/3):211-231. [6] LAFFERTY J,MCCALLUM A,PEREIRA F C N.Conditional random fields:Probabilistic models for segmenting and labeling sequence data[J].Machine Learning,2001,7:301-311. [7] ZHOU G D,SU J.Named entity recognition using an HMM-based chunk tagger[C]//Proceedings of the 40th Annual Meeting on Association for Computational Linguistics.Association for Computational Linguistics,2002:473-480. [8] MCCALLUM A,LI W.Early results for named entity recognition with conditional random fields,feature induction and web-enhanced lexicons[C]//Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003-Volume 4.Association for Computational Linguistics,2003:188-191. [9] HOCHREITER S,SCHMIDHUBER J.Long short-term memory[J].Neural Computation,1997,9(8):1735-1780. [10] COLLOBERT R,WESTON J,BOTTOU L,et al.Natural language processing (almost) from scratch[J].Journal of Machine Learning Research,2011,12(8):2493-2537. [11] HUANG Z,XU W,YU K.Bidirectional LSTM-CRF models for sequence tagging[J].arXiv:1508.01991,2015. [12] QIU Y F,TIAN Z P,JI W Y,et al.An Efficient Method for Detecting Similar Repetitive Records[J].Chinese Journal of Computers,2001(1):69-77. [13] TAN M C,CAO J J.A method for calculating string similarity with multiple editing distances[J].Application Research of Computers,2010,27(12):4523-4525. [14] YE X.Approximate longest common substring matching and optimization algorithm for editing distance constraints[D].Northeastern University,2014. [15] ZHANG C Z,MA S T,JIE C Y,et al.Study on Parallel Web Page Recognition Based on Beneficial URL Matching Mode Credibility[J].Journal of Chinese Information Processing,2018,32(3):91-100. [16] YE X M,MAO X Q,XIA J C,et al.Improvement of text classification TF-IDF algorithm[J].Computer Engineering and Applications,2019,55(2):104-109. [17] ZHENG K,OUYANG L Y,LIN Q,et al.Research on LCS Algorithm and Edit Distance Algorithm[J].Information Communication,2015(5):22-23. [18] LING W,LU T,MARUJO L,et al.Finding function in form:compositional character models for open vocabulary word representation[J].Computer Science,2015,11:1899-1907. [19] GRAVES A,SCHMIDHUBER J.Framewise phoneme classification with bidirectional LSTM networks[C]//IEEE International Joint Conference on Neural Networks.IEEE,2005,40(7):1482-1488. |
[1] | 窦家维. 保护隐私的汉明距离与编辑距离计算及应用 Privacy-preserving Hamming and Edit Distance Computation and Applications 计算机科学, 2022, 49(9): 355-360. https://doi.org/10.11896/jsjkx.220100241 |
[2] | 项英倬, 谭菊仙, 韩杰思, 石浩. 图匹配技术研究 Survey of Graph Matching Algorithms 计算机科学, 2018, 45(6): 27-31. https://doi.org/10.11896/j.issn.1002-137X.2018.06.004 |
[3] | 徐周波,张鵾,宁黎华,古天龙. 图编辑距离概述 Summary of Graph Edit Distance 计算机科学, 2018, 45(4): 11-18. https://doi.org/10.11896/j.issn.1002-137X.2018.04.002 |
[4] | 王伟,陈志高,孟宪凯,李伟. 基于熵的音频指纹检索技术研究与实现 Research and Implementation of Identifying Music through Performances Using Entropy Based Audio-fingerprint 计算机科学, 2017, 44(Z6): 551-556. https://doi.org/10.11896/j.issn.1002-137X.2017.6A.123 |
[5] | 张润梁,牛之贤. 基于基本操作序列的编辑距离顺序验证 Sequential Verification Algorithm to Compute Edit Distance Based on Edit Operation Sequence 计算机科学, 2016, 43(Z6): 51-54. https://doi.org/10.11896/j.issn.1002-137X.2016.6A.011 |
[6] | 杨艳林,叶枫,吕鑫,余霖,刘璇. 一种基于DTW聚类的水文时间序列相似性挖掘方法 DTW Clustering-based Similarity Mining Method for Hydrological Time Series 计算机科学, 2016, 43(2): 245-249. https://doi.org/10.11896/j.issn.1002-137X.2016.02.051 |
[7] | 李景玉,张仰森,陈若愚. 面向用户查询意图的句子相似度分层计算 User Query Intention Oriented Hierarchical Sentence Similarity Computation 计算机科学, 2015, 42(1): 227-231. https://doi.org/10.11896/j.issn.1002-137X.2015.01.050 |
[8] | 向林泓,张炬,孙启龙,赵学良. 基于Relative-IDF的医药数据相似度算法研究 Medical Data Similarity Algorithm Analysis Based on Relative-IDF 计算机科学, 2014, 41(Z6): 417-420. |
|