Computer Science (计算机科学), 2019, Vol. 46, Issue (1): 238-244. doi: 10.11896/j.issn.1002-137X.2019.01.037

• Artificial Intelligence •

Cross-language Knowledge Linking Based on Bilingual Topic Model and Bilingual Embedding

YU Yuan-yuan1, CHAO Wen-han1, HE Yue-ying2, LI Zhou-jun1

  1. School of Computer Science and Engineering, Beihang University, Beijing 100191, China
  2. National Computer Network Emergency Response Technical Team/Coordination Center, Beijing 100029, China
  • Received: 2018-01-24 Online: 2019-01-15 Published: 2019-02-25
  • About the authors: YU Yuan-yuan (born 1993), female, master's student; main research interest: natural language processing; E-mail: yuyuanyuan0823@buaa.edu.cn. CHAO Wen-han (born 1979), male, Ph.D., lecturer, CCF member; main research interests: natural language processing and machine translation; E-mail: chaowenhan@buaa.edu.cn (corresponding author). HE Yue-ying (born 1975), male, Ph.D., professor-level senior engineer; main research interests: artificial intelligence and industrial Internet of Things security. LI Zhou-jun (born 1963), male, Ph.D., professor, doctoral supervisor, CCF senior member; main research interests: data mining and artificial intelligence, network and information security.

Abstract: Cross-language knowledge linking (CLKL) establishes links between online encyclopedia articles in different languages that describe the same content. The task can be divided into two stages: candidate selection and candidate ranking. For candidate selection, this paper formulates the problem as cross-language information retrieval and proposes generating queries by combining the article title with keywords, which raises the recall of candidate selection to 93.8%. For candidate ranking, it proposes a ranking model that combines a bilingual topic model with bilingual word embeddings, and applies it to link military-domain articles between English Wikipedia and Chinese Baidu Baike. Experimental results show that the model achieves an accuracy of 75%, a significant improvement in CLKL performance. Because the proposed method does not depend on language-specific or domain-specific characteristics, it can be readily extended to CLKL in other languages and other domains.

Key words: Bilingual embedding, Bilingual topic model, Cross-language information retrieval, Cross-language knowledge linking
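The two-stage pipeline described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names (`build_query`, `select_candidates`, `rank_candidates`), the frequency-based keyword picker, the toy bilingual dictionary `translate`, and the fixed interpolation `weight` are hypothetical. The abstract does not specify the paper's keyword extractor, retrieval engine, or how the topic and embedding features are combined, so a simple term-overlap retriever and a weighted sum of cosine similarities stand in for them.

```python
import math
from collections import Counter


def build_query(title, body_tokens, top_k=3):
    """Stage 1a: combine title tokens with the top_k most frequent body tokens.

    Raw frequency is a stand-in for a real keyword extractor (assumption).
    """
    title_tokens = title.lower().split()
    counts = Counter(t for t in body_tokens if t not in title_tokens)
    keywords = [word for word, _ in counts.most_common(top_k)]
    return title_tokens + keywords


def select_candidates(query, translate, articles, n=2):
    """Stage 1b: translate query terms and keep the n articles with most overlap.

    `translate` is a hypothetical source-to-target dictionary and `articles`
    maps target-language article names to their token lists.
    """
    translated = {translate[t] for t in query if t in translate}
    scored = [(len(translated & set(tokens)), name)
              for name, tokens in articles.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored[:n] if score > 0]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(source, candidates, weight=0.5):
    """Stage 2: order candidates by a weighted sum of embedding and topic similarity.

    `emb` is assumed to live in a shared bilingual embedding space and `topics`
    to be a bilingual topic distribution; the fixed weight replaces a learned
    ranking model (assumption).
    """
    def score(candidate):
        return (weight * cosine(source["emb"], candidate["emb"])
                + (1 - weight) * cosine(source["topics"], candidate["topics"]))
    return sorted(candidates, key=score, reverse=True)


if __name__ == "__main__":
    query = build_query("Type 99 tank", ["tank", "armor", "gun", "armor"], top_k=1)
    print(query)
```

In this sketch the candidate set from `select_candidates` would be turned into feature vectors and passed to `rank_candidates`, whose first element is the predicted link.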

CLC number: TP391