Computer Science, 2022, Vol. 49, Issue (11A): 211000111-6. doi: 10.11896/jsjkx.211000111

• Artificial Intelligence •

Cross-lingual Term Alignment with Kernel-XGBoost

YU Juan, ZHANG Chen

  1. School of Economics and Management, Fuzhou University, Fuzhou 350108, China
  • Online: 2022-11-10 Published: 2022-11-21
  • Corresponding author: ZHANG Chen (zhangchenfzu@163.com)
  • About author: YU Juan, born in 1981, professor, Ph.D. supervisor. Her main research interests include data science and knowledge engineering, and intelligent information systems.
    ZHANG Chen, born in 1997, postgraduate. Her main research interests include cross-language text analysis and knowledge discovery.
  • Supported by:
    National Natural Science Foundation of China (71771054).

Abstract: Cross-lingual term alignment is a key foundation for cross-lingual text data analysis and knowledge discovery. Existing research mostly addresses single-word term alignment and relies heavily on vector space alignment. To address this, a Kernel-XGBoost method is proposed that supports one-to-many alignment between cross-lingual single-word and multi-word terms. Given a cross-lingual parallel corpus, the method obtains synonymous cross-lingual term pairs in two steps: 1) extracting cross-lingual terms and generating candidate term pairs; 2) aligning terms based on cross-lingual word embeddings. Experiments on Chinese-Spanish and Chinese-French term alignment show that the method can reach a Top-5 accuracy of 80%, and that it can effectively support cross-lingual text mining tasks such as cross-lingual information retrieval and ontology construction.

Key words: Cross-lingual, Text analysis, Term alignment, Kernel-XGBoost, Chinese, French, Spanish

CLC Number:

  • G202
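
The abstract reports accuracy at Top-5 for one-to-many term alignment. As a rough illustration of how such a figure might be computed, the sketch below counts a source term as correctly aligned when at least one of its accepted translations appears among its k highest-scored candidates. This definition, the function name `top_k_accuracy`, and the toy terms and scores are all assumptions made for illustration; the scores here are invented numbers, not outputs of the Kernel-XGBoost model.

```python
def top_k_accuracy(scored, gold, k=5):
    """Fraction of source terms with at least one gold target in the top k.

    scored: {source_term: [(target_term, score), ...]} (hypothetical format)
    gold:   {source_term: set of acceptable target terms}
    """
    if not scored:
        return 0.0
    hits = 0
    for src, candidates in scored.items():
        # Rank candidates by descending score and keep the first k targets.
        ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)[:k]
        top_targets = {target for target, _ in ranked}
        # One-to-many alignment: any accepted translation counts as a hit.
        if top_targets & gold.get(src, set()):
            hits += 1
    return hits / len(scored)


# Toy Chinese-Spanish data with made-up scores (illustrative only).
toy_scored = {
    "安全理事会": [("Consejo de Seguridad", 0.91), ("Asamblea General", 0.40)],
    "可持续发展": [
        ("desarrollo económico", 0.75),
        ("desarrollo sostenible", 0.70),
        ("cambio climático", 0.20),
    ],
}
toy_gold = {
    "安全理事会": {"Consejo de Seguridad"},
    "可持续发展": {"desarrollo sostenible", "desarrollo sustentable"},
}

print(top_k_accuracy(toy_scored, toy_gold, k=1))
print(top_k_accuracy(toy_scored, toy_gold, k=2))
```

Accepting a set of gold targets per source term, rather than a single one, is what lets the same metric cover one-to-many alignments; the paper's own evaluation protocol may differ in detail.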