Cross-lingual Term Alignment with Kernel-XGBoost

doi:10.11896/jsjkx.211000111

Abstract

Cross-lingual term alignment is a crucial step in cross-lingual text analysis and knowledge discovery. Current research usually focuses on aligning single-word terms and relies heavily on vector space alignment. To address these limitations, a new Kernel-XGBoost method is proposed for the one-to-many alignment of cross-lingual terms, including multi-word terms. Given a cross-lingual parallel corpus, the method obtains synonymous cross-lingual terms in two steps: 1) extracting cross-lingual terms and generating candidate term pairs; 2) aligning cross-lingual terms based on word embeddings. Experiments on Chinese-Spanish and Chinese-French term alignment show that the method can reach an accuracy of 80% at Top-5. It can effectively support cross-lingual text mining tasks such as information retrieval and ontology building.
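The two-step pipeline and the Top-5 measure can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy Chinese-Spanish data, and the co-occurrence ranking are all assumptions made for illustration. In particular, the ranking step here simply counts how often two terms appear in the same aligned sentence pair, standing in for the embedding-based Kernel-XGBoost model, and "accuracy at Top-k" is assumed to mean that at least one reference translation appears among the k highest-ranked candidates.

```python
from collections import Counter
from itertools import product


def candidate_pairs(aligned_terms):
    """Count how often each (source term, target term) pair co-occurs
    in the same aligned sentence pair."""
    counts = Counter()
    for src_terms, tgt_terms in aligned_terms:
        # Pair every source term with every target term of the aligned sentence.
        counts.update(product(set(src_terms), set(tgt_terms)))
    return counts


def rank_candidates(counts):
    """Group candidates by source term, most frequent co-occurrence first.
    Placeholder ranking only; it stands in for the Kernel-XGBoost scorer."""
    grouped = {}
    for (src, tgt), n in counts.items():
        grouped.setdefault(src, []).append((n, tgt))
    return {
        src: [tgt for n, tgt in sorted(pairs, key=lambda p: (-p[0], p[1]))]
        for src, pairs in grouped.items()
    }


def top_k_accuracy(ranked, gold, k=5):
    """Share of source terms with at least one gold translation in the top k."""
    hits = sum(
        1 for src, refs in gold.items()
        if set(ranked.get(src, [])[:k]) & set(refs)
    )
    return hits / len(gold) if gold else 0.0


# Toy data invented for illustration: terms already extracted per sentence pair.
corpus = [
    (["联合国", "安全理事会"], ["Naciones Unidas", "Consejo de Seguridad"]),
    (["联合国"], ["Naciones Unidas"]),
    (["安全理事会"], ["Consejo de Seguridad"]),
]
gold = {"联合国": {"Naciones Unidas"}, "安全理事会": {"Consejo de Seguridad"}}

ranked = rank_candidates(candidate_pairs(corpus))
print(top_k_accuracy(ranked, gold, k=1))
```

Because the gold standard maps each source term to a set of translations, the same evaluation function covers the one-to-many case: a source term counts as correct as soon as any one of its reference translations is retrieved.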

Key words: Cross-lingual, Text analysis, Term alignment, Kernel-XGBoost, Chinese, French, Spanish

CLC Number: G202

YU Juan, ZHANG Chen. Cross-lingual Term Alignment with Kernel-XGBoost[J]. Computer Science, 2022, 49(11A): 211000111-6.
