Computer Science ›› 2023, Vol. 50 ›› Issue (3): 72-82. DOI: 10.11896/jsjkx.220700249

• Special Issue of Knowledge Engineering Enabled By Knowledge Graph: Theory, Technology and System •

Fine-grained Semantic Knowledge Graph Enhanced Chinese OOV Word Embedding Learning

CHEN Shurui, LIANG Ziran, RAO Yanghui   

  1. School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China
  • Received: 2022-07-26  Revised: 2022-12-10  Online: 2023-03-15  Published: 2023-03-15
  • About author: CHEN Shurui, born in 1998, postgraduate. Her main research interests include word embedding learning and graph convolutional neural networks.
    RAO Yanghui, born in 1986, associate professor, is a member of the China Computer Federation. His main research interests include text data mining, representation learning and emotion detection.
  • Supported by: National Natural Science Foundation of China (61972426).

Abstract: With the expanding scope of informatization, text corpora in specific fields continue to appear in large numbers. Owing to security and sensitivity constraints, the corpora in these fields (e.g., medical record and communication corpora) are often small in scale, and it is difficult for traditional word embedding learning methods to obtain high-quality embeddings from them. On the other hand, when existing pre-trained language models are applied to such corpora directly, many out-of-vocabulary (OOV) words cannot be represented as vectors, which limits performance on downstream tasks. Many researchers have therefore begun to study how to infer the semantics of OOV words and obtain effective OOV word embeddings from fine-grained semantic information. However, current models that utilize fine-grained semantic information mainly focus on English corpora, and they model the relationships among fine-grained semantic units only through simple concatenation or mapping, which leads to poor model robustness. To address these problems, this paper first proposes to construct a fine-grained knowledge graph by exploiting Chinese word formation rules, such as the characters contained in Chinese words and the components and pinyin of those characters. The knowledge graph captures not only the relationship between Chinese characters and words, but also the multiple, complex relationships between pinyin and characters, between components and characters, and among other fine-grained semantic units. Next, a relational graph convolution operation is performed on the knowledge graph to model the deeper relationship between fine-grained semantics and word semantics. The method further mines the relationships among fine-grained semantic units through a sub-graph readout, so as to effectively infer the semantic information of Chinese OOV words. Experimental results show that the proposed model achieves better performance on specific corpora with a large proportion of OOV words when applied to tasks such as word analogy, word similarity, text classification, and named entity recognition.
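As a rough illustration of the pipeline described in the abstract (not the authors' released implementation), the PyTorch sketch below builds a toy fine-grained sub-graph for one OOV word, applies a single R-GCN-style relational graph convolution layer over three relation types, and infers the word's embedding with a mean sub-graph readout. The relation inventory, node features, embedding dimension, component decomposition, and the choice of mean readout are all assumptions made here for exposition.

```python
import torch
import torch.nn as nn

# Assumed relation inventory for the fine-grained graph (illustrative names).
REL_TYPES = ["word-char", "char-pinyin", "char-component"]

class RGCNLayer(nn.Module):
    """One R-GCN-style relational graph convolution layer:
    h_i' = ReLU(W_0 h_i + sum_r sum_{j in N_r(i)} (1/|N_r(i)|) W_r h_j)."""
    def __init__(self, dim, num_rels):
        super().__init__()
        self.self_loop = nn.Linear(dim, dim)
        self.rel = nn.ModuleList([nn.Linear(dim, dim, bias=False)
                                  for _ in range(num_rels)])

    def forward(self, h, edges):
        # h: (num_nodes, dim); edges: list of (src, rel_id, dst) triples.
        out = self.self_loop(h)
        for r, w_r in enumerate(self.rel):
            msg = torch.zeros_like(h)
            deg = torch.zeros(h.size(0), 1)
            for s, rel_id, d in edges:
                if rel_id == r:
                    msg[d] += w_r(h[s])   # message along relation r
                    deg[d] += 1.0
            out = out + msg / deg.clamp(min=1.0)  # mean-normalize per node
        return torch.relu(out)

# Toy sub-graph for the OOV word "电脑" (computer): node 0 is the word,
# nodes 1-2 its characters, nodes 3-4 their pinyin, node 5 a component of "脑".
nodes = ["电脑", "电", "脑", "dian4", "nao3", "月"]
dim = 16
h = torch.randn(len(nodes), dim)  # in practice: trained fine-grained embeddings

edges = [
    (1, 0, 0), (2, 0, 0),  # characters -> word
    (3, 1, 1), (4, 1, 2),  # pinyin -> character
    (5, 2, 2),             # component -> character
]

layer = RGCNLayer(dim, num_rels=len(REL_TYPES))
h = layer(h, edges)

# Sub-graph readout: pool the sub-graph to infer the OOV word embedding.
oov_embedding = h.mean(dim=0)  # mean readout is one simple, assumed choice
print(oov_embedding.shape)     # torch.Size([16])
```

In a full system, the character, pinyin, and component embeddings and the graph parameters would be trained jointly, e.g., by supervising the readout to match the known embeddings of in-vocabulary words before applying it to OOV words.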

Key words: Out-of-vocabulary word embedding learning, Chinese fine-grained semantic information, Fine-grained knowledge graph, Graph convolution network learning

CLC Number: TP311
[1] YU Jian, ZHAO Mankun, GAO Jie, WANG Congyuan, LI Yarong, ZHANG Wenbin. Study on Graph Neural Networks Social Recommendation Based on High-order and Temporal Features [J]. Computer Science, 2023, 50(3): 49-64.
[2] LIU Xinwei, TAO Chuanqi. Method of Java Redundant Code Detection Based on Static Analysis and Knowledge Graph [J]. Computer Science, 2023, 50(3): 65-71.
[3] HAN Jingyu, QIAN Long, GE Kang, MAO Yi. ECG Abnormality Detection Based on Label Co-occurrence and Feature Local Pertinence [J]. Computer Science, 2023, 50(3): 139-146.
[4] WANG Zhulin, WU Youxi, WANG Yuehua, LIU Jingyu. Mining Negative Sequential Patterns with Periodic Gap Constraints [J]. Computer Science, 2023, 50(3): 147-154.
[5] Peng XU, Jianxin ZHAO, Chi Harold LIU. Optimization and Deployment of Memory-Intensive Operations in Deep Learning Model on Edge [J]. Computer Science, 2023, 50(2): 3-12.
[6] WANG Jiwang, SHEN Liwei. Fine-grained Action Allocation and Scheduling Method for Dynamic Heterogeneous Tasks in Multi-robot Environments [J]. Computer Science, 2023, 50(2): 244-253.
[7] MA Qican, WU Zehui, WANG Yunchao, WANG Xinlei. Approach of Web Application Access Control Vulnerability Detection Based on State Deviation Analysis [J]. Computer Science, 2023, 50(2): 346-352.
[8] LIANG Jiali, HUA Baojian, SU Shaobo. Tensor Instruction Generation Optimization Fusing with Loop Partitioning [J]. Computer Science, 2023, 50(2): 374-383.
[9] WANG Yitan, WANG Yishu, YUAN Ye. Survey of Learned Index [J]. Computer Science, 2023, 50(1): 1-8.
[10] SHAN Zhongyuan, YANG Kai, ZHAO Junfeng, WANG Yasha, XU Yongxin. Ontology-Schema Mapping Based Incremental Entity Model Construction and Evolution Approach of Knowledge Graph [J]. Computer Science, 2023, 50(1): 18-24.
[11] LU Mingchen, LYU Yanqi, LIU Ruicheng, JIN Peiquan. Fast Storage System for Time-series Big Data Streams Based on Waterwheel Model [J]. Computer Science, 2023, 50(1): 25-33.
[12] JIAO Tianzhe, HE Hongyan, ZHANG Zexin, SONG Jie. Study on Big Graph Traversals for Storage Medium Optimization [J]. Computer Science, 2023, 50(1): 34-40.
[13] MENG Yiyue, PENG Rong, LYU Qibiao. Text Material Recommendation Method Combining Label Classification and Semantic Query Expansion [J]. Computer Science, 2023, 50(1): 76-86.
[14] HUANG Yuzhou, WANG Lisong, QIN Xiaolin. Bi-level Path Planning Method for Unmanned Vehicle Based on Deep Reinforcement Learning [J]. Computer Science, 2023, 50(1): 194-204.
[15] LI Bei, WU Hao, HE Xiaowei, WANG Bin, XU Ergang. Survey of Storage Scalability in Blockchain Systems [J]. Computer Science, 2023, 50(1): 318-333.