Computer Science ›› 2023, Vol. 50 ›› Issue (3): 72-82. doi: 10.11896/jsjkx.220700249

• Special Issue on Knowledge Graph Empowered Knowledge Engineering: Theory, Technology and Systems •


Fine-grained Semantic Knowledge Graph Enhanced Chinese OOV Word Embedding Learning

CHEN Shurui, LIANG Ziran, RAO Yanghui   

  1. School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China
  • Received: 2022-07-26 Revised: 2022-12-10 Online: 2023-03-15 Published: 2023-03-15
  • Corresponding author: RAO Yanghui (raoyangh@mail.sysu.edu.cn)
  • About author: CHEN Shurui, born in 1998, postgraduate (chenshr8@mail3.sysu.edu.cn). Her main research interests include word embedding learning and graph convolutional neural networks.
    RAO Yanghui, born in 1986, associate professor, is a member of China Computer Federation. His main research interests include text data mining, representation learning and emotion detection.
  • Supported by:
    General Program of the National Natural Science Foundation of China (61972426).


Abstract: As informatization reaches into more and more fields, text corpora in many specific domains continue to emerge. Owing to security and sensitivity constraints, the corpora in these domains (e.g., medical records and communication data) are often small in scale, and traditional word embedding learning methods struggle to obtain high-quality embeddings from them. On the other hand, when existing pre-trained language models are applied directly, such corpora contain many out-of-vocabulary (OOV) words that cannot be represented as vectors, which limits performance on downstream tasks. Many researchers have therefore studied how to infer the semantics of OOV words and obtain effective OOV word embeddings from fine-grained semantic information. However, current models that exploit fine-grained semantic information mainly target English corpora, and they model the relationships among fine-grained semantic units only through simple concatenation or mapping, which leads to poor robustness in Chinese OOV word embedding learning. To address these problems, this paper first constructs a fine-grained knowledge graph by exploiting Chinese word-formation rules, i.e., the characters contained in Chinese words, as well as the components and pinyin of the characters. The knowledge graph captures not only the relationship between Chinese characters and words, but also the multiple, complex relationships between pinyin and characters, components and characters, and other fine-grained semantic units. Next, a relational graph convolution operation is performed on the knowledge graph to model the deeper relationships between fine-grained semantics and word semantics. The method further mines the compositional relationships between fine-grained semantics and word semantics through a sub-graph readout, so as to effectively infer the semantic information of Chinese OOV words. Experimental results show that the proposed model achieves better performance on word analogy, word similarity, text classification, and named entity recognition tasks over specific corpora with a large proportion of OOV words.
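To make the pipeline above concrete, here is a minimal sketch, assuming toy decomposition tables and a NumPy stand-in for the learned model: it builds the fine-grained sub-graph of one word (word-character, character-component, character-pinyin edges), applies one relational graph convolution step in the style of R-GCN, and infers the OOV word embedding with a mean-pooling readout over the sub-graph. All names (build_fine_grained_kg, rgcn_layer, CHAR_COMPONENTS, CHAR_PINYIN) are hypothetical illustrations, not the authors' released code.

import numpy as np

# Toy decomposition tables (hypothetical; real tables would come from
# dictionaries of character components and pinyin).
CHAR_COMPONENTS = {"想": ["相", "心"], "念": ["今", "心"]}
CHAR_PINYIN = {"想": "xiang3", "念": "nian4"}

def build_fine_grained_kg(word):
    """Collect one word's sub-graph: nodes plus typed (head, relation, tail) edges."""
    nodes, edges = [word], []
    for ch in word:
        nodes.append(ch)
        edges.append((word, "contains", ch))
        for comp in CHAR_COMPONENTS.get(ch, []):
            nodes.append(comp)
            edges.append((ch, "has_component", comp))
        if ch in CHAR_PINYIN:
            nodes.append(CHAR_PINYIN[ch])
            edges.append((ch, "has_pinyin", CHAR_PINYIN[ch]))
    return list(dict.fromkeys(nodes)), edges  # de-duplicate shared components

def rgcn_layer(h, edges, node_idx, W0, Wr):
    """One relation-aware convolution: h_i' = ReLU(W0 h_i + sum_r mean_{j in N_r(i)} Wr[r] h_j)."""
    out = h @ W0.T  # self-loop term
    for rel, W in Wr.items():
        nbrs = {}
        for head, r, tail in edges:
            if r == rel:
                nbrs.setdefault(node_idx[head], []).append(node_idx[tail])
        for i, js in nbrs.items():
            out[i] += np.mean(h[js], axis=0) @ W.T  # mean-normalized relation term
    return np.maximum(out, 0.0)

rng = np.random.default_rng(0)
nodes, edges = build_fine_grained_kg("想念")
idx = {n: i for i, n in enumerate(nodes)}
dim = 8
h = rng.normal(size=(len(nodes), dim))  # stand-in initial node features
W0 = rng.normal(size=(dim, dim))
Wr = {r: rng.normal(size=(dim, dim)) for r in ("contains", "has_component", "has_pinyin")}
h = rgcn_layer(h, edges, idx, W0, Wr)
oov_embedding = h[1:].mean(axis=0)  # sub-graph readout over the fine-grained nodes
print(oov_embedding.shape)  # -> (8,)

In the actual model, node features, relation weights, and the readout would be learned end-to-end; the mean pooling above is only a placeholder for the sub-graph readout described in the abstract.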

Key words: Out-of-vocabulary word embedding learning, Chinese fine-grained semantic information, Fine-grained knowledge graph, Graph convolution network learning
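As a complementary illustration of the word-similarity evaluation named in the abstract, the sketch below follows the common protocol of scoring word pairs by cosine similarity of inferred embeddings and reporting Spearman correlation against human ratings; embed, pairs, and human_scores are placeholders for the OOV inference sketched above and a rated word-pair benchmark, not artifacts of the paper.

import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    # Cosine similarity between two 1-D embedding vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def word_similarity_eval(embed, pairs, human_scores):
    # embed(word) -> vector, e.g., the graph-based OOV inference sketched above.
    model_scores = [cosine(embed(a), embed(b)) for a, b in pairs]
    return spearmanr(model_scores, human_scores).correlation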

CLC Number: TP311