Computer Science ›› 2020, Vol. 47 ›› Issue (4): 178-183. doi: 10.11896/jsjkx.190600149

• Artificial Intelligence •

Truncated Gaussian Distance-based Self-attention Mechanism for Natural Language Inference

ZHANG Peng-fei, LI Guan-yu, JIA Cai-yan   

  1. School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China; 2. Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University, Beijing 100044, China
  • Received:2019-06-26 Online:2020-04-15 Published:2020-04-15
  • Corresponding author: JIA Cai-yan (cyjia@bjtu.edu.cn)
  • Supported by:
    National Natural Science Foundation of China (61876016); Fundamental Research Funds for the Central Universities (2019JBZ110)

Truncated Gaussian Distance-based Self-attention Mechanism for Natural Language Inference

ZHANG Peng-fei, LI Guan-yu, JIA Cai-yan   

  1. School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China; 2. Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University, Beijing 100044, China
  • Received:2019-06-26 Online:2020-04-15 Published:2020-04-15
  • Contact: JIA Cai-yan, born in 1976, professor. Her main research interests include data mining, social computing and natural language processing.
  • About author: ZHANG Peng-fei, born in 1995, postgraduate. His main research interests include natural language inference and rumor detection.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61876016) and the Fundamental Research Funds for the Central Universities (2019JBZ110).

Abstract: In natural language understanding tasks, the attention mechanism has drawn wide interest because it can effectively capture how important a word is in its context and thereby improve the effectiveness of these tasks. Transformer, a non-recurrent deep network based on attention mechanisms, not only achieves state-of-the-art performance on machine translation with very few parameters and little training time, but has also obtained remarkable results in tasks such as natural language inference (Gaussian-Transformer) and word representation learning (BERT). Gaussian-Transformer is currently one of the best-performing methods for natural language inference. However, although introducing a Gaussian prior distribution into Transformer to encode word position information greatly strengthens the importance of neighboring words, the importance of non-neighboring words under the Gaussian distribution quickly tends to zero, so the influence of non-neighboring words that matter for the representation of the current word fades away as the distance grows. Therefore, this paper proposes a self-attention mechanism based on a truncated Gaussian distance distribution for natural language inference, which not only highlights the importance of neighboring words but also preserves the information of non-neighboring words that are important to the representation of the current word. Experimental results on the natural language inference benchmark datasets SNLI and MultiNLI confirm that the truncated Gaussian distance self-attention mechanism extracts the relative position information of words in a sentence more effectively.

Key words: Truncated Gaussian mask, Distance mask, Natural language inference, Self-attention mechanism

Abstract: In the task of natural language inference, attention mechanisms have attracted much attention because they can effectively capture the importance of words in context and improve the effectiveness of natural language inference tasks. Transformer, a deep feedforward network model based solely on attention mechanisms, not only achieves state-of-the-art performance on machine translation with far fewer parameters and much less training time, but also achieves remarkable results in tasks such as natural language inference (Gaussian-Transformer) and word representation learning (BERT). Moreover, Gaussian-Transformer has become one of the best methods for natural language inference tasks. However, although the Gaussian prior distribution used in Transformer to weight the positional importance of words greatly increases the importance of adjacent words, the importance of non-neighboring words under the Gaussian distribution quickly approaches zero, so the influence of non-neighboring words that play an important role in the representation of the current word disappears as the distance grows. Therefore, this paper proposes a position weighting method based on a self-attention mechanism with a truncated Gaussian distance distribution for natural language inference. This method not only highlights the importance of neighboring words, but also preserves the non-neighboring words that are important to the representation of the current word. The experimental results on the natural language inference benchmark datasets SNLI and MultiNLI confirm the effectiveness of the truncated Gaussian distance distribution used in the self-attention mechanism for extracting the relative position information of words in sentences.

Key words: Truncated Gaussian distance mask, Distance mask, Natural language inference, Self-attention mechanism
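
To make the mechanism described in the abstract concrete, the following NumPy sketch shows one plausible reading of a truncated Gaussian distance mask folded into self-attention scores: a Gaussian weight over the token distance |i - j| is clipped from below so that distant but relevant words keep a non-zero weight. The function names, the variance sigma and the clipping threshold floor are illustrative assumptions, not the authors' implementation.

# Minimal sketch, assuming a single attention head and a lower-bounded
# (truncated) Gaussian weight over token distance; sigma and floor are
# hypothetical hyperparameters, not values from the paper.
import numpy as np

def truncated_gaussian_mask(seq_len: int, sigma: float = 2.0, floor: float = 0.1) -> np.ndarray:
    """Gaussian decay over relative distance |i - j|, truncated at a small floor value."""
    pos = np.arange(seq_len)
    dist = np.abs(pos[:, None] - pos[None, :])            # |i - j| for every token pair
    gauss = np.exp(-(dist ** 2) / (2.0 * sigma ** 2))     # standard Gaussian decay
    return np.maximum(gauss, floor)                       # truncation keeps distant words alive

def masked_self_attention(x: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Scaled dot-product self-attention with the positional mask folded into the scores."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                          # toy setup: query = key = value = x
    scores = scores + np.log(mask)                         # same as multiplying softmax inputs by mask
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ x

# Toy usage: 6 tokens with 8-dimensional embeddings.
x = np.random.randn(6, 8)
out = masked_self_attention(x, truncated_gaussian_mask(seq_len=6))
print(out.shape)  # (6, 8)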

CLC number: 

  • TP181
[1]HERMANN K M,KOCISKY T,GREFENSTETTE E,et al.Teaching machines to read and comprehend[C]//Neural Information Processing Systems.2015:1693-1701.
[2]DU X,SHAO J,CARDIE C,et al.Learning to Ask:Neural Question Generation for Reading Comprehension[C]//Meeting of the Association for Computational Linguistics.2017:1342-1352.
[3]LAN W,XU W.Neural Network Models for Paraphrase Identification,Semantic Textual Similarity,Natural Language Inference,and Question Answering[C]//International Conference on Computational Linguistics.2018:3890-3902.
[4]VASWANI A,SHAZEER N,PARMAR N,et al.Attention is All you Need[C]//Neural Information Processing Systems.2017:5998-6008.
[5]HE K,ZHANG X,REN S,et al.Deep Residual Learning for Image Recognition[C]//Computer Vision and Pattern Recognition.2016:770-778.
[6]DEVLIN J,CHANG M,LEE K,et al.BERT:Pre-training of Deep Bidirectional Transformers for Language Understanding[C]//North American Chapter of the Association for Computational Linguistics.2019:4171-4186.
[7]SHEN T,ZHOU T,LONG G,et al.DiSAN:Directional Self-Attention Network for RNN/CNN-Free Language Understanding[C]//National Conference on Artificial Intelligence.2018:5446-5455.
[8]BOWMAN S R,ANGELI G,POTTS C,et al.A large annotated corpus for learning natural language inference[C]//Empirical Methods in Natural Language Processing.2015:632-642.
[9]IM J,CHO S.Distance-based Self-Attention Network for Natural Language Inference[J].arXiv:1712.02047,2017.
[10]GUO M,ZHANG Y,LIU T,et al.Gaussian Transformer:a Lightweight Approach for Natural Language Inference[C]//National Conference on Artificial Intelligence.2019:6489-6496.
[11]KLAMBAUER G,UNTERTHINER T,MAYR A,et al.Self-Normalizing Neural Networks[C]//Neural Information Processing Systems.2017:971-980.
[12]PENNINGTON J,SOCHER R,MANNING C D,et al.Glove:Global Vectors for Word Representation[C]//Empirical Methods in Natural Language Processing.2014:1532-1543.
[13]MIKOLOV T,CHEN K,CORRADO G S,et al.Efficient Estimation of Word Representations in Vector Space[J].arXiv:1301.3781,2013.
[14]BOJANOWSKI P,GRAVE E,JOULIN A,et al.Enriching Word Vectors with Subword Information[J].Transactions of the Association for Computational Linguistics,2017,5(1):135-146.
[15]CHEN Q,ZHU X,LING Z,et al.Recurrent Neural Network-Based Sentence Encoder with Gated Attention for Natural Language Inference[C]//Workshop on Evaluating Vector Space Representations for NLP.2017:36-40.
[16]WILLIAMS A,NANGIA N,BOWMAN S R,et al.A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference[C]//North American Chapter of the Association for Computational Linguistics.2018:1112-1122.
[17]KINGMA D P,BA J.Adam:A Method for Stochastic Optimization[J].arXiv:1412.6980,2014.
[18]ZEILER M D.ADADELTA:An Adaptive Learning Rate Method[J].arXiv:1212.5701,2012.
[19]SRIVASTAVA N,HINTON G E,KRIZHEVSKY A,et al.Dropout:a simple way to prevent neural networks from overfitting[J].Journal of Machine Learning Research,2014,15(1):1929-1958.
[20]ABADI M,AGARWAL A,BARHAM P,et al.TensorFlow:Large-Scale Machine Learning on Heterogeneous Distributed Systems[J].arXiv:1603.04467,2016.