Computer Science, 2020, Vol. 47, Issue (4): 178-183. doi: 10.11896/jsjkx.190600149

• Artificial Intelligence •

Truncated Gaussian Distance-based Self-attention Mechanism for Natural Language Inference

ZHANG Peng-fei, LI Guan-yu, JIA Cai-yan   

  1. School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China; Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University, Beijing 100044, China
  • Received: 2019-06-26 Online: 2020-04-15 Published: 2020-04-15
  • Contact: JIA Cai-yan, born in 1976, professor. Her main research interests include data mining, social computing and natural language processing.
  • About author: ZHANG Peng-fei, born in 1995, postgraduate. His main research interests include natural language inference and rumor detection.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61876016) and Fundamental Research Funds for the Central Universities (2019JBZ110).

Abstract: In natural language inference, attention mechanisms have attracted considerable interest because they effectively capture the importance of words in context and thereby improve inference performance. Transformer, a deep feedforward network model based solely on attention mechanisms, not only achieves state-of-the-art performance on machine translation with far fewer parameters and much less training time, but also achieves remarkable results in tasks such as natural language inference (Gaussian-Transformer) and word representation learning (BERT). Moreover, Gaussian-Transformer has become one of the best methods for natural language inference. However, the Gaussian prior distribution that Gaussian-Transformer uses to weight the positional importance of words has a drawback: although it greatly increases the importance of adjacent words, the weight of non-neighboring words quickly decays to 0 under the Gaussian distribution, so the influence of non-neighboring words that play an important role in representing the current word vanishes as the distance grows. Therefore, this paper proposes a position weighting method for natural language inference based on a self-attention mechanism with a clipped Gaussian distance distribution. This method not only highlights the importance of neighboring words, but also preserves the non-neighboring words that are important to the representation of the current word. Experimental results on the natural language inference benchmark datasets SNLI and MultiNLI confirm the effectiveness of the clipped Gaussian distance distribution in the self-attention mechanism for extracting the relative position information of words in sentences.
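
The contrast described in the abstract can be illustrated with a small sketch. The abstract does not state the exact formula, so everything below is an assumption made for illustration: the positional prior is taken to be an additive log-Gaussian bias on the attention logits, and "clipping" is taken to mean that this bias is never allowed to fall below a fixed floor. The names `gaussian_bias`, `clipped_gaussian_bias`, `sigma` and `floor` are hypothetical and are not taken from the paper.

```python
import math


def gaussian_bias(n, sigma=1.0):
    """Additive bias -(i - j)^2 / (2 * sigma^2) for every pair of positions.

    Assumption: the Gaussian distance prior enters the attention logits
    as the log of an unnormalised Gaussian over the distance |i - j|.
    """
    return [[-((i - j) ** 2) / (2.0 * sigma ** 2) for j in range(n)]
            for i in range(n)]


def clipped_gaussian_bias(n, sigma=1.0, floor=-2.0):
    """Same bias, but bounded below by `floor` (hypothetical clipping rule)."""
    return [[max(b, floor) for b in row] for row in gaussian_bias(n, sigma)]


def softmax(row):
    """Numerically stable softmax over one list of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(scores, bias):
    """Row-wise softmax over content scores plus the positional bias."""
    return [softmax([s + b for s, b in zip(s_row, b_row)])
            for s_row, b_row in zip(scores, bias)]


n = 8
# Uniform content scores, so that only the positional prior shapes the weights.
scores = [[0.0] * n for _ in range(n)]
plain = attention_weights(scores, gaussian_bias(n))
clipped = attention_weights(scores, clipped_gaussian_bias(n))

# Weight that position 0 assigns to the most distant position under each prior.
print("plain:  ", plain[0][n - 1])
print("clipped:", clipped[0][n - 1])
```

Under these assumptions, the plain Gaussian bias drives the weight of the most distant word toward zero, while the clipped bias leaves it a small but non-negligible share; the nearest positions still receive the largest weights in both cases.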

Key words: Natural language inference, Self-attention mechanism, Distance mask, Clipped Gaussian distance mask

CLC Number: 

  • TP181