Computer Science, 2020, Vol. 47, Issue (3): 162-173. doi: 10.11896/jsjkx.191000167

• Artificial Intelligence •

Survey of Natural Language Processing Pre-training Techniques

LI Zhou-jun, FAN Yu, WU Xian-jie

  1. (School of Computer Science and Engineering, Beihang University, Beijing 100191, China)
  • Received: 2019-09-25 Online: 2020-03-15 Published: 2020-03-30
  • About author: LI Zhou-jun, born in 1963, is a professor and doctoral supervisor at the School of Computer Science and Engineering, Beihang University. He is currently a member of the Cyberspace Security Discipline Review Group of the Academic Degrees Committee of the State Council, the executive director of the China Cyberspace Security Association, the deputy director of the Language Intelligence Committee of the China Artificial Intelligence Society, and a member of the ACM, IEEE, and AAAI. His research focuses on artificial intelligence and natural language processing, including intelligent question answering, semantic analysis, information extraction, and OCR. He has published more than 300 academic papers in SCI journals such as TKDE and TIFS and at top international conferences such as AAAI, IJCAI, ACL, and EMNLP, and won the ECIR 2010 Best Paper Award. The team he directs has won several championships in artificial intelligence and cybersecurity competitions at home and abroad.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (U1636211, 61672081), the Fund of the State Key Laboratory of Software Development Environment (SKLSDE-2019ZX-17), and the Beijing Advanced Innovation Center for Imaging Technology (BAICIT-2016001).

Abstract: In recent years, with the rapid development of deep learning, pre-training techniques for natural language processing have made great progress. In the early days of natural language processing, word embedding methods such as Word2Vec were used to encode text. These word embedding methods can be regarded as static pre-training techniques. However, such context-independent text representations are limited and cannot solve the polysemy problem. The ELMo pre-trained language model provides a context-dependent representation that handles polysemy effectively. Later, pre-trained language models such as GPT and BERT were proposed. BERT in particular significantly improved performance on many typical downstream tasks, greatly advanced the field of natural language processing, and ushered in the age of dynamic pre-training. Since then, a number of pre-trained language models, such as BERT-based improved models and XLNet, have emerged, and pre-training has become an indispensable mainstream technique in natural language processing. This paper first briefly introduces pre-training and its development history, and then reviews the classic pre-training techniques in natural language processing, including the early static pre-training techniques and the classic dynamic pre-training techniques. It then briefly surveys a series of inspiring pre-training techniques, including BERT-based models and XLNet. On this basis, the paper analyzes the problems faced by current pre-training techniques. Finally, it discusses future development trends of pre-training techniques.

Key words: Natural language processing, Pre-training, Word embedding, Language model

CLC Number: TP391