Computer Science ›› 2020, Vol. 47 ›› Issue (3): 162-173.doi: 10.11896/jsjkx.191000167

• Artificial Intelligence •

Survey of Natural Language Processing Pre-training Techniques

LI Zhou-jun,FAN Yu,WU Xian-jie   

  1. (School of Computer Science and Engineering, Beihang University, Beijing 100191, China)
  • Received:2019-09-25 Online:2020-03-15 Published:2020-03-30
  • About author:LI Zhou-jun,born in 1963,is a professor and doctoral supervisor at the School of Computer Science and Engineering,Beihang University.He is currently a member of the Cyberspace Security Discipline Review Group of the Academic Degrees Committee of the State Council,an executive director of the China Cyberspace Security Association,deputy director of the Language Intelligence Committee of the China Artificial Intelligence Society,and a member of ACM,IEEE and AAAI.His research focuses on artificial intelligence and natural language processing,including intelligent question answering,semantic analysis,information extraction and OCR.He has published more than 300 academic papers in SCI journals such as TKDE and TIFS and at top international conferences such as AAAI,IJCAI,ACL and EMNLP,and won the ECIR 2010 Best Paper Award.Teams under his direction have won several championships in artificial intelligence and cybersecurity competitions in China and abroad.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (U1636211, 61672081), the Fund of the State Key Laboratory of Software Development Environment (SKLSDE-2019ZX-17) and the Beijing Advanced Innovation Center for Imaging Technology (BAICIT-2016001).

Abstract: In recent years,with the rapid development of deep learning,pre-training technology for natural language processing has made great progress.In the early days of natural language processing,word embedding methods such as Word2Vec were used to encode text;these methods can be regarded as static pre-training techniques.However,such context-independent text representations have limitations and cannot resolve polysemy.The ELMo pre-training language model introduced a context-dependent representation that handles polysemy effectively.Later,GPT,BERT and other pre-training language models were proposed.The BERT model in particular significantly improved performance on many typical downstream tasks,greatly advancing the field of natural language processing and thus opening the era of dynamic pre-training.Since then,a number of pre-training language models,such as BERT-based improved models and XLNet,have emerged,and pre-training has become an indispensable mainstream technique in natural language processing.This paper first briefly introduces pre-training technology and its development history,and then reviews the classic pre-training techniques in natural language processing,including the early static pre-training techniques and the classic dynamic pre-training techniques.It then briefly reviews a series of more recent,inspiring pre-training techniques,including BERT-based models and XLNet.On this basis,the paper analyzes the problems faced by current pre-training technology.Finally,future development trends of pre-training technologies are discussed.
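To make the contrast between static word embeddings and contextual (dynamic) representations concrete, the short Python sketch below compares the vector assigned to the polysemous word "bank" in two different sentences. It is an illustration only, not part of the original article: it assumes the open-source Hugging Face transformers library and the public bert-base-uncased checkpoint. A static embedding table in the Word2Vec style would return one fixed vector for "bank" in both sentences, whereas the contextual model yields clearly different vectors.

import torch
from transformers import AutoModel, AutoTokenizer

# Assumed public checkpoint; any contextual encoder would illustrate the same point.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def bank_vector(sentence):
    # Return the contextual hidden state of the token "bank" in the sentence.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (sequence_length, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bank")]

v1 = bank_vector("She deposited the money at the bank.")
v2 = bank_vector("They sat on the bank of the river.")

# A static Word2Vec-style lookup table assigns one vector per word type, so both
# occurrences of "bank" would be identical; the contextual vectors differ.
similarity = torch.nn.functional.cosine_similarity(v1, v2, dim=0)
print("cosine similarity of the two 'bank' vectors:", round(similarity.item(), 3))

The printed similarity being noticeably below 1.0 is exactly the property that static embeddings cannot provide, and it is what the survey refers to as handling polysemy through context-dependent representations.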

Key words: Language model, Natural language processing, Pre-training, Word embedding

CLC Number: TP391
[1]HE K,ZHANG X,REN S,et al.Deep residual learning for image recognition[C]∥Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2016:770-778.
[2]MIKOLOV T,CHEN K,CORRADO G,et al.Efficient estimation of word representations in vector space[J].arXiv:1301.3781.
[3]MIKOLOV T,SUTSKEVER I,CHEN K,et al.Distributed representations of words and phrases and their compositionality[C]∥Advances in Neural Information Processing Systems.2013:3111-3119.
[4]ABADI M,BARHAM P,CHEN J,et al.Tensorflow:a system for large-scale machine learning[J].arXiv:1605.08695.
[5]LE Q,MIKOLOV T.Distributed representations of sentences and documents[C]∥International Conference on Machine Learning.2014:1188-1196.
[6]DENG L,YU D.Deep learning:methods and applications[J].Foundations and Trends in Signal Processing,2014,7(3/4):197-387.
[7]PETERS M E,NEUMANN M,IYYER M,et al.Deep contextualized word representations[J].arXiv:1802.05365.
[8]RADFORD A,NARASIMHAN K,SALIMANS T,et al.Improving language understanding by generative pre-training[J/OL].https://s3-us-west-2.amazonaws.com/openai-assets/researchcovers/languageunsupervised/language_understanding_paper.pdf,2018.
[9]DEVLIN J,CHANG M W,LEE K,et al.Bert:Pre-training of deep bidirectional transformers for language understanding[J].arXiv:1810.04805.
[10]YANG Z,DAI Z,YANG Y,et al.XLNet:Generalized Autoregressive Pretraining for Language Understanding[J].arXiv:1906.08237.
[11]YOSINSKI J,CLUNE J,BENGIO Y,et al.How transferable are features in deep neural networks?[C]∥Advances in Neural Information Processing Systems.2014:3320-3328.
[12]OQUAB M,BOTTOU L,LAPTEV I,et al.Learning and transferring mid-level image representations using convolutional neural networks[C]∥Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2014:1717-1724.
[13]GLOROT X,BORDES A,BENGIO Y.Domain adaptation for large-scale sentiment classification:A deep learning approach[C]∥Proceedings of the 28th International Conference on Machine Learning (ICML-11).2011:513-520.
[14]CHEN M,XU Z,WEINBERGER K,et al.Marginalized denoising autoencoders for domain adaptation[J].arXiv:1206.4683.
[15]GANIN Y,USTINOVA E,AJAKAN H,et al.Domain-adversarial training of neural networks[J].The Journal of Machine Learning Research,2016,17(1):2096-2030.
[16]SZEGEDY C,IOFFE S,VANHOUCKE V,et al.Inception-v4,inception-resnet and the impact of residual connections on learning[C]∥AAAI.2017:12.
[17]WU Z,SHEN C,HENGEL A V D.Wider or Deeper:Revisiting the ResNet Model for Visual Recognition[J].arXiv:1611.10080.
[18]SINGH S,HOIEM D,FORSYTH D.Swapout:Learning an ensemble of deep architectures[C]∥Advances in Neural Information Processing Systems.2016:28-36.
[19]REN S,HE K,GIRSHICK R,et al.Faster r-cnn:Towards real-time object detection with region proposal networks[C]∥Advances in Neural Information Processing Systems.2015:91-99.
[20]HUANG G,LIU Z,VAN DER MAATEN L,et al.Densely connected convolutional networks[J].arXiv:1608.06993.
[21]HE K,ZHANG X,REN S,et al.Identity mappings in deep residual networks[C]∥European Conference on Computer Vision.Cham:Springer,2016:630-645.
[22]LEDIG C,THEIS L,HUSZÁR F,et al.Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network[J].arXiv:1609.04802.
[23]PETERS M,AMMAR W,BHAGAVATULA C,et al.Semi-supervised sequence tagging with bidirectional language models[C]∥Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.2017:1756-1765.
[24]KIROS R,ZHU Y,SALAKHUTDINOV R R,et al.Skip-thought vectors[C]∥Advances in Neural Information Processing Systems.2015:3294-3302.
[25]VINCENT P,LAROCHELLE H,BENGIO Y,et al.Extracting and composing robust features with denoising autoencoders[C]∥Proceedings of the 25th International Conference on Machine Learning.ACM,2008:1096-1103.
[26]BENGIO Y,DUCHARME R,VINCENT P,et al.A neural probabilistic language model[J].Journal of Machine Learning Research,2003,3(6):1137-1155.
[27]PENNINGTON J,SOCHER R,MANNING C.Glove:Global vectors for word representation[C]∥Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).2014:1532-1543.
[28]JOULIN A,GRAVE E,BOJANOWSKI P,et al.Bag of Tricks for Efficient Text Classification[J].arXiv:1607.01759.
[29]CHEN D,MANNING C.A fast and accurate dependency parser using neural networks[C]∥Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).2014:740-750.
[30]BORDES A,USUNIER N,GARCIA-DURAN A,et al.Translating embeddings for modeling multi-relational data[C]∥Advances in Neural Information Processing Systems.2013:2787-2795.
[31]TAI K S,SOCHER R,MANNING C D.Improved semantic representations from tree-structured long short-term memory networks[J].arXiv:1503.00075.
[32]GROVER A,LESKOVEC J.node2vec:Scalable feature learning for networks[C]∥Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.ACM,2016:855-864.
[33]TANG J,QU M,WANG M,et al.Line:Large-scale information network embedding[C]∥Proceedings of the 24th International Conference on World Wide Web.International World Wide Web Conferences Steering Committee.2015:1067-1077.
[34]NICKEL M,KIELA D.Poincaré embeddings for learning hierarchical representations[C]∥Advances in Neural Information Processing Systems.2017:6338-6347.
[35]KAHNG M,ANDREWS P Y,KALRO A,et al.ActiVis:Visual exploration of industry-scale deep neural network models[J].IEEE Transactions on Visualization and Computer Graphics,2018,24(1):88-97.
[36]YANG X,MACDONALD C,OUNIS I.Using word embeddings in twitter election classification[J].Information Retrieval Journal,2018,21(2/3):183-207.
[37]MNIH A,HINTON G.Three new graphical models for statistical language modelling[C]∥Proceedings of the 24th International Conference on Machine Learning.ACM,2007:641-648.
[38]MNIH A,HINTON G E.A scalable hierarchical distributed language model[C]∥Advances in Neural Information Processing Systems.2009:1081-1088.
[39]COLLOBERT R,WESTON J,BOTTOU L,et al.Natural language processing (almost) from scratch[J].Journal of Machine Learning Research,2011,12(1):2493-2537.
[40]MIKOLOV T,KARAFIÁT M,BURGET L,et al.Recurrent neural network based language model[C]∥Eleventh Annual Conference of the International Speech Communication Association.2010.
[41]GUTMANN M U,HYVÄRINEN A.Noise-contrastive estimation of unnormalized statistical models,with applications to natural image statistics[J].Journal of Machine Learning Research,2012,13:307-361.
[42]DEERWESTER S,DUMAIS S T,FURNAS G W,et al.Indexing by latent semantic analysis[J].Journal of the American Society for Information Science,1990,41(6):391-407.
[43]GOLUB G H,REINSCH C.Singular value decomposition and least squares solutions[M]∥Linear Algebra.Berlin:Springer,1971:134-151.
[44]HARRIS Z S.Distributional structure[J].Word,1954,10(2/3):146-162.
[45]JOZEFOWICZ R,VINYALS O,SCHUSTER M,et al.Exploring the limits of language modeling[J].arXiv:1602.02410.
[46]HOWARD J,RUDER S.Universal language model fine-tuning for text classification[J].arXiv:1801.06146.
[47]VASWANI A,SHAZEER N,PARMAR N,et al.Attention is all you need[C]∥Advances in Neural Information Processing Systems.2017:5998-6008.
[48]LIU P J,SALEH M,POT E,et al.Generating wikipedia by summarizing long sequences[J].arXiv:1801.10198.
[49]SONG K,TAN X,QIN T,et al.Mass:Masked sequence to sequence pre-training for language generation[J].arXiv:1905.02450.
[50]DONG L,YANG N,WANG W,et al.Unified Language Model Pre-training for Natural Language Understanding and Generation[J].arXiv:1905.03197.
[51]SUN Y,WANG S,LI Y,et al.ERNIE:Enhanced Representation through Knowledge Integration[J].arXiv:1904.09223.
[52]ZHANG Z,HAN X,LIU Z,et al.ERNIE:Enhanced Language Representation with Informative Entities[J].arXiv:1905.07129.
[53]LIU X,HE P,CHEN W,et al.Multi-task deep neural networks for natural language understanding[J].arXiv:1901.11504.
[54]SUN Y,WANG S,LI Y,et al.Ernie 2.0:A continual pre-training framework for language understanding[J].arXiv:1907.12412.
[55]HINTON G,VINYALS O,DEAN J.Distilling the knowledge in a neural network[J].arXiv:1503.02531.
[56]CUI Y,CHE W,LIU T,et al.Pre-Training with Whole Word Masking for Chinese BERT[J].arXiv:1906.08101.
[57]JOSHI M,CHEN D,LIU Y,et al.SpanBERT:Improving pre-training by representing and predicting spans[J].arXiv:1907.10529.
[58]LIU Y,OTT M,GOYAL N,et al.Roberta:A robustly optimized BERT pretraining approach[J].arXiv:1907.11692.
[59]RADFORD A,WU J,CHILD R,et al.Language models are unsupervised multitask learners[J].OpenAI Blog,2019,1(8).
[60]DAI Z,YANG Z,YANG Y,et al.Transformer-xl:Attentive language models beyond a fixed-length context[J].arXiv:1901.02860.
[61]NIVEN T,KAO H Y.Probing neural network comprehension of natural language arguments[J].arXiv:1907.07355.
[62]MCCOY R T,PAVLICK E,LINZEN T.Right for the Wrong Reasons:Diagnosing Syntactic Heuristics in Natural Language Inference[J].arXiv:1902.01007.
[63]WOLF T,DEBUT L,SANH V,et al.Transformers:State-of-the-art Natural Language Processing[J].arXiv:1910.03771.
[64]Bright.GitHub repository[OL].https://github.com/brightmart/albert_zh.
[65]LAN Z,CHEN M,GOODMAN S,et al.ALBERT:A Lite BERT for Self-supervised Learning of Language Representations[J].arXiv:1909.11942.