计算机科学 (Computer Science) ›› 2020, Vol. 47 ›› Issue (3): 162-173. doi: 10.11896/jsjkx.191000167

• Artificial Intelligence •

Survey of Natural Language Processing Pre-training Techniques

LI Zhou-jun,FAN Yu,WU Xian-jie   

  1. (School of Computer Science and Engineering, Beihang University, Beijing 100191, China)
  • Received: 2019-09-25  Online: 2020-03-15  Published: 2020-03-30
  • Corresponding author: LI Zhou-jun (lizj@buaa.edu.cn)
  • About author: LI Zhou-jun, born in 1963, is a professor and doctoral supervisor in the School of Computer Science and Engineering at Beihang University. He is currently a member of the Cyberspace Security Discipline Review Group of the Academic Degrees Committee of the State Council, an executive director of the China Cyberspace Security Association, deputy director of the Language Intelligence Committee of the China Artificial Intelligence Society, and a member of ACM, IEEE and AAAI. His research focuses on artificial intelligence and natural language processing, including intelligent question answering, semantic analysis, information extraction and OCR. He has published more than 300 academic papers in SCI journals such as TKDE and TIFS and at top international conferences such as AAAI, IJCAI, ACL and EMNLP, and won the ECIR 2010 Best Paper Award. Teams under his direction have won several championships in artificial intelligence and cybersecurity competitions at home and abroad.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (U1636211, 61672081), the Fund of the State Key Laboratory of Software Development Environment (SKLSDE-2019ZX-17) and the Beijing Advanced Innovation Center for Imaging Theory and Technology (BAICIT-2016001).

Abstract: In recent years, with the rapid development of deep learning, pre-training techniques for natural language processing have made great progress. Early natural language processing relied for a long time on word embedding methods such as Word2Vec to encode text, and these word embedding methods can also be regarded as static pre-training techniques. However, such context-independent text representations bring only limited improvement to downstream natural language processing tasks and cannot resolve polysemy. ELMo introduced a context-dependent text representation method that handles polysemous words effectively. Subsequently, pre-trained language models such as GPT and BERT were proposed; the BERT model in particular achieved significant improvements on many typical downstream tasks, greatly advanced the field of natural language processing, and ushered in the era of dynamic pre-training techniques. Since then, a large number of pre-trained language models, such as BERT-based improved models and XLNet, have continued to emerge, and pre-training has become an indispensable mainstream technique in natural language processing. This paper first outlines pre-training techniques and their development history, and then introduces in detail the classic pre-training techniques in natural language processing, including the early static pre-training techniques and the classic dynamic pre-training techniques. It then briefly reviews a series of newer, inspiring pre-training techniques, including BERT-based improved models and XLNet. On this basis, the paper analyzes the problems currently facing research on pre-training techniques. Finally, the future development trends of pre-training techniques are discussed.

Key words: Language model, Natural language processing, Pre-training, Word embedding
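
To make the contrast drawn in the abstract concrete, the sketch below (not part of the original paper) shows how a dynamic, pre-trained language model assigns different vectors to the same word in different contexts, which is exactly what a static embedding table cannot do. It is a minimal illustration assuming the open-source transformers and torch Python packages and the public bert-base-uncased checkpoint; the word "bank" and both example sentences are illustrative choices, not examples from the paper.

```python
# Minimal sketch (not from the paper): contrast a static word vector with
# BERT's context-dependent vectors for the same word.
# Assumes the open-source "transformers" and "torch" packages are installed
# and the public "bert-base-uncased" checkpoint can be downloaded.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

sentences = [
    "He sat on the bank of the river.",      # "bank" = riverside
    "She deposited the money at the bank.",  # "bank" = financial institution
]

bank_id = tokenizer.convert_tokens_to_ids("bank")
vectors = []
with torch.no_grad():
    for text in sentences:
        inputs = tokenizer(text, return_tensors="pt")
        # last_hidden_state has shape (1, seq_len, 768): one contextual vector per token
        hidden = model(**inputs).last_hidden_state[0]
        position = inputs["input_ids"][0].tolist().index(bank_id)
        vectors.append(hidden[position])

# A static embedding (e.g., Word2Vec) would give "bank" one fixed vector;
# the contextual model yields noticeably different vectors for the two senses.
similarity = torch.nn.functional.cosine_similarity(vectors[0], vectors[1], dim=0)
print(f"cosine similarity of the two 'bank' vectors: {similarity.item():.3f}")
```

Under a static representation such as Word2Vec, both occurrences of "bank" would map to the same row of the embedding matrix, which is why the abstract notes that context-independent representations cannot resolve polysemy.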

CLC Number: 

  • TP391