Computer Science (计算机科学) ›› 2020, Vol. 47 ›› Issue (3): 162-173. doi: 10.11896/jsjkx.191000167
李舟军,范宇,吴贤杰
LI Zhou-jun,FAN Yu,WU Xian-jie
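Abstract: In recent years, with the rapid development of deep learning, pre-training techniques for natural language processing (NLP) have advanced considerably. For a long time, early NLP work used word-embedding methods such as Word2Vec to encode text; these methods can also be regarded as static pre-training techniques. However, such context-independent text representations bring only limited improvement to downstream NLP tasks and cannot resolve polysemy. ELMo proposed a context-dependent text representation that handles polysemous words effectively. Subsequently, pre-trained language models such as GPT and BERT were proposed in succession; BERT in particular delivered significant gains on many typical downstream tasks, greatly advancing NLP and ushering in the era of dynamic pre-training techniques. Since then, a large number of pre-trained language models, including BERT-based improved models and XLNet, have emerged, and pre-training has become an indispensable mainstream technique in NLP. This paper first outlines pre-training techniques and their history, and introduces in detail the classic pre-training techniques in NLP, including the early static techniques and the classic dynamic techniques. It then briefly reviews a series of recent, instructive pre-training techniques, including BERT-based improved models and XLNet. On this basis, it analyzes the problems currently facing pre-training research. Finally, it looks ahead to future development trends of pre-training techniques.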