Computer Science ›› 2025, Vol. 52 ›› Issue (1): 259-276. doi: 10.11896/jsjkx.240300028

• Artificial Intelligence •

Review of Pre-training Methods for Visually-rich Document Understanding

ZHANG Jian1,2, LI Hui1, ZHANG Shengming2, WU Jie2, PENG Ying2   

  1. School of Cyber Engineering, Xidian University, Xi’an 710071, China
  2. CETC Cyberspace Security Technology Co., Ltd., Chengdu 610095, China
  • Received: 2024-03-05  Revised: 2024-07-26  Online: 2025-01-15  Published: 2025-01-09
  • Corresponding author: ZHANG Shengming (shmizhang@gmail.com)
  • About author: ZHANG Jian, born in 1976, Ph.D., professor (zhang.jian06086@cetccst.com.cn). His main research interests include natural language processing, computer vision, and multi-modal learning.
    ZHANG Shengming, born in 1995, master. His main research interests include natural language processing, computer vision, and multi-modal learning.
  • Supported by:
    National Natural Science Foundation of China(61932015).

Abstract: A visually-rich document (VrD) is a document whose semantic structure is determined not only by its textual content but also by visual elements such as typesetting formats and table structures. Numerous real-world application scenarios, such as receipt understanding and card recognition, require automatically reading, analyzing, and processing VrDs (e.g., forms, invoices, and resumes). This process is called visually-rich document understanding (VrDU), which sits at the intersection of natural language processing (NLP) and computer vision (CV). In recent years, self-supervised pre-training techniques for VrDU have made significant progress in breaking down the training barriers between downstream tasks and improving model performance. However, a comprehensive summary and in-depth analysis of existing VrDU pre-training models is still lacking. To this end, this paper conducts an in-depth investigation and comprehensive summary of pre-training techniques for VrDU. Firstly, we introduce the data preprocessing stage of pre-training, including pre-training datasets and optical character recognition (OCR) engines. Then, we describe the model pre-training stage, distilling three key technique modules: single-modality representation learning, multi-modal feature fusion, and pre-training tasks; on the basis of these modules, we summarize the similarities and differences among the pre-training models. In addition, we briefly introduce the application of multi-modal large models to VrDU. Furthermore, we compare and analyze the experimental results of pre-training models on three representative downstream tasks. Finally, we discuss the challenges facing pre-training techniques and point out future research directions.
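To make the technique modules summarized above concrete, the following is a minimal, illustrative PyTorch sketch of two ideas that recur throughout the surveyed models: a layout-aware input embedding that fuses OCR-derived text with 2D bounding-box information (loosely following the LayoutLM family), and a masked language modeling pre-training objective. All class names, dimensions, and the plain Transformer stand-in for the multi-modal backbone are assumptions made for illustration, not the implementation of any specific model reviewed in this paper.

import torch
import torch.nn as nn

class LayoutAwareEmbedding(nn.Module):
    """Hypothetical module: sum token, 1D position, and 2D layout embeddings,
    in the style of LayoutLM-like pre-training models."""
    def __init__(self, vocab_size=30522, hidden=768, max_pos=512, max_coord=1024):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)   # textual token embedding
        self.pos = nn.Embedding(max_pos, hidden)      # 1D reading-order position
        # Embeddings for normalized bounding-box coordinates (x0, y0, x1, y1)
        # that an OCR engine would produce per token during data preprocessing.
        self.x_emb = nn.Embedding(max_coord, hidden)
        self.y_emb = nn.Embedding(max_coord, hidden)
        self.norm = nn.LayerNorm(hidden)

    def forward(self, token_ids, bboxes):
        # token_ids: (batch, seq); bboxes: (batch, seq, 4), ints in [0, max_coord)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        emb = (self.tok(token_ids) + self.pos(positions)
               + self.x_emb(bboxes[..., 0]) + self.y_emb(bboxes[..., 1])
               + self.x_emb(bboxes[..., 2]) + self.y_emb(bboxes[..., 3]))
        return self.norm(emb)

# Masked language modeling: mask a fraction of tokens and train the model to
# recover them from the fused text+layout representation. A plain Transformer
# encoder stands in here for a real multi-modal backbone.
embed = LayoutAwareEmbedding()
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=2)
mlm_head = nn.Linear(768, 30522)

token_ids = torch.randint(0, 30522, (2, 16))    # stand-in for OCR-extracted tokens
bboxes = torch.randint(0, 1024, (2, 16, 4))     # stand-in for OCR bounding boxes
labels = token_ids.clone()
mask = torch.rand(token_ids.shape) < 0.15       # mask about 15% of the tokens
masked_ids = token_ids.masked_fill(mask, 103)   # 103 = [MASK] id in the BERT vocab

logits = mlm_head(encoder(embed(masked_ids, bboxes)))
loss = nn.functional.cross_entropy(logits[mask], labels[mask])  # loss on masked positions only

Real models differ mainly in how the visual modality is encoded (CNN or vision Transformer features), how the modalities are fused, and which additional pre-training tasks are attached to this backbone.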

Key words: Document intelligence, Pre-training models, Natural language processing, Computer vision, Deep learning

CLC Number: TP391