Computer Science ›› 2025, Vol. 52 ›› Issue (1): 259-276.doi: 10.11896/jsjkx.240300028

• Artificial Intelligence •

Review of Pre-training Methods for Visually-rich Document Understanding

ZHANG Jian1,2, LI Hui1, ZHANG Shengming2, WU Jie2, PENG Ying2   

  1 School of Cyber Engineering,Xidian University,Xi’an 710071,China
    2 CETC Cyberspace Security Technology Co.,Ltd.,Chengdu 610095,China
  • Received: 2024-03-05 Revised: 2024-07-26 Online: 2025-01-15 Published: 2025-01-09
  • About author:ZHANG Jian,born in 1976,Ph.D.,professor.His main research interests include natural language processing,computer vision and multi-modal learning.
    ZHANG Shengming,born in 1995,master.His main research interests include natural language processing,computer vision and multi-modal learning.
  • Supported by:
    National Natural Science Foundation of China(61932015).

Abstract: A visually-rich document (VrD) is a document whose semantic structure is determined not only by its textual content but also by visual elements such as typesetting formats and table structures. Numerous application scenarios, such as receipt understanding and card recognition, require automatically reading, analyzing and processing VrDs (e.g., forms, invoices, and resumes). This process is called visually-rich document understanding (VrDU), a field at the intersection of natural language processing (NLP) and computer vision (CV). Recently, self-supervised pre-training techniques for VrDU have made significant progress in breaking down the training barriers between downstream tasks and improving model performance. However, a comprehensive summary and in-depth analysis of pre-training models for VrDU is still lacking. To this end, we conduct an in-depth investigation and comprehensive summary of VrDU pre-training techniques. Firstly, we introduce the data processing stage of pre-training technology, including traditional pre-training datasets and optical character recognition (OCR) engines. Then, we discuss three key technique modules in the model pre-training stage, namely single-modality representation learning, multi-modal feature fusion, and pre-training tasks, and elaborate the similarities and differences between pre-training models on the basis of these three modules. In addition, we briefly introduce the multi-modal large models applied to VrDU. Furthermore, we analyze the experimental results of pre-training models on three representative downstream tasks. Finally, we point out the challenges and future research directions related to pre-training models.
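The layout-aware input representation surveyed here can be illustrated with the scheme introduced by LayoutLM [12]: each OCR token's word embedding is summed with embeddings of its bounding-box coordinates, normalized to a 0-1000 grid. The sketch below is illustrative only — the embedding width is tiny and the tables are random, not any model's trained weights:

```python
import random

random.seed(0)
HIDDEN = 8  # toy embedding width, kept small for illustration

def make_table(size):
    """A random embedding table: one HIDDEN-dim vector per index."""
    return [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(size)]

tok_table = make_table(30522)  # vocabulary embeddings (BERT-sized vocab)
x_table = make_table(1001)     # shared by x0 and x1 coordinates
y_table = make_table(1001)     # shared by y0 and y1 coordinates

def normalize_bbox(bbox, page_w, page_h):
    """Scale an OCR bounding box to the 0-1000 grid used by LayoutLM."""
    x0, y0, x1, y1 = bbox
    return [int(1000 * x0 / page_w), int(1000 * y0 / page_h),
            int(1000 * x1 / page_w), int(1000 * y1 / page_h)]

def embed(token_id, bbox):
    """Sum the token embedding with the four 2-D position embeddings."""
    x0, y0, x1, y1 = bbox
    parts = [tok_table[token_id],
             x_table[x0], y_table[y0], x_table[x1], y_table[y1]]
    return [sum(p[i] for p in parts) for i in range(HIDDEN)]

# A word box from a US-letter page (612x792 pt), mapped to the 0-1000 grid
box = normalize_bbox((10, 20, 50, 40), page_w=612, page_h=792)
print(box)                    # [16, 25, 81, 50]
print(len(embed(2054, box)))  # 8
```

In the actual models, the summed vectors feed a Transformer encoder [11], and pre-training tasks such as masked visual-language modeling provide the self-supervised signal.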

Key words: Document intelligence, Pre-training models, Natural language processing, Computer vision, Deep learning

CLC Number: TP391

[1]SARKHEL R,NANDI A.Deterministic Routing between Layout Abstractions for Multi-scale Classification of Visually Rich Documents[C]//Proceedings of the 28th International Joint Conference on Artificial Intelligence(IJCAI).2019:3360-3366.
[2]CUI L,XU Y H,LU T C,et al.Document AI:Benchmarks,Models and Applications[J].Journal of Chinese Information Processing,2022,36(6):1-19.
[3]LIU X J,GAO F Y,ZHANG Q,et al.Graph Convolution for Multimodal Information Extraction from Visually Rich Documents[C]//Proceedings of 2019 Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies(NAACL-HLT).2019:32-39.
[4]ZHANG P,XU Y L,CHENG Z Z,et al.TRIE:End-to-end Text Reading and Information Extraction for Document Understanding[C]//Proceedings of the 28th ACM International Conference on Multimedia(MM).2020:1413-1422.
[5]YU W W,LU N,QI X B,et al.PICK:Processing Key Information Extraction from Documents Using Improved Graph Learning-convolutional Networks[C]//Proceedings of the 25th International Conference on Pattern Recognition(ICPR).2021:4363-4370.
[6]WU T L,LI C,ZHANG M Y,et al.LAMPRET:Layout-aware Multimodal Pretraining for Document Understanding[J].arXiv:2104.08405,2021.
[7]GU J X,KUEN J,MORARIU V I,et al.UniDoc:Unified Pretraining Framework for Document Understanding[C]//Proceedings of the 35th Annual Conference on Neural Information Processing Systems(NeurIPS).2021:39-50.
[8]POWALSKI R,BORCHMANN Ł,JURKIEWICZ D,et al.Going Full-tilt Boogie on Document Understanding with Text-image-layout Transformer[C]//Proceedings of the 16th International Conference on Document Analysis and Recognition(ICDAR).2021:732-747.
[9]PAN S J,YANG Q.A Survey on Transfer Learning[J].IEEE Transactions on Knowledge and Data Engineering,2010,22(10):1345-1359.
[10]YOSINSKI J,CLUNE J,BENGIO Y,et al.How Transferable Are Features in Deep Neural Networks?[C]//Proceedings of the 28th Annual Conference on Neural Information Processing Systems(NeurIPS).2014:3320-3328.
[11]VASWANI A,SHAZEER N,USZKOREIT J,et al.Attention Is All You Need[C]//Proceedings of the 31st Annual Conference on Neural Information Processing Systems(NeurIPS).2017:6000-6010.
[12]XU Y H,LI M H,CUI L,et al.LayoutLM:Pre-training of Text and Layout for Document Image Understanding[C]//Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining(KDD).2020:1192-1200.
[13]PENG Q M,PAN Y X,WANG W J,et al.ERNIE-Layout:Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding[C]//Proceedings of 2022 Conference on Empirical Methods in Natural Language Processing(EMNLP).2022:7747-7757.
[14]APPALARAJU S,JASANI B,KOTA B U,et al.Docformer:End-to-end Transformer for Document Understanding[C]//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision(ICCV).2021:993-1003.
[15]DEVLIN J,CHANG M W,LEE K,et al.BERT:Pre-training of Deep Bidirectional Transformers for Language Understanding[C]//Proceedings of 2019 Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies(NAACL-HLT).2019:4171-4186.
[16]HE K M,ZHANG X Y,REN S Q,et al.Deep Residual Learning for Image Recognition[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition(CVPR).2016:770-778.
[17]BA L J,KIROS J R,HINTON G E.Layer Normalization[J].arXiv:1607.06450,2016.
[18]ZHANG H Y,WANG T B,LI M Z,et al.Comprehensive Review of Visual-language-oriented Multimodal Pretraining Methods[J].Journal of Image and Graphics,2022,27:2652-2682.
[19]LEWIS D,AGAM G,ARGAMON S,et al.Building a Test Collection for Complex Document Information Processing[C]//Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.2006:665-666.
[20]HARLEY A W,UFKES A,DERPANIS K G.Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval[C]//Proceedings of the 13th International Conference on Document Analysis and Recognition(ICDAR).2015:991-995.
[21]ZHONG X,TANG J,YEPES A J.PubLayNet:Largest Dataset Ever for Document Layout Analysis[C]//Proceedings of 2019 International Conference on Document Analysis and Recognition(ICDAR).2019:1015-1022.
[22]LI M H,XU Y H,CUI L,et al.DocBank:A Benchmark Dataset for Document Layout Analysis[C]//Proceedings of the 28th International Conference on Computational Linguistics(COLING).2020:949-960.
[23]BITEN A F,TITO R,GOMEZ L,et al.OCR-IDL:OCR Annotations for Industry Document Library Dataset[C]//Workshop at European Conference on Computer Vision(ECCV).2022:241-252.
[24]WU Y H,SCHUSTER M,CHEN Z F,et al.Google’s Neural Machine Translation System:Bridging the Gap between Human and Machine Translation[J].arXiv:1609.08144,2016.
[25]LI Y L,QIAN Y X,YU Y C,et al.StrucTexT:Structured Text Understanding with Multi-modal Transformers[C]//Proceedings of the 29th ACM International Conference on Multimedia(MM).2021:1912-1920.
[26]LUO C W,TANG G Z,ZHENG Q,et al.Bi-VLDoc:Bidirectional Vision-language Modeling for Visually-rich Document Understanding[J].arXiv:2206.13155,2022.
[27]BAI H L,LIU Z G,MENG X J,et al.Wukong-Reader:Multi-modal Pre-training for Fine-grained Visual Document Understanding[C]//Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics.2023:13386-13401.
[28]LIU Y,OTT M,GOYAL N,et al.RoBERTa:A Robustly Optimized BERT Pretraining Approach[J].arXiv:1907.11692,2019.
[29]LI P Z,GU J X,KUEN J,et al.SelfDoc:Self-supervised Document Representation Learning[C]//Proceedings of 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR).2021:5648-5656.
[30]WANG Z L,GU J X,TENSMEYER C,et al.MGDoc:Pre-training with Multi-granular Hierarchy for Document Image Understanding[C]//Proceedings of 2022 Conference on Empirical Methods in Natural Language Processing(EMNLP).2022:3984-3993.
[31]REIMERS N,GUREVYCH I.Sentence-BERT:Sentence Embeddings Using Siamese BERT-networks[C]//Proceedings of 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing(EMNLP-IJCNLP).2019:3982-3992.
[32]LI C L,BI B,YAN M,et al.StructuralLM:Structural Pre-training for Form Understanding[C]//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing(ACL-IJCNLP).2021:6309-6318.
[33]WANG J P,JIN L W,DING K.LiLT:A Simple yet Effective Language-independent Layout Transformer for Structured Document Understanding[C]//Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics(ACL).2022:7747-7757.
[34]HONG T,KIM D,JI M,et al.BROS:A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents[C]//Proceedings of the 36th AAAI Conference on Artificial Intelligence(AAAI).2022:10767-10775.
[35]XU Y,XU Y H,LV T C,et al.LayoutLMv2:Multi-modal Pre-training for Visually-Rich Document Understanding[C]//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing(ACL-IJCNLP).2021:2579-2591.
[36]RONNEBERGER O,FISCHER P,BROX T.U-Net:Convolutional Networks for Biomedical Image Segmentation[C]//Proceedings of Medical Image Computing and Computer-Assisted Intervention 18th International Conference(MICCAI).2015:234-241.
[37]XU Y H,LV T C,CUI L,et al.LayoutXLM:Multimodal Pre-training for Multilingual Visually-rich Document Understanding[J].arXiv:2104.08836,2021.
[38]XIE S N,GIRSHICK R,DOLLÁR P,et al.Aggregated Residual Transformations for Deep Neural Networks[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition(CVPR).2017:1492-1500.
[39]REN S,HE K M,GIRSHICK R,et al.Faster R-CNN:Towards Real-time Object Detection with Region Proposal Networks [C]//Proceedings of the 29th Annual Conference on Neural Information Processing Systems(NeurIPS).2015:91-99.
[40]HE K M,GKIOXARI G,DOLLÁR P,et al.Mask R-CNN[C]//Proceedings of 2017 IEEE International Conference on Computer Vision(ICCV).2017:2961-2969.
[41]ALI T,ROY P.Enhancing Document Information Analysis with Multi-Task Pre-training:A Robust Approach for Information Extraction in Visually-Rich Documents[J].arXiv:2310.16527,2023.
[42]ZHANG Z R,MA J F,DU J,et al.Multimodal Pre-Training Based on Graph Attention Network for Document Understanding[J].IEEE Transactions on Multimedia,2023,25:6743-6755.
[43]LIU Z,LIN Y T,CAO Y,et al.Swin Transformer:Hierarchical Vision Transformer Using Shifted Windows[C]//Proceedings of 2021 IEEE/CVF International Conference on Computer Vision(ICCV).2021:10012-10022.
[44]HUANG Y P,LV T C,CUI L,et al.LayoutLMv3:Pre-training for Document AI with Unified Text and Image Masking[C]//Proceedings of the 30th ACM International Conference on Multimedia(MM).2022:4083-4091.
[45]LI J L,XU Y H,LV T C,et al.DiT:Self-supervised Pre-training for Document Image Transformer[C]//Proceedings of the 30th ACM International Conference on Multimedia(MM).2022:3530-3539.
[46]TU Y,GUO Y,CHEN H,et al.LayoutMask:Enhance Text-Layout Interaction in Multi-modal Pre-training for Document Understanding[C]//Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics(ACL).2023:15200-15212.
[47]LUO C W,CHENG C X,ZHENG Q,et al.GeoLayoutLM:Geometric Pre-training for Visual Information Extraction[C]//Proceedings of 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR).2023:7092-7101.
[48]APPALARAJU S,TANG P,DONG Q,et al.DocFormerv2:Local Features for Document Understanding[C]//Proceedings of the 38th AAAI Conference on Artificial Intelligence(AAAI).2024:709-718.
[49]LI Q W,LI Z C,CAI X T,et al.Enhancing Visually-Rich Document Understanding via Layout Structure Modeling[C]//Proceedings of the 31st ACM International Conference on Multimedia(MM).2023:4513-4523.
[50]ZHENG Z H,WANG P,LIU W,et al.Distance-IoU Loss:Faster and Better Learning for Bounding Box Regression[C]//Proceedings of the 34th AAAI Conference on Artificial Intelligence(AAAI).2020,34(7):12993-13000.
[51]BROWN T B,MANN B,RYDER N,et al.Language Models Are Few-shot Learners[C]//Proceedings of the 34th Annual Conference on Neural Information Processing Systems(NeurIPS).2020:1877-1901.
[52]TOUVRON H,LAVRIL T,IZACARD G,et al.LLaMA:Open and Efficient Foundation Language Models[J].arXiv:2302.13971,2023.
[53]ZHU D,CHEN J,SHEN X,et al.MiniGPT-4:Enhancing Vision-language Understanding with Advanced Large Language Models[J].arXiv:2304.10592,2023.
[54]LIU H,LI C,WU Q,et al.Visual Instruction Tuning[C]//Proceedings of the 37th Annual Conference on Neural Information Processing Systems(NeurIPS).2023.
[55]YE J B,HU A W,XU H Y,et al.mPLUG-DocOwl:Modularized Multimodal Large Language Model for Document Understanding[J].arXiv:2307.02499,2023.
[56]HU A W,XU H Y,YE J B,et al.mPLUG-DocOwl 1.5:Unified Structure Learning for OCR-free Document Understanding[J].arXiv:2403.12895,2024.
[57]YE J B,HU A W,XU H Y,et al.UReader:Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model[C]//Proceedings of Findings of the Association for Computational Linguistics:EMNLP.2023:2841-2858.
[58]BAI J Z,BAI S,YANG S S,et al.Qwen-VL:A Frontier Large Vision-Language Model with Versatile Abilities[J].arXiv:2308.12966,2023.
[59]FENG H,LIU Q,LIU H,et al.DocPedia:Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding[J].arXiv:2311.11810,2023.
[60]LI Z,YANG B,LIU Q,et al.Monkey:Image Resolution and Text Label Are Important Things for Large Multi-modal Mo-dels[J].arXiv:2311.06607,2023.
[61]HUANG Z,CHEN K,HE J H,et al.ICDAR2019 Competition on Scanned Receipt OCR and Information Extraction[C]//Proceedings of 2019 International Conference on Document Analysis and Recognition(ICDAR).2019:1516-1520.
[62]JAUME G,EKENEL H K,THIRAN J P.FUNSD:A Dataset for Form Understanding in Noisy Scanned Documents[C]//Workshop at 2019 International Conference on Document Analysis and Recognition.2019:1-6.
[63]PARK S,SHIN S,LEE B,et al.CORD:A Consolidated Receipt Dataset for Post-OCR Parsing[C]//Workshop on Document Intelligence at NeurIPS.2019.
[64]GUO H,QIN X M,LIU J M,et al.EATEN:Entity-aware Attention for Single Shot Visual Text Extraction[C]//Proceedings of 2019 International Conference on Document Analysis and Recognition(ICDAR).2019:254-259.
[65]SUN H B,KUANG Z H,YUE X Y,et al.Spatial Dual-modality Graph Reasoning for Key Information Extraction[J].arXiv:2103.14470,2021.
[66]WANG J P,LIU C Y,JIN L W,et al.Towards Robust Visual Information Extraction in Real World:New Dataset and Novel Solution[C]//Proceedings of the 35th AAAI Conference on Artificial Intelligence(AAAI).2021:2738-2745.
[67]STANISŁAWEK T,GRALIŃSKI F,WRÓBLEWSKA A,et al.Kleister:Key Information Extraction Datasets Involving Long Documents with Complex Layouts[C]//Proceedings of the 16th International Conference on Document Analysis and Recognition(ICDAR).2021:564-579.
[68]MATHEW M,KARATZAS D,JAWAHAR C V.DocVQA:A Dataset for VQA on Document Images[C]//Proceedings of 2021 IEEE/CVF Winter Conference on Applications of Computer Vision(WACV).2021:2200-2209.
[69]QI L,LV S,LI H Y,et al.DuReadervis:A Chinese Dataset for Open-domain Document Visual Question Answering[C]//Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics(ACL).2022:1338-1351.
[70]LEVENSHTEIN V I.Binary Codes Capable of Correcting Deletions,Insertions,and Reversals[J].Soviet Physics Doklady,1966,10(8):707-710.
[71]MA K,SHU Z X,BAI X,et al.DocUNet:Document Image Unwarping via A Stacked U-Net[C]//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR).2018:4700-4709.
[72]BANDYOPADHYAY H,DASGUPTA T,DAS N,et al.A Gated and Bifurcated Stacked U-net Module for Document Image Dewarping[C]//Proceedings of the 25th International Conference on Pattern Recognition(ICPR).2021:10548-10554.
[73]JIANG X W,LONG R J,XUE N,et al.Revisiting Document Image Dewarping by Grid Regularization[C]//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR).2022:4533-4542.
[74]QIN H,LI Y J,LIANG Q K,et al.Asymcnet:A Document Images-relevant Asymmetric Geometry Correction Network[J].Journal of Image and Graphics,2023,28(8):2314-2329.
[75]HE K M,CHEN X L,XIE S L,et al.Masked Autoencoders Are Scalable Vision Learners[C]//Proceedings of 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR).2022:15979-15988.
[76]DOSOVITSKIY A,BEYER L,KOLESNIKOV A,et al.An Image Is Worth 16x16 Words:Transformers for Image Recognition at Scale[C]//Proceedings of the 9th International Conference on Learning Representations(ICLR).2021.
[77]RADFORD A,NARASIMHAN K,SALIMANS T,et al.Improving Language Understanding by Generative Pre-training[EB/OL].[2024-02-26].https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf.
[78]ZHANG S S,ROLLER S,GOYAL N,et al.OPT:Open Pre-trained Transformer Language Models[J].arXiv:2205.01068,2022.
[79]ZENG A H,LIU X,DU Z X,et al.GLM-130B:An Open Bilingual Pre-trained Model[C]//Proceedings of the 11th International Conference on Learning Representations(ICLR).2023.
[80]RADFORD A,KIM J W,HALLACY C,et al.Learning Transferable Visual Models from Natural Language Supervision[C]//Proceedings of the 38th International Conference on Machine Learning(ICML).2021:8748-8763.
[81]LI J N,LI D X,XIONG C M,et al.BLIP:Bootstrapping Language-image Pre-training for Unified Vision-language Understanding and Generation[C]//Proceedings of the 39th International Conference on Machine Learning(ICML).2022:12888-12900.
[82]LI J N,LI D X,SAVARESE S,et al.BLIP-2:Bootstrapping Language-image Pre-training with Frozen Image Encoders and Large Language Models[C]//Proceedings of the 40th International Conference on Machine Learning(ICML).2023:19730-19742.