Computer Science ›› 2025, Vol. 52 ›› Issue (1): 259-276. DOI: 10.11896/jsjkx.240300028
• Artificial Intelligence •
ZHANG Jian1,2, LI Hui1, ZHANG Shengming2, WU Jie2, PENG Ying2