Computer Science ›› 2023, Vol. 50 ›› Issue (11A): 230300101-8. doi: 10.11896/jsjkx.230300101

• Artificial Intelligence •

• Corresponding author: LU Lu (lul@scut.edu.cn)
• About author: (hhq-yyq@qq.com)

Scene Text Recognition Based on Feature Fusion in Space Domain and Frequency Domain

HUO Huaqi1, LU Lu1,2   

  1 School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China
    2 Peng Cheng Laboratory, Shenzhen, Guangdong 518055, China
  • Published:2023-11-09
  • About author: HUO Huaqi, born in 1998, postgraduate. His main research interests include deep learning and scene text recognition.
    LU Lu, born in 1971, Ph.D, professor, Ph.D supervisor. His main research interests include deep learning, software reliability and high performance computing.
  • Supported by:
    Research Plan of Key Fields of Guangdong Province (2022B0101070001).


Abstract: Existing scene text recognition methods often suffer from low robustness and poor generalization ability in few-shot, language-independent scenarios. To address this problem, on the one hand, a dual-stream network structure based on the fusion of space-domain and frequency-domain features is proposed for the feature extraction stage. It consists of a deep residual convolutional network branch that extracts spatial-domain features and a branch combining a one-dimensional fast Fourier transform (FFT) with a shallow neural network that extracts frequency-domain features; a channel attention mechanism is then applied to fuse the two kinds of features. On the other hand, in the sequence modeling stage, a multi-scale one-dimensional convolution module is proposed to replace the bidirectional long short-term memory (BiLSTM) network, in keeping with the characteristics of the language-independent scenario. Finally, a complete model is built by combining the existing TPS rectification module and CTC decoder. Transfer learning is adopted during training: the model is first pre-trained on large English datasets and then fine-tuned on the target datasets. Experimental results on two few-shot, language-independent datasets compiled in this paper show that the proposed method outperforms existing methods in accuracy, verifying its high robustness and generalization ability in this scenario. Moreover, without fine-tuning, the method using the proposed feature extraction module outperforms the baseline on five benchmark datasets of the language-dependent scenario, demonstrating the effectiveness and versatility of the proposed dual-stream feature fusion network.
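The dual-stream extraction stage described above can be illustrated with a rough sketch: a frequency branch that applies a one-dimensional FFT plus a shallow projection, fused with the spatial features through SE-style channel attention. All weights, shapes, and helper names below are hypothetical stand-ins (randomly initialized, in NumPy), not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def frequency_branch(x):
    # x: (C, H, W) spatial feature map. Apply a 1-D FFT along the width
    # axis and keep the magnitudes as frequency-domain features.
    spec = np.abs(np.fft.rfft(x, axis=-1))            # (C, H, W//2 + 1)
    # "Shallow network": a single hypothetical linear projection back to
    # W bins; in a real model this weight would be learned.
    w_proj = rng.standard_normal((spec.shape[-1], x.shape[-1])) * 0.1
    return np.maximum(spec @ w_proj, 0.0)             # ReLU, (C, H, W)

def channel_attention_fuse(spatial, freq, reduction=4):
    # SE-style fusion: concatenate along channels, squeeze by global
    # average pooling, excite with two FC layers, then reweight channels.
    fused = np.concatenate([spatial, freq], axis=0)   # (2C, H, W)
    c = fused.shape[0]
    squeeze = fused.mean(axis=(1, 2))                 # (2C,)
    w1 = rng.standard_normal((c, c // reduction)) * 0.1
    w2 = rng.standard_normal((c // reduction, c)) * 0.1
    excite = 1.0 / (1.0 + np.exp(-(np.maximum(squeeze @ w1, 0.0) @ w2)))
    return fused * excite[:, None, None]              # reweighted (2C, H, W)

spatial = rng.standard_normal((8, 4, 16))  # stand-in for ResNet features
freq = frequency_branch(spatial)
out = channel_attention_fuse(spatial, freq)
print(out.shape)  # (16, 4, 16)
```

In a trained model the projection and excitation matrices would be learned parameters, and the reweighted tensor would feed the sequence modeling stage.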

Key words: Deep learning, Scene text recognition, Dual-stream network, Frequency domain branch, Few-shot
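The multi-scale one-dimensional convolution module proposed above as a BiLSTM replacement can be pictured as parallel 1-D convolutions with different kernel widths whose outputs are concatenated along the channel axis. The kernel sizes, channel counts, and helper functions in this NumPy sketch are illustrative assumptions rather than the authors' configuration.

```python
import numpy as np

rng = np.random.default_rng(1)

def conv1d(seq, kernel):
    # seq: (T, C_in), kernel: (k, C_in, C_out); 'same' padding along T.
    k, _, c_out = kernel.shape
    pad = k // 2
    padded = np.pad(seq, ((pad, pad), (0, 0)))
    out = np.zeros((seq.shape[0], c_out))
    for t in range(seq.shape[0]):
        window = padded[t:t + k]                     # (k, C_in)
        out[t] = np.einsum("kc,kco->o", window, kernel)
    return out

def multi_scale_block(seq, c_out=32, scales=(1, 3, 5)):
    # Parallel 1-D convolutions with different kernel sizes capture
    # contexts of several widths; their outputs are concatenated along
    # channels as a stand-in for the BiLSTM's sequence modeling.
    branches = []
    for k in scales:
        kernel = rng.standard_normal((k, seq.shape[1], c_out)) * 0.1
        branches.append(np.maximum(conv1d(seq, kernel), 0.0))  # ReLU
    return np.concatenate(branches, axis=-1)         # (T, c_out * len(scales))

features = rng.standard_normal((26, 64))  # assumed sequence from the backbone
ctx = multi_scale_block(features)
print(ctx.shape)  # (26, 96)
```

Unlike a BiLSTM, each output step here depends only on a fixed local window, which matches the language-independent setting where long-range linguistic context is less informative.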

CLC Number: TP391