Scene Text Recognition Based on Feature Fusion in Space Domain and Frequency Domain

doi:10.11896/jsjkx.230300101

Abstract

Existing scene text recognition methods often suffer from low robustness and poor generalization in few-shot, language-independent scenes. To address this, a dual-stream network that fuses spatial-domain and frequency-domain features is proposed for the feature extraction stage. It consists of a deep residual convolutional branch that extracts spatial-domain features and a shallow neural network branch with a one-dimensional fast Fourier transform (FFT) that extracts frequency-domain features; a channel attention mechanism then fuses the two sets of features. In the sequence modeling stage, a multi-scale one-dimensional convolution module is proposed to replace the bidirectional long short-term memory (BiLSTM) network, in line with the characteristics of language-independent scenes. Finally, a complete model is built by combining these components with the existing thin-plate spline (TPS) rectification module and connectionist temporal classification (CTC) decoder. Training uses transfer learning: the model is first pre-trained on large English datasets and then fine-tuned on the target datasets. Experimental results on two few-shot, language-independent datasets compiled in the paper show that the method outperforms existing methods in accuracy, which verifies its robustness and generalization ability in this scenario. Moreover, with the feature extraction module described in the paper, the method outperforms the baseline on five benchmark datasets for language-dependent scenes (without fine-tuning), which verifies the effectiveness and versatility of the proposed dual-stream feature fusion network.
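The three building blocks named in the abstract (a one-dimensional FFT on the frequency branch, channel-attention fusion of the two feature streams, and a multi-scale one-dimensional convolution for sequence modeling) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, tensor shapes, the choice to concatenate the two streams before a squeeze-and-excitation style reweighting, and the choice to sum the outputs of the different kernel sizes are all assumptions made for illustration; the abstract does not specify these details.

```python
import numpy as np


def frequency_branch(x):
    """Apply a 1-D FFT along the width axis of a (C, H, W) feature map.

    Returns the magnitude spectrum, which has the same shape as x.
    (Assumption: the paper's shallow network around the FFT is omitted.)
    """
    return np.abs(np.fft.fft(x, axis=-1))


def channel_attention_fuse(spatial, freq, w1, w2):
    """Squeeze-and-excitation style fusion (assumed design).

    spatial, freq: (C, H, W) feature maps.
    w1: (hidden, 2C) and w2: (2C, hidden) fully connected weights.
    """
    fused = np.concatenate([spatial, freq], axis=0)   # (2C, H, W)
    squeezed = fused.mean(axis=(1, 2))                # global average pool -> (2C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)           # ReLU
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))    # sigmoid -> (2C,)
    return fused * weights[:, None, None]             # reweight each channel


def multi_scale_conv1d(seq, kernels):
    """Sum of 'same'-padded 1-D convolutions with different odd kernel sizes.

    seq: (T, D) feature sequence.
    kernels: list of arrays shaped (k, D, D_out), one per scale.
    (Assumption: outputs of the scales are summed, not concatenated.)
    """
    steps = seq.shape[0]
    out = 0.0
    for kern in kernels:
        k = kern.shape[0]
        pad = k // 2
        padded = np.pad(seq, ((pad, pad), (0, 0)))
        # Gather the k-frame window centred on each time step: (T, k, D)
        windows = np.stack([padded[t:t + k] for t in range(steps)])
        out = out + np.einsum("tkd,kde->te", windows, kern)
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    channels, height, width = 4, 2, 8
    spatial = rng.standard_normal((channels, height, width))
    freq = frequency_branch(spatial)
    w1 = rng.standard_normal((2, 2 * channels))
    w2 = rng.standard_normal((2 * channels, 2))
    fused = channel_attention_fuse(spatial, freq, w1, w2)
    # Collapse height so each image column becomes one time step: (W, 2C)
    sequence = fused.mean(axis=1).T
    kernels = [rng.standard_normal((k, 2 * channels, 5)) for k in (1, 3, 5)]
    print(multi_scale_conv1d(sequence, kernels).shape)
```

Unlike a BiLSTM, the convolutional module only mixes information within a fixed local window per scale, which is consistent with the abstract's motivation that language-independent text carries little long-range linguistic context.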

Key words: Deep learning, Scene text recognition, Dual-stream network, Frequency domain branch, Few-shot

CLC Number: TP391

HUO Huaqi, LU Lu. Scene Text Recognition Based on Feature Fusion in Space Domain and Frequency Domain [J]. Computer Science, 2023, 50(11A): 230300101-8.


