Computer Science, 2024, Vol. 51, Issue (6A): 230500196-7. doi: 10.11896/jsjkx.230500196

• Image Processing & Multimedia Technology •


ConvNeXt Feature Extraction Study for Image Data

YANG Pengyue, WANG Feng, WEI Wei   

  1. School of Computer & Information Technology, Shanxi University, Taiyuan 030006, China
  • Published: 2024-06-06
  • Corresponding author: WANG Feng (sxuwangfeng@126.com)
  • About author: YANG Pengyue, born in 1999, postgraduate (ypy1754660523@163.com). His main research interests include image processing, data mining, and machine learning.
    WANG Feng, born in 1984, Ph.D., is a member of CCF (No.36494M). Her main research interests include data mining, machine learning, and granular computing.
  • Supported by:
    National Natural Science Foundation of China (62276158) and Research Project Supported by Shanxi Scholarship Council of China (2021-007).

Abstract: Convolutional neural networks have achieved many successes in computer vision tasks. Both object detection and segmentation depend on the extracted feature information, and problems such as ambiguous data and widely varying object shapes pose great challenges for feature extraction. Traditional convolutional structures can only learn contextual information from neighboring spatial locations of the feature map and cannot capture global information, while models such as the self-attention mechanism offer a larger receptive field and can establish global dependencies, but suffer from high computational complexity and the need for large amounts of data. This paper therefore proposes a model combining CNN and LSTM that better incorporates the global information of image data while enhancing the local receptive field. Taking the backbone network ConvNeXt-T as the base model, it addresses the problem of varying object shapes by concatenating convolutional kernels of different sizes to fuse multi-scale features, and aggregates bidirectional long short-term memory networks along both the horizontal and vertical directions to capture the interaction between global and local information. Image classification experiments are conducted on the publicly available CIFAR-10, CIFAR-100, and Tiny ImageNet datasets; compared with the base model ConvNeXt-T, the proposed network improves accuracy by 3.18%, 2.91%, and 1.03% on the three datasets, respectively. The experiments demonstrate that the improved ConvNeXt-T network achieves substantial gains over the base model in both parameter count and accuracy, and extracts more effective feature information.
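The abstract describes two concrete mechanisms added on top of the ConvNeXt-T backbone: fusing multi-scale features by concatenating convolutions with different kernel sizes, and aggregating bidirectional LSTMs along the horizontal and vertical directions of the feature map. The sketch below illustrates both ideas in PyTorch; the module names (MultiScaleConv, DirectionalBiLSTM), kernel sizes, channel arithmetic, and residual fusion are illustrative assumptions, not the authors' exact design.

```python
# Minimal sketch of the two ideas in the abstract, under stated assumptions:
# (1) parallel convolutions with different kernel sizes, fused by concatenation;
# (2) a BiLSTM scanned over the rows and columns of a feature map for global context.
import torch
import torch.nn as nn

class MultiScaleConv(nn.Module):
    """Hypothetical multi-scale fusion: concatenate branches with different kernels."""
    def __init__(self, in_ch, out_ch, kernel_sizes=(3, 5, 7)):
        super().__init__()
        branch_ch = out_ch // len(kernel_sizes)
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, branch_ch, k, padding=k // 2) for k in kernel_sizes
        )
        self.fuse = nn.Conv2d(branch_ch * len(kernel_sizes), out_ch, 1)  # 1x1 fusion

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

class DirectionalBiLSTM(nn.Module):
    """Hypothetical BiLSTM aggregation along horizontal and vertical directions."""
    def __init__(self, channels, hidden):
        super().__init__()
        self.row_rnn = nn.LSTM(channels, hidden, bidirectional=True, batch_first=True)
        self.col_rnn = nn.LSTM(channels, hidden, bidirectional=True, batch_first=True)
        self.proj = nn.Conv2d(4 * hidden, channels, 1)  # project back to input channels

    def forward(self, x):                                   # x: (B, C, H, W)
        b, c, h, w = x.shape
        rows = x.permute(0, 2, 3, 1).reshape(b * h, w, c)   # scan each row left-right
        rows, _ = self.row_rnn(rows)
        rows = rows.reshape(b, h, w, -1).permute(0, 3, 1, 2)
        cols = x.permute(0, 3, 2, 1).reshape(b * w, h, c)   # scan each column top-down
        cols, _ = self.col_rnn(cols)
        cols = cols.reshape(b, w, h, -1).permute(0, 3, 2, 1)
        return x + self.proj(torch.cat([rows, cols], dim=1))  # residual fusion

# Example: a stage-1 ConvNeXt-T feature map is roughly (B, 96, 56, 56).
feat = torch.randn(2, 96, 56, 56)
feat = MultiScaleConv(96, 96)(feat)            # local multi-scale fusion
feat = DirectionalBiLSTM(96, hidden=48)(feat)  # global row/column context
```

In a setup like this, the two modules would sit inside or alongside the ConvNeXt-T stages; the paper's exact placement and hyperparameters are not specified in the abstract.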

Key words: Feature extraction, Local receptive field, ConvNeXt-T, Multi-scale features, Bidirectional long short-term memory network

CLC Number: TP391