Computer Science ›› 2024, Vol. 51 ›› Issue (2): 196-204. doi: 10.11896/jsjkx.221100234
张峰, 黄仕鑫, 花强, 董春茹
ZHANG Feng, HUANG Shixin, HUA Qiang, DONG Chunru
Abstract: Image classification is a common visual recognition task with a wide range of applications. Traditional approaches to this problem typically use convolutional neural networks; however, the limited receptive field of convolution makes it difficult to model global relationships within an image, which lowers classification accuracy and hampers the handling of complex and diverse image data. To model these global relationships, some researchers have applied the Transformer to image classification. But to satisfy the Transformer's serialization and parallelization requirements, the image must be split into equal-sized, non-overlapping patches, which destroys the local information shared by adjacent patches. Moreover, because the Transformer carries little prior knowledge, such models usually require pre-training on large-scale datasets, resulting in high computational cost. To model the local information between adjacent patches while fully exploiting the global information of the image, this paper proposes a depth-wise convolution based vision Transformer, the Efficient Pyramid Vision Transformer (EPVT). EPVT extracts both local and global information across adjacent image patches at low computational cost. The model comprises three key components: a Local Perceptron Module (LPM), a Spatial Information Fusion module (SIF), and a Convolution Feed-forward Network (CFFN). The LPM captures local correlations in the image; the SIF fuses local information between adjacent patches and exploits long-range dependencies among different patches to strengthen the model's feature representation, enabling it to learn the semantic information of the output features at different dimensions; the CFFN encodes positional information and reshapes tensors. On the ImageNet-1K image classification dataset, the proposed model outperforms existing vision Transformer classifiers of comparable size, achieving 82.6% classification accuracy and demonstrating its competitiveness on large-scale datasets.
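To make the three-module structure described above concrete, the following is a minimal PyTorch sketch of one EPVT-style encoder block. It is not the authors' implementation: the class names (`LPM`, `CFFN`, `EPVTBlock`), the use of standard multi-head self-attention as a stand-in for the paper's SIF module, and all hyperparameters are illustrative assumptions. Only the overall pattern follows the abstract: a depth-wise convolution mixes information among adjacent patches, and the feed-forward network is augmented with a depth-wise convolution that reshapes tokens into a feature map to encode position.

```python
# Hypothetical sketch of one EPVT-style encoder block.
# Module names, the attention stand-in for SIF, and all sizes are
# assumptions for illustration; the paper's exact design may differ.
import torch
import torch.nn as nn

class LPM(nn.Module):
    """Local Perceptron Module: a depth-wise conv mixing information
    among spatially adjacent patches (local correlations)."""
    def __init__(self, dim):
        super().__init__()
        # groups=dim makes the 3x3 convolution depth-wise (per-channel),
        # which keeps the added cost small compared with a full conv
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x, H, W):
        # x: (B, N, C) token sequence -> (B, C, H, W) feature map
        B, N, C = x.shape
        feat = x.transpose(1, 2).reshape(B, C, H, W)
        feat = self.dwconv(feat) + feat          # residual local mixing
        return feat.flatten(2).transpose(1, 2)   # back to (B, N, C)

class CFFN(nn.Module):
    """Convolution feed-forward network: an MLP with a depth-wise conv
    in between; the conv on the reshaped feature map implicitly
    encodes positional information."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x, H, W):
        B, N, _ = x.shape
        x = self.fc1(x)
        x = x.transpose(1, 2).reshape(B, -1, H, W)   # tokens -> feature map
        x = self.act(self.dwconv(x))
        x = x.flatten(2).transpose(1, 2)             # feature map -> tokens
        return self.fc2(x)

class EPVTBlock(nn.Module):
    """One encoder block: LPM -> self-attention (standing in for the
    SIF module's long-range fusion) -> CFFN, each with a residual."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.lpm = LPM(dim)
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cffn = CFFN(dim)

    def forward(self, x, H, W):
        x = self.lpm(x, H, W)                              # local information
        y = self.norm1(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]  # global fusion
        x = x + self.cffn(self.norm2(x), H, W)
        return x

# Usage: a 14x14 grid of 64-dimensional patch tokens
tokens = torch.randn(2, 14 * 14, 64)
print(EPVTBlock(64)(tokens, 14, 14).shape)  # torch.Size([2, 196, 64])
```

The key design point the sketch illustrates is that all local mixing happens in depth-wise (per-channel) convolutions, whose parameter and FLOP cost grows only linearly in the channel count, so local modeling is added at low computational cost relative to the attention layers.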