Computer Science, 2024, Vol. 51, Issue (2): 196-204. doi: 10.11896/jsjkx.221100234

• Computer Graphics & Multimedia •

Image Classification Model Based on Depth-wise Convolution and Visual Transformer

张峰, 黄仕鑫, 花强, 董春茹   

  1. Hebei Key Laboratory of Machine Learning and Computational Intelligence, College of Mathematics and Information Science, Hebei University, Baoding, Hebei 071002, China
  • Received: 2022-11-28  Revised: 2023-06-14  Online: 2024-02-15  Published: 2024-02-22
  • Corresponding author: DONG Chunru (dongcr@hbu.edu.cn)
  • About author: ZHANG Feng (fengzhang@hbu.edu.cn)
  • Supported by:
    National Key R&D Program of China (2022YFE0196100), Natural Science Foundation of Hebei Province (F2018201115), Key Scientific Research Foundation of the Education Department of Hebei Province (ZD2019021) and Hebei University High-level Innovative Talent Research Start-up Funding Project.

Novel Image Classification Model Based on Depth-wise Convolution Neural Network and Visual Transformer

ZHANG Feng, HUANG Shixin, HUA Qiang, DONG Chunru   

  1. Hebei Key Laboratory of Machine Learning and Computational Intelligence, College of Mathematics and Information Science, Hebei University, Baoding, Hebei 071002, China
  • Received:2022-11-28 Revised:2023-06-14 Online:2024-02-15 Published:2024-02-22
  • About author: ZHANG Feng, born in 1976, Ph.D, associate professor, master supervisor, is a member of CCF (No.65203M). Her main research interests include machine learning and intelligent decision-making. DONG Chunru, born in 1980, Ph.D, associate professor, master supervisor. His main research interests include deep learning and image processing.
  • Supported by:
    National Key R&D Program of China(2022YFE0196100),Natural Science Foundation of Hebei Province(F2018201115),Key Scientific Research Foundation of Education Department of Hebei Province(ZD2019021) and Hebei University High-level Innovative Talent Research Start-up Funding Project.

Abstract: Image classification is a common visual recognition task with a wide range of application scenarios. Traditional approaches to image classification rely on convolutional neural networks; however, the limited receptive field of convolution makes it difficult to model the global relations of an image, which lowers classification accuracy and makes complex and diverse image data hard to handle. To model global relations, some researchers have applied the Transformer to image classification, but to satisfy the Transformer's serialization and parallelization requirements the image has to be split into equally sized, non-overlapping patches, which destroys the local information between adjacent patches. In addition, because the Transformer carries little prior knowledge, such models usually have to be pre-trained on large-scale datasets, resulting in high computational complexity. To model the local information between adjacent image patches while making full use of the global information of the image, this paper proposes a Depth-wise convolution based vision Transformer, the Efficient Pyramid Vision Transformer (EPVT), which extracts both local and global information between adjacent image patches at a low computational cost. The EPVT model contains three key components: a Local Perceptron Module (LPM), a Spatial Information Fusion module (SIF) and a Convolution Feed-forward Network (CFFN). The LPM captures the local correlations of the image; the SIF module fuses the local information between adjacent image patches and exploits the long-distance dependencies between different patches, improving the model's feature representation ability and allowing it to learn the semantic information of the output features across different dimensions; the CFFN module encodes positional information and reshapes tensors. On the ImageNet-1K image classification dataset, the proposed model outperforms existing vision Transformer classifiers of comparable scale, achieving 82.6% classification accuracy, which demonstrates its competitiveness on large-scale datasets.
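The Local Perceptron Module described above is built around a depth-wise convolution. Below is a minimal PyTorch sketch of such a block, assuming a 3x3 depth-wise convolution with a residual connection; the class name LocalPerceptronModule, its interface and the hyper-parameters are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn

class LocalPerceptronModule(nn.Module):
    # Hypothetical LPM-style block: a 3x3 depth-wise convolution mixes each
    # patch embedding with its spatial neighbours, and a residual connection
    # preserves the original token content.
    def __init__(self, dim):
        super().__init__()
        # groups=dim makes the convolution depth-wise: one 3x3 filter per channel
        self.dw_conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x, h, w):
        # x: (batch, num_patches, dim), with num_patches == h * w
        b, n, c = x.shape
        feat = x.transpose(1, 2).reshape(b, c, h, w)   # token sequence -> feature map
        feat = feat + self.dw_conv(feat)               # residual local mixing
        return feat.reshape(b, c, n).transpose(1, 2)   # feature map -> token sequence

For example, 196 tokens from a 14x14 patch grid with 64-dimensional embeddings would be processed as LocalPerceptronModule(64)(torch.randn(2, 196, 64), 14, 14), keeping the (2, 196, 64) shape.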

Key words: Deep learning, Image classification, Depth-wise convolution, Visual Transformer, Attention mechanism

Abstract: Deep learning-based image classification models have been successfully applied in various scenarios. Current image classification models can be categorized into two classes: CNN-based classifiers and Transformer-based classifiers. Due to its limited receptive field, a CNN-based classifier cannot model the global relations of an image, which decreases classification accuracy. Transformer-based classifiers, on the other hand, usually segment the image into non-overlapping patches of equal size, which harms the local information between each pair of adjacent image patches. Additionally, Transformer-based classification models often require pre-training on large datasets, resulting in high computational costs. To tackle these problems, an efficient pyramid vision Transformer (EPVT) based on depth-wise convolution is proposed in this paper to extract both the local and global information between adjacent image patches at a low computational cost. The EPVT model consists of three key components: a local perception module (LP), a spatial information fusion module (SIF) and a convolutional feed-forward network module (CFFN). The LP module captures the local correlation of image patches. The SIF module fuses local information between adjacent image patches and improves the feature expression ability of EPVT by exploiting the long-distance dependence between different image patches. The CFFN module encodes the location information and reshapes tensors between feature image patches. To validate the performance of the proposed EPVT model, extensive experiments are conducted on benchmark datasets. Experimental results show that EPVT achieves 82.6% classification accuracy on ImageNet-1K, outperforming most SOTA models with lower computational complexity.
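To make the composition of the three components concrete, the sketch below shows one plausible EPVT-style stage in PyTorch: multi-head self-attention stands in for the long-range modelling attributed to the SIF module, and a feed-forward network with an inner depth-wise convolution plays the CFFN role of encoding position. This is only a reading of the abstract under stated assumptions, not the published EPVT code; all module names and hyper-parameters are illustrative.

import torch
import torch.nn as nn

class ConvFeedForward(nn.Module):
    # CFFN-style sketch: a depth-wise convolution between the two linear layers
    # injects positional information by temporarily reshaping tokens into a map.
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.dw_conv = nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1, groups=hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x, h, w):
        b, n, _ = x.shape
        x = self.fc1(x)
        feat = x.transpose(1, 2).reshape(b, -1, h, w)          # tokens -> feature map
        x = x + self.dw_conv(feat).flatten(2).transpose(1, 2)  # positional residual
        return self.fc2(self.act(x))

class EPVTBlock(nn.Module):
    # One hypothesised stage: attention for long-distance dependence (SIF role),
    # convolutional feed-forward network for position encoding (CFFN role).
    def __init__(self, dim, num_heads=8, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = ConvFeedForward(dim, dim * mlp_ratio)

    def forward(self, x, h, w):
        y = self.norm1(x)
        x = x + self.attn(y, y, y, need_weights=False)[0]      # global token mixing
        return x + self.ffn(self.norm2(x), h, w)               # local/positional mixing

block = EPVTBlock(256)
tokens = torch.randn(2, 14 * 14, 256)      # a 14x14 grid of 256-d patch embeddings
out = block(tokens, 14, 14)                # output keeps the (2, 196, 256) shape

The local perception step from the earlier sketch would sit before the attention in a full stage; it is omitted here to keep the example short.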

Key words: Deep learning, Image classification, Depth-wise convolution, Visual transformer, Self-attention mechanism

CLC Number: TP391