Novel Image Classification Model Based on Depth-wise Convolution Neural Network and Visual Transformer

doi:10.11896/jsjkx.221100234

Abstract

Deep learning-based image classification models have been successfully applied in various scenarios. Current image classification models fall into two classes: CNN-based classifiers and Transformer-based classifiers. Because of their limited receptive field, CNN-based classifiers cannot model the global relations of an image, which lowers classification accuracy. Transformer-based classifiers, in contrast, usually segment the image into non-overlapping patches of equal size, which harms the local information between each pair of adjacent patches. Additionally, Transformer-based classification models often require pre-training on large datasets, resulting in high computational costs. To tackle these problems, this paper proposes an efficient pyramid vision Transformer (EPVT) based on depth-wise convolution, which extracts both the local and global information between adjacent image patches at a low computational cost. The EPVT model consists of three key components: a local perception (LP) module, a spatial information fusion (SIF) module, and a convolutional feed-forward network (CFFN) module. The LP module captures the local correlation of image patches. The SIF module fuses local information between adjacent image patches and improves the feature expression ability of EPVT by exploiting the long-distance dependence between different image patches. The CFFN module encodes the location information and reconstructs tensors between feature image patches. To validate the performance of EPVT, experiments are conducted on benchmark datasets, and the results show that EPVT achieves 82.6% classification accuracy on ImageNet-1K, outperforming most SOTA models with lower computational complexity.
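The low-cost claim rests on depth-wise convolution, which applies one spatial filter per channel instead of mixing all input channels into every output channel. The following is a minimal pure-Python sketch of that operation and of the resulting weight counts. It is not the paper's LP, SIF, or CFFN module; the function names, the 64-channel width, and the 3 x 3 kernel size are illustrative assumptions.

```python
def conv_params(c_in, c_out, k):
    """Weight count of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_params(channels, k):
    """Weight count of a depth-wise k x k convolution: one filter per channel."""
    return k * k * channels


def depthwise_conv2d(x, kernels):
    """Depth-wise 2D convolution with stride 1 and zero padding (same output size).

    x:       list of channels, each an H x W list of lists of numbers.
    kernels: one k x k kernel per channel, with k odd.
    Each channel is filtered only by its own kernel; channels are never mixed.
    """
    out = []
    for channel, kernel in zip(x, kernels):
        h, w, k = len(channel), len(channel[0]), len(kernel)
        pad = k // 2
        result = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                acc = 0.0
                for di in range(k):
                    for dj in range(k):
                        r, c = i + di - pad, j + dj - pad
                        # Positions outside the image count as zero padding.
                        if 0 <= r < h and 0 <= c < w:
                            acc += channel[r][c] * kernel[di][dj]
                result[i][j] = acc
        out.append(result)
    return out


# Hypothetical layer width chosen only for illustration.
standard = conv_params(64, 64, 3)
depthwise = depthwise_params(64, 3)
ratio = standard // depthwise
```

Under these assumed sizes the depth-wise layer needs a factor of `c_out` fewer weights than the standard layer, since the per-channel filters drop the cross-channel product. Because neighbouring output positions share overlapping input windows, such a layer can also carry information across what would otherwise be hard patch boundaries.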

Key words: Deep learning, Image classification, Depth-wise convolution, Visual transformer, Self-attention mechanism

CLC Number: TP391

ZHANG Feng, HUANG Shixin, HUA Qiang, DONG Chunru. Novel Image Classification Model Based on Depth-wise Convolution Neural Network and Visual Transformer[J]. Computer Science, 2024, 51(2): 196-204.

