Computer Science ›› 2024, Vol. 51 ›› Issue (2): 196-204.doi: 10.11896/jsjkx.221100234

• Computer Graphics & Multimedia •

Novel Image Classification Model Based on Depth-wise Convolution Neural Network and Visual Transformer

ZHANG Feng, HUANG Shixin, HUA Qiang, DONG Chunru   

  1. Hebei Key Laboratory of Machine Learning and Computational Intelligence, College of Mathematics and Information Science, Hebei University, Baoding, Hebei 071002, China
  • Received:2022-11-28 Revised:2023-06-14 Online:2024-02-15 Published:2024-02-22
  • About author: ZHANG Feng, born in 1976, Ph.D, associate professor, master supervisor, is a member of CCF (No.65203M). Her main research interests include machine learning and intelligent decision-making. DONG Chunru, born in 1980, Ph.D, associate professor, master supervisor. His main research interests include deep learning and image processing.
  • Supported by:
    National Key R&D Program of China (2022YFE0196100), Natural Science Foundation of Hebei Province (F2018201115), Key Scientific Research Foundation of Education Department of Hebei Province (ZD2019021) and Hebei University High-level Innovative Talent Research Start-up Funding Project.

Abstract: Deep learning-based image classification models have been successfully applied in various scenarios. Current image classification models fall into two categories: CNN-based classifiers and Transformer-based classifiers. Because of their limited receptive field, CNN-based classifiers cannot model the global relations within an image, which lowers classification accuracy. Transformer-based classifiers, in turn, usually segment the image into non-overlapping patches of equal size, which damages the local information between adjacent patches. In addition, Transformer-based classification models often require pre-training on large datasets, resulting in high computational costs. To tackle these problems, this paper proposes an efficient pyramid vision Transformer (EPVT) based on depth-wise convolution, which extracts both the local and global information between adjacent image patches at a low computational cost. The EPVT model consists of three key components: a local perception (LP) module, a spatial information fusion (SIF) module and a convolutional feed-forward network (CFFN) module. The LP module captures the local correlation of image patches. The SIF module fuses local information between adjacent image patches and improves the feature expression ability of EPVT by exploiting the long-distance dependence between different image patches. The CFFN module encodes location information and reconstructs tensors between feature image patches. To validate the performance of EPVT, various experiments are conducted on benchmark datasets, and the results show that EPVT achieves 82.6% classification accuracy on ImageNet-1K, outperforming most SOTA models at a lower computational complexity.
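
As a rough illustration of why depth-wise convolution keeps the computational cost low, the sketch below filters each channel with its own kernel and compares weight counts against a standard convolution. It is a minimal pure-Python sketch, not the EPVT implementation: the function names `depthwise_conv2d` and `conv_weight_count`, the stride of 1, the absence of padding and bias, and the 64-channel 3×3 example are all assumptions made for illustration.

```python
def depthwise_conv2d(x, kernels):
    """Depth-wise 2D convolution (stride 1, no padding, no bias).

    x       : list of C channels, each an H x W list of lists
    kernels : list of C kernels, each a k x k list of lists
    Each channel is filtered only by its own kernel; channels are not mixed.
    """
    out = []
    for channel, kernel in zip(x, kernels):
        k = len(kernel)
        h, w = len(channel), len(channel[0])
        out.append([
            [
                sum(channel[i + di][j + dj] * kernel[di][dj]
                    for di in range(k) for dj in range(k))
                for j in range(w - k + 1)
            ]
            for i in range(h - k + 1)
        ])
    return out


def conv_weight_count(c_in, c_out, k, depthwise=False):
    """Number of weights (bias excluded) in a k x k convolution layer."""
    if depthwise:
        # One k x k filter per input channel; c_out equals c_in and is ignored.
        return c_in * k * k
    # Standard convolution: every output channel sees every input channel.
    return c_in * c_out * k * k


if __name__ == "__main__":
    # Hypothetical layer size chosen only for the comparison.
    standard = conv_weight_count(64, 64, 3)                   # 36864
    depthwise = conv_weight_count(64, 64, 3, depthwise=True)  # 576
    print(standard, depthwise, standard // depthwise)
```

Under these assumptions the depth-wise layer needs 1/64 of the weights of the standard layer, because it drops the cross-channel mixing. Architectures that use depth-wise convolution commonly restore that mixing with a separate 1×1 convolution or, in a Transformer, with the attention and feed-forward layers.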

Key words: Deep learning, Image classification, Depth-wise convolution, Visual transformer, Self-attention mechanism

CLC Number: 

  • TP391