Computer Science ›› 2025, Vol. 52 ›› Issue (6A): 240700183-6. doi: 10.11896/jsjkx.240700183

• Intelligent Medical Engineering •


Bi-MI ViT: Bi-directional Multi-level Interaction Vision Transformer for Lung CT Image Classification

LONG Xiao1, HUANG Wei2, HU Kai1   

  1. School of Computer Science & School of Cyberspace Science, Xiangtan University, Xiangtan, Hunan 411105, China
    2. Computer Medical Image Processing Research Center, Department of Radiology, The First Hospital of Changsha, Changsha 410005, China
  • Online: 2025-06-16  Published: 2025-06-12
  • Corresponding author: HU Kai (kaihu@xtu.edu.cn)
  • About author: LONG Xiao (1664208966@qq.com), born in 2003, undergraduate. Her main research interests include deep learning and medical image processing.
    HU Kai, born in 1984, Ph.D., professor, is a senior member of CCF. His main research interests include machine learning, pattern recognition, bioinformatics, and medical image processing.
  • Supported by:
    National Natural Science Foundation of China (62272404), Natural Science Foundation of Hunan Province of China (2022JJ30571), Science and Technology Department of Hunan Province (2021SK53105), Project of Education Department of Hunan Province (23A0146), Innovation and Entrepreneurship Training Program for Hunan University Students (S202310530178) and Project of Undergraduate Teaching Reform Research of Hunan Province (202401000574).

Abstract: In recent years, the local-window-based self-attention mechanism has gained prominence in vision tasks. However, its limited receptive field and weak modeling ability make it less effective on complex data. The features in lung CT images are complex and diverse, including the shape, size, and density of nodules, which makes mining the deep features of the data challenging. To address these issues, this paper proposes a bi-directional multi-level interaction vision Transformer (Bi-MI ViT) backbone network that effectively integrates spatial and channel information through an innovative bi-directional multi-level interaction mechanism. This integration significantly improves the accuracy and comprehensiveness of feature extraction. Within the Transformer branch, we introduce an efficient cascaded group attention (CGA) strategy to enrich the diversity of attention-head features and enhance the model's ability to capture key information. Simultaneously, in the convolutional neural network (CNN) branch, we design a depth-wise and point-wise (DP) block that uses point-wise convolution (PW) and depth-wise convolution (DW) to deeply mine local information and strengthen the model's representation ability. Additionally, a deep feature extraction (DFE) module enhances feature propagation and reuse while improving data utilization efficiency, leading to substantial performance gains. Experimental results on both the public COVID-CT dataset and the private LUAD-CT dataset demonstrate that the proposed method outperforms eight comparison methods in classification accuracy.
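
The cascaded group attention mentioned above follows the design popularized by EfficientViT: each attention head works on its own channel chunk of the input, and each head's output is fed forward into the next head's input so that later heads see progressively refined features. Below is a minimal PyTorch sketch of that idea; the module name, layer shapes, and the omission of the extra token-interaction convolutions used in practice are illustrative assumptions, not the authors' exact implementation.

import torch
import torch.nn as nn

class CascadedGroupAttention(nn.Module):
    """Minimal sketch of cascaded group attention (CGA): the input is split
    along the channel dimension into one chunk per head, each head runs
    self-attention on its chunk, and the output of head i is added to the
    input chunk of head i+1 (the cascade)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = head_dim ** -0.5
        # One q/k/v projection per head, each operating on its own chunk.
        self.qkvs = nn.ModuleList(
            [nn.Linear(head_dim, head_dim * 3) for _ in range(num_heads)]
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        chunks = x.chunk(self.num_heads, dim=-1)
        outs, carry = [], 0
        for i, qkv in enumerate(self.qkvs):
            inp = chunks[i] + carry            # cascade: feed previous head's output forward
            q, k, v = qkv(inp).chunk(3, dim=-1)
            attn = (q @ k.transpose(-2, -1)) * self.scale
            out = attn.softmax(dim=-1) @ v
            outs.append(out)
            carry = out
        return self.proj(torch.cat(outs, dim=-1))

# Example: tokens from a 14x14 feature map with 128 channels.
# y = CascadedGroupAttention(dim=128, num_heads=4)(torch.randn(2, 196, 128))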
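For the CNN branch, the abstract pairs point-wise (1x1) convolution for channel mixing with depth-wise convolution for per-channel spatial filtering, the pattern familiar from depthwise-separable CNNs such as Xception and MobileNets. A minimal sketch of a DP-style block under that common PW/DW reading follows; the ordering, normalization, activation, and residual connection are assumptions for illustration, not the paper's exact design.

import torch
import torch.nn as nn

class DPBlock(nn.Module):
    """Sketch of a DP block: a 1x1 point-wise (PW) convolution mixes
    channels, a 3x3 depth-wise (DW) convolution mines local spatial
    information per channel, and a residual connection keeps gradients
    flowing."""

    def __init__(self, channels: int):
        super().__init__()
        self.pw = nn.Conv2d(channels, channels, kernel_size=1)   # channel mixing
        self.dw = nn.Conv2d(channels, channels, kernel_size=3,
                            padding=1, groups=channels)          # per-channel spatial filtering
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.act(self.bn(self.dw(self.pw(x))))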
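The DFE module is described as enhancing feature propagation and reuse, which is the hallmark of densely connected designs in the DenseNet tradition. The sketch below shows one plausible reading under that assumption: every layer receives the concatenation of all preceding feature maps. The layer count and growth rate are illustrative, not the paper's configuration.

import torch
import torch.nn as nn

class DeepFeatureExtraction(nn.Module):
    """Sketch of a DFE module with dense connectivity: each layer takes the
    concatenation of the input and all earlier layer outputs, so features
    are propagated forward and reused rather than recomputed."""

    def __init__(self, in_channels: int, growth: int = 32, num_layers: int = 3):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch),
                nn.ReLU(inplace=True),
                nn.Conv2d(ch, growth, kernel_size=3, padding=1),
            ))
            ch += growth  # the next layer sees all previous outputs concatenated

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        # Output carries in_channels + num_layers * growth channels.
        return torch.cat(feats, dim=1)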

Key words: Lung CT images, Bi-directional multi-level interaction, Convolutional neural network, Transformer, Classification

CLC number: TP391