Remote Sensing Image Object Detection Based on a Shifted-Window Pyramid Transformer

doi:10.11896/jsjkx.211100208

Computer Science ›› 2023, Vol. 50 ›› Issue (1): 105-113. doi: 10.11896/jsjkx.211100208

• Computer Graphics & Multimedia •

SPT: Swin Pyramid Transformer for Object Detection of Remote Sensing

CAI Xiao¹, CHEN Zhihua¹, SHENG Bin²

1 School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
2 School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China

Received: 2021-11-22 Revised: 2022-06-08 Online: 2023-01-15 Published: 2023-01-09
Corresponding author: CHEN Zhihua (czh@ecust.edu.cn)
Author e-mail: 1060627557@qq.com
About author: CAI Xiao, born in 1996, postgraduate, is a member of China Computer Federation. His main research interests include image processing and computer graphics.
CHEN Zhihua, born in 1969, Ph.D, professor, is a member of China Computer Federation. His main research interests include image processing and computer graphics.
Supported by:
National Natural Science Foundation of China (61672228, 62076127); Joint Fund of the Ministry of Education for Equipment Pre-research (6141A02022373).

Abstract

Abstract: Object detection is a fundamental and widely studied task in computer vision. Object detection in remote sensing images has also become a research hotspot because of its value in transportation, military, and agricultural applications. Compared with natural images, remote sensing images suffer from complex background interference and from factors such as weather, small objects, and irregularly shaped objects, so achieving high detection accuracy is very challenging. This paper proposes a novel object detection network based on the shifted-window Transformer, the Swin Pyramid Transformer (SPT). SPT uses shifted-window Transformer modules as the feature-extraction backbone: the self-attention mechanism of the Transformer is effective for detecting objects against cluttered backgrounds, and the shifted-window scheme avoids much of the quadratic-complexity computation. After obtaining the feature maps extracted by the backbone, SPT uses a pyramid architecture to fuse local and global features of different scales and semantics, which reduces information loss between feature layers and captures the inherent multi-scale hierarchical relationships. In addition, this paper proposes a self-mixed Transformer (SMT) module and a cross-layer Transformer (CLT) module. SMT re-renders the deep (highest-level) feature map to strengthen object feature recognition and expression, while CLT rearranges the information expressed by the pixels of each feature layer according to the level of feature context interaction. The CLT module is integrated into the bottom-up and top-down bidirectional feature paths of the pyramid to make full use of global and local information with different semantics. SPT is trained and tested on the UCAS-AOD and RSOD datasets. Experimental results show that SPT performs well on remote sensing image object detection and is especially suited to irregular and small object categories, such as overpasses and cars.

Key words: Deep learning, Object detection, Remote sensing, Attention mechanism, Transformer
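
The abstract's claim that the shifted-window scheme avoids much of the quadratic-complexity computation can be illustrated with a small sketch. The code below is not the authors' implementation; it is a minimal NumPy illustration of the general shifted-window idea, and the function names (`window_partition`, `shifted_window_partition`, `attention_pairs`) and the 56×56 map with 7×7 windows are assumptions chosen for the example. It omits the attention mask that a full shifted-window layer applies to tokens wrapped around by the cyclic shift.

```python
import numpy as np

def window_partition(x, window):
    """Split an (H, W, C) feature map into non-overlapping windows.

    Returns an array of shape (num_windows, window * window, C).
    """
    h, w, c = x.shape
    assert h % window == 0 and w % window == 0, "H and W must be multiples of window"
    x = x.reshape(h // window, window, w // window, window, c)
    # Group the two window-grid axes, then flatten each window into tokens.
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window * window, c)

def shifted_window_partition(x, window):
    """Cyclically shift the map by half a window before partitioning.

    The shift lets tokens that sat in different windows in the previous
    layer share a window in this one (masking is omitted in this sketch).
    """
    shift = window // 2
    shifted = np.roll(x, shift=(-shift, -shift), axis=(0, 1))
    return window_partition(shifted, window)

def attention_pairs(h, w, window=None):
    """Count query-key pairs scored by one attention layer."""
    n = h * w
    if window is None:
        return n * n  # global attention: quadratic in the token count
    tokens_per_window = window * window
    return (n // tokens_per_window) * tokens_per_window ** 2

# Hypothetical sizes for illustration only.
feat = np.arange(8 * 8, dtype=np.float32).reshape(8, 8, 1)
regular = window_partition(feat, 4)
shifted = shifted_window_partition(feat, 4)

global_cost = attention_pairs(56, 56)
window_cost = attention_pairs(56, 56, window=7)
print(regular.shape, shifted.shape, global_cost // window_cost)
```

With a fixed window size, the windowed cost grows linearly with the number of tokens, whereas global self-attention grows quadratically; alternating regular and shifted partitions is what restores information flow across window borders.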

CLC Number:

TP751

CAI Xiao, CHEN Zhihua, SHENG Bin. SPT: Swin Pyramid Transformer for Object Detection of Remote Sensing[J]. Computer Science, 2023, 50(1): 105-113. https://doi.org/10.11896/jsjkx.211100208


