Computer Science, 2023, Vol. 50, Issue (1): 105-113. doi: 10.11896/jsjkx.211100208

• Computer Graphics & Multimedia •

SPT: Swin Pyramid Transformer for Object Detection of Remote Sensing

CAI Xiao1, CHEN Zhihua1, SHENG Bin2

  1 School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
  2 School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
  • Received: 2021-11-22 Revised: 2022-06-08 Online: 2023-01-15 Published: 2023-01-09
  • About author: CAI Xiao, born in 1996, postgraduate, is a member of the China Computer Federation. His main research interests include image processing and computer graphics.
    CHEN Zhihua, born in 1969, Ph.D., professor, is a member of the China Computer Federation. His main research interests include image processing and computer graphics.
  • Supported by:
    National Natural Science Foundation of China (62076127).

Abstract: Object detection is a fundamental and widely studied task in computer vision. Because object detection in remote sensing images has important applications in transportation, the military, agriculture, and other fields, it has become a major research focus. Compared with natural images, remote sensing images are affected by many factors, such as complex background interference, weather, irregular objects, and small objects, so achieving high accuracy in remote sensing object detection is extremely challenging. This paper proposes a novel Transformer-based object detection network, the Swin Pyramid Transformer (SPT). SPT uses a sliding-window Transformer module as the feature-extraction backbone. The self-attention mechanism of the Transformer is very effective for detecting objects against a cluttered background, and the sliding-window scheme avoids a large amount of quadratic-complexity computation. After obtaining the feature maps extracted by the backbone, SPT uses a pyramid architecture to fuse features of different scales and semantic levels, which reduces the loss of information between feature layers and captures the inherent multi-scale hierarchical relationships. In addition, this paper proposes a self-mixed Transformer (SMT) module and a cross-layer Transformer (CLT) module. SMT re-renders the highest-level feature map to enhance the recognition and expression of object features. CLT rearranges the feature expression of each pixel in every feature layer according to the contextual interaction between features, and it is integrated into both the bottom-up and top-down paths of the pyramid to make full use of global and local information carrying different semantics. The SPT model is trained and tested on the UCAS-AOD and RSOD datasets. Experimental results show that SPT performs well on remote sensing object detection and is especially suitable for irregular and small object categories, such as overpass and car.
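The abstract's claim that windowed attention avoids quadratic-complexity computation can be made concrete with a small counting exercise. The sketch below is not the authors' implementation: it only counts query-key pairs for global attention versus attention restricted to non-overlapping windows, and shows how a feature map is split into such windows. The function names, the 56x56 map size, and the window size of 7 are illustrative assumptions, and the window-shifting step and channel dimension are ignored.

```python
def global_attention_pairs(height, width):
    """Query-key pairs when every token attends to every token."""
    tokens = height * width
    return tokens * tokens


def window_attention_pairs(height, width, window):
    """Query-key pairs when attention is limited to window x window tiles."""
    if height % window or width % window:
        raise ValueError("height and width must be divisible by window")
    num_windows = (height // window) * (width // window)
    tokens_per_window = window * window
    return num_windows * tokens_per_window ** 2


def partition_windows(grid, window):
    """Split a 2D list into non-overlapping tiles, in row-major order."""
    height, width = len(grid), len(grid[0])
    tiles = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            tiles.append([row[left:left + window]
                          for row in grid[top:top + window]])
    return tiles


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    h, w, win = 56, 56, 7
    full = global_attention_pairs(h, w)
    local = window_attention_pairs(h, w, win)
    print(f"global: {full} pairs, windowed: {local} pairs, "
          f"ratio: {full // local}x")

    grid = [[r * 4 + c for c in range(4)] for r in range(4)]
    for tile in partition_windows(grid, 2):
        print(tile)
```

Under these assumptions the global cost grows with the square of the number of tokens, while the windowed cost grows linearly with it for a fixed window size, which is why the saving becomes larger on high-resolution remote sensing images.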

Key words: Deep learning, Object detection, Remote sensing, Attention mechanism, Transformer

CLC Number: 

  • TP751