Computer Science, 2026, Vol. 53, Issue (6A): 250700192-7. doi: 10.11896/jsjkx.250700192

• Image Processing & Multimedia Technology •

Pyramid Pooling Visual State Space Model for UAV-Satellite Cross-view Geo-localization

YUE Wenjie1, JIANG Jie1, ZHAN Lixin1, ZHOU Bingquan2, ZHOU Tianjian1   

  1 College of System Engineering, National University of Defense Technology, Changsha 410000, China
  2 China Academy of Information and Communications Technology, Beijing 100000, China
  • Online:2026-06-16 Published:2026-06-12
  • About author: YUE Wenjie, born in 2001, postgraduate. Her main research interest is UAV cross-view geo-localization.
    JIANG Jie, born in 1974, Ph.D., professor, Ph.D. supervisor. His main research interests include artificial intelligence and deep learning, visualization and visual analytics, virtual reality and intelligent interaction.

Abstract: Cross-view geo-localization between UAV and satellite images has emerged as a promising alternative to GNSS-INS, particularly in environments where satellite signals are weak or obstructed. However, significant visual discrepancies caused by differences in viewpoint, illumination, and resolution pose considerable challenges for image matching. To address this issue, a novel method called P2VSSM (Pyramid Pooling Visual State Space Model) is proposed. By integrating a pyramid pooling self-attention mechanism into the Mamba architecture, the model enhances feature extraction capabilities for cross-view images. The proposed PPSA module aggregates multi-scale contextual information, improving both semantic abstraction and global modeling. Additionally, the InfoNCE loss is introduced to replace the traditional triplet loss, thereby avoiding the heavy burden of hard negative mining and significantly improving the diversity of negative samples and the stability of contrastive learning during training. Experimental results on two public UAV-satellite datasets, University-1652 and SUES-200, demonstrate that the proposed method achieves state-of-the-art performance in both UAV-to-satellite and satellite-to-UAV retrieval tasks. Extensive ablation studies further confirm the effectiveness and robustness of the proposed approach.
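As a rough illustration of why InfoNCE removes the need for hard negative mining, the following minimal sketch computes the loss over a batch of paired UAV and satellite embeddings, where every non-matching image in the batch serves as a negative. This is not the authors' implementation: the function names, the symmetric two-direction averaging, and the temperature value of 0.1 are assumptions made for the example, and embeddings are assumed to be non-zero vectors.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce(queries, references, temperature=0.1):
    """Mean InfoNCE loss; queries[i] is assumed to match references[i].

    All other references in the batch act as negatives, so no explicit
    hard negative mining step is required.
    """
    q = [l2_normalize(v) for v in queries]
    r = [l2_normalize(v) for v in references]
    total = 0.0
    for i, q_i in enumerate(q):
        # Temperature-scaled cosine similarity to every reference in the batch.
        logits = [sum(a * b for a, b in zip(q_i, r_j)) / temperature for r_j in r]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of the matching (positive) reference.
        total += log_denominator - logits[i]
    return total / len(q)


def symmetric_info_nce(uav, satellite, temperature=0.1):
    """Average of the UAV-to-satellite and satellite-to-UAV losses (an assumption here)."""
    return 0.5 * (
        info_nce(uav, satellite, temperature)
        + info_nce(satellite, uav, temperature)
    )


if __name__ == "__main__":
    # Toy 2-D embeddings: pair i of uav matches pair i of satellite.
    uav = [[1.0, 0.0], [0.0, 1.0]]
    satellite = [[1.0, 0.1], [0.1, 1.0]]
    print("aligned loss:   ", symmetric_info_nce(uav, satellite))
    print("misaligned loss:", symmetric_info_nce(uav, satellite[::-1]))
```

Correctly paired embeddings yield a lower loss than the same embeddings with the pairing reversed, which is the signal that drives the two views of one location together during training.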

Key words: Cross-view geo-localization, UAV-satellite matching, Mamba, Pyramid pooling, Representation learning

CLC Number: 

  • TP391