Computer Science ›› 2025, Vol. 52 ›› Issue (10): 144-150. DOI: 10.11896/jsjkx.240800159

• Computer Graphics & Multimedia •

Target Tracking Method Based on Cross Scale Fusion of Features and Trajectory Prompts

WEN Jing, ZHANG Songsong, LI Xufeng   

  1. School of Computer and Information Technology, Shanxi University, Taiyuan 030006, China
  • Received: 2024-08-29  Revised: 2024-11-29  Online: 2025-10-15  Published: 2025-10-14
  • About author: WEN Jing, born in 1982, Ph.D, associate professor, master supervisor, is a member of CCF (No.22721M). Her main research interests include computer vision and machine learning.
  • Supported by: Research Project by Shanxi Scholarship Council of China (2022-008).

Abstract: When a Transformer is used alone for feature extraction in object tracking, its lack of inductive bias makes it difficult to adapt to changes in target scale and appearance. To address this, this paper proposes a target tracking method based on cross-scale fusion of features and trajectory prompts (Cross-Scale Fusion of Features and Trajectory Prompts Tracker, CSFTP-Tracker). When constructing the input to the tracking network, the template image and the search image are fed simultaneously into an encoder that fuses a CNN with a ViT. A key design element is the multi-level spatial-awareness pyramid (MSAP) module. First, the multi-scale CNN features are enhanced with self-attention to strengthen target location information. These multi-scale features are then fused with the F-embedding features of the ViT and fed into the ViT encoder. This fusion strategy not only enhances information interaction between patches within the ViT but also enables the network to exploit both the local features of the CNN and the global dependency modeling of the Transformer. Finally, the fused features extracted by the ViT, together with the trajectory prompt features, are fed into the decoder, where autoregressive learning is used to predict the target's position. Experimental results on the GOT-10k dataset show that, compared with the baseline models, the proposed network improves the average overlap (AO) by 1.3% and the success rate at the 0.5 threshold (SR0.5) by 1.4%.
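To make the data flow concrete, the following is a minimal PyTorch sketch of the pipeline the abstract describes: multi-scale CNN features enhanced with self-attention (the MSAP role), fused with ViT patch embeddings before the encoder, and a decoder conditioned on trajectory-prompt tokens that predicts quantised box coordinates autoregressively. All class names (MSAPSketch, CSFTPSketch), layer counts, dimensions, and the concatenation-based fusion are illustrative assumptions, not the authors' published implementation.

```python
import torch
import torch.nn as nn

class MSAPSketch(nn.Module):
    """Hypothetical stand-in for the multi-level spatial-awareness pyramid (MSAP)."""
    def __init__(self, dim=256):
        super().__init__()
        # Three CNN stages producing a feature pyramid at strides 8, 16 and 32.
        self.stages = nn.ModuleList([
            nn.Conv2d(3, dim, kernel_size=3, stride=8, padding=1),
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
        ])
        # Self-attention over each level to strengthen target location cues.
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, img):
        tokens_per_level, x = [], img
        for stage in self.stages:
            x = stage(x)
            t = x.flatten(2).transpose(1, 2)            # (B, H*W, C)
            enhanced, _ = self.attn(t, t, t)
            tokens_per_level.append(enhanced)
        return torch.cat(tokens_per_level, dim=1)       # multi-scale CNN tokens

class CSFTPSketch(nn.Module):
    """Hypothetical end-to-end sketch: fused CNN/ViT encoder + autoregressive decoder."""
    def __init__(self, dim=256, vocab=1000):
        super().__init__()
        self.msap = MSAPSketch(dim)
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # ViT patches
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.token_embed = nn.Embedding(vocab, dim)     # quantised coordinate tokens
        self.head = nn.Linear(dim, vocab)

    def forward(self, template, search, traj_tokens, box_tokens):
        # Patch embeddings of both crops (the "F-embedding" role in the abstract).
        vit_tokens = torch.cat([
            self.patch_embed(template).flatten(2).transpose(1, 2),
            self.patch_embed(search).flatten(2).transpose(1, 2)], dim=1)
        # Fuse multi-scale CNN tokens with ViT tokens before the ViT encoder.
        fused = torch.cat([vit_tokens, self.msap(search)], dim=1)
        memory = self.encoder(fused)
        # Trajectory prompts prefix the coordinate sequence in the decoder
        # (causal masking omitted for brevity).
        tgt = self.token_embed(torch.cat([traj_tokens, box_tokens], dim=1))
        return self.head(self.decoder(tgt, memory))     # next-token logits

model = CSFTPSketch()
template = torch.randn(1, 3, 128, 128)                  # template crop
search = torch.randn(1, 3, 256, 256)                    # search region
traj = torch.randint(0, 1000, (1, 4))                   # previous-frame box as prompt
boxes = torch.randint(0, 1000, (1, 4))                  # current box tokens (teacher forcing)
logits = model(template, search, traj, boxes)           # (1, 8, 1000)
```

Token-level concatenation is only one plausible reading of "fused with the F-embedding features"; per-level element-wise injection into the ViT stream would be another.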
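The quoted metrics are likewise easy to state precisely: AO is the mean IoU between predicted and ground-truth boxes across frames, and SR0.5 is the fraction of frames whose IoU exceeds 0.5. The sketch below uses plain per-frame averaging; the official GOT-10k evaluation additionally applies per-class balancing, which this sketch omits.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x, y, w, h)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2 = min(a[0] + a[2], b[0] + b[2])
    iy2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def ao_and_sr(pred_boxes, gt_boxes, threshold=0.5):
    """Return (AO, SR@threshold) over a sequence of frames."""
    overlaps = np.array([iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    return overlaps.mean(), float((overlaps > threshold).mean())

# Example: two frames, one exact hit and one miss at the 0.5 threshold.
ao, sr05 = ao_and_sr([(0, 0, 10, 10), (5, 5, 10, 10)],
                     [(0, 0, 10, 10), (0, 0, 10, 10)])  # AO ~0.57, SR0.5 = 0.5
```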

Key words: Transformer, Object tracking, Inductive bias, Encoder, Trajectory prompt

CLC Number: TP391