Computer Science ›› 2025, Vol. 52 ›› Issue (10): 144-150. doi: 10.11896/jsjkx.240800159

• Computer Graphics & Multimedia •

Target Tracking Method Based on Cross Scale Fusion of Features and Trajectory Prompts

WEN Jing, ZHANG Songsong, LI Xufeng

  1. School of Computer and Information Technology, Shanxi University, Taiyuan 030006, China
  • Received: 2024-08-29  Revised: 2024-11-29  Online: 2025-10-15  Published: 2025-10-14
  • Corresponding author: WEN Jing (wjing@sxu.edu.cn)
  • About author: WEN Jing, born in 1982, Ph.D, associate professor, master supervisor, is a member of CCF (No.22721M). Her main research interests include computer vision and machine learning.
  • Supported by:
    Research Project by Shanxi Scholarship Council of China (2022-008).

Abstract: When a Transformer alone is used for feature extraction in object tracking, the absence of inductive bias makes it difficult to adapt to changes in target scale and appearance. To address this, this paper leverages a CNN to introduce multi-scale properties and proposes a target tracking method based on cross-scale fusion of features and trajectory prompts (Cross Scale Fusion of Features and Trajectory Prompts Tracker, CSFTP-Tracker). When constructing the input of the tracking network, the template image and the search image are fed simultaneously into an encoder that fuses a CNN with a ViT, for which a Multi-Level Spatial Awareness Pyramid (MSAP) module is designed. First, the multi-scale CNN features are enhanced with a self-attention mechanism to strengthen target location information; these multi-scale features are then fused with the F-embeddings of the ViT and fed into the ViT encoder. This fusion strategy not only improves information interaction among patches within the ViT but also enables the network to exploit both the local features of the CNN and the global dependency modeling of the Transformer. Second, the fused features extracted by the ViT, together with the trajectory prompt features, are fed into the decoder, where the target position is predicted autoregressively. Experimental results on the GOT-10k dataset show that, compared with the baseline model, the proposed network improves the average overlap (AO) by 1.3% and the success rate at the 0.5 threshold (SR0.5) by 1.4%.

Key words: Transformer, Object tracking, Inductive bias, Encoder, Trajectory prompt
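To make the encoder-side description in the abstract concrete, the following minimal PyTorch sketch shows one way the MSAP idea could be realised: a small CNN pyramid whose levels are enhanced by self-attention, projected to the ViT embedding width, resampled to the patch grid, and added to the F-embeddings before the ViT encoder. The channel widths, strides, additive fusion, and all helper names are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (assumption-laden, not the authors' released code) of the MSAP idea:
# multi-scale CNN features are enhanced with self-attention and projected onto the
# ViT patch grid so they can be fused with the F-embeddings before the ViT encoder.
import torch.nn as nn
import torch.nn.functional as F


class MultiLevelSpatialAwarenessPyramid(nn.Module):
    def __init__(self, in_channels=3, embed_dim=768, num_heads=8):
        super().__init__()
        # Three-level CNN pyramid (assumed strides 4, 8, 16 relative to the input).
        self.stage1 = nn.Sequential(nn.Conv2d(in_channels, 64, 3, stride=4, padding=1), nn.GELU())
        self.stage2 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.GELU())
        self.stage3 = nn.Sequential(nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.GELU())
        # Per-level self-attention to emphasise target locations.
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(c, num_heads, batch_first=True) for c in (64, 128, 256)
        )
        # Project every level to the ViT embedding dimension.
        self.proj = nn.ModuleList(nn.Linear(c, embed_dim) for c in (64, 128, 256))

    def forward(self, image, patch_grid):
        """image: (B, 3, H, W); patch_grid: (h, w) of the ViT patch embeddings."""
        levels, x = [], image
        for stage in (self.stage1, self.stage2, self.stage3):
            x = stage(x)
            levels.append(x)
        fused = 0
        for feat, attn, proj in zip(levels, self.attn, self.proj):
            b, c, h, w = feat.shape
            tokens = feat.flatten(2).transpose(1, 2)            # (B, h*w, C)
            tokens, _ = attn(tokens, tokens, tokens)            # location-aware enhancement
            tokens = proj(tokens)                               # (B, h*w, D)
            maps = tokens.transpose(1, 2).reshape(b, -1, h, w)  # back to a feature map
            maps = F.interpolate(maps, size=patch_grid, mode="bilinear", align_corners=False)
            fused = fused + maps.flatten(2).transpose(1, 2)     # (B, N_patches, D)
        return fused


# Usage sketch: add the cross-scale features to the ViT patch embeddings of the
# concatenated template/search input before running the ViT encoder blocks.
#   f_emb = vit.patch_embed(images)                      # (B, N, 768)
#   f_emb = f_emb + msap(images, (H // 16, W // 16))
#   encoded = vit.blocks(f_emb)
```

The decoder side can be sketched in the same hedged spirit: the target box is predicted autoregressively as four coordinate tokens, conditioned on the fused encoder features and on trajectory prompt tokens derived from boxes in earlier frames. The vocabulary size, prompt construction, and greedy decoding below are assumptions rather than the paper's exact settings.

```python
# Equally hypothetical sketch of the prediction step: the box is decoded
# autoregressively as four coordinate tokens, conditioned on the fused encoder
# features (memory) and on trajectory prompt tokens built from past boxes.
import torch
import torch.nn as nn


class TrajectoryPromptDecoder(nn.Module):
    def __init__(self, embed_dim=768, vocab_size=1000, num_layers=3):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, embed_dim)  # quantised box coordinates
        self.prompt_proj = nn.Linear(4, embed_dim)               # one prompt token per past box
        layer = nn.TransformerDecoderLayer(embed_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.head = nn.Linear(embed_dim, vocab_size)

    @torch.no_grad()
    def greedy_decode(self, memory, past_boxes, start_id=0):
        """memory: (B, N, D) fused encoder features; past_boxes: (B, T_prompt, 4)."""
        b = memory.size(0)
        prompt = self.prompt_proj(past_boxes)                    # trajectory prompt tokens
        seq = torch.full((b, 1), start_id, dtype=torch.long, device=memory.device)
        for _ in range(4):                                       # x, y, w, h
            tgt = torch.cat([prompt, self.token_embed(seq)], dim=1)
            out = self.decoder(tgt, memory)
            next_tok = self.head(out[:, -1]).argmax(dim=-1, keepdim=True)
            seq = torch.cat([seq, next_tok], dim=1)
        return seq[:, 1:]                                        # four coordinate tokens
```

In training, a decoder of this kind would typically be driven with teacher forcing and a token-level cross-entropy loss; the abstract does not fix those details, so they are left out of the sketch.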

CLC Number: TP391