Computer Science (计算机科学), 2025, 52(10): 144-150. doi: 10.11896/jsjkx.240800159
温静, 张松松, 李旭峰
WEN Jing, ZHANG Songsong, LI Xufeng
Abstract: When a Transformer alone is used for feature extraction in object tracking, its lack of inductive bias prevents it from adapting to changes in target scale and appearance. To address this, multi-scale characteristics are introduced via a CNN, and an object tracking method based on cross-scale feature fusion and trajectory prompts (Cross Scale Fusion of Features and Trajectory Prompts Tracker, CSFTP-Tracker) is proposed. When constructing the tracker's input, the template image and the search image are fed jointly into an encoder that fuses a CNN with a ViT network, within which a Multi-Level Spatial Awareness Pyramid (MSAP) module is designed. First, the multi-scale CNN features are enhanced with target-position information through a self-attention mechanism; the resulting multi-scale features are then fused with the ViT's F-embeddings and fed into the ViT encoder. This fusion strategy not only strengthens the information interaction among patches inside the ViT, but also lets the network exploit the local modeling capability of CNNs and the global dependency modeling of Transformers simultaneously. Second, the fused features extracted by the ViT are fed, together with trajectory prompt features, into a decoder, which learns the target position autoregressively. Experimental results on the GOT-10k dataset show that, compared with the baseline model, the proposed network improves the average overlap (AO) by 1.3% and the success rate at a threshold of 0.5 (SR0.5) by 1.4%.
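To make the encoder-side fusion concrete, the following is a minimal, hypothetical PyTorch sketch of the MSAP idea described in the abstract: extract CNN features at several scales, enhance them with self-attention, project them to the ViT embedding width, and fuse them with the patch embeddings (F-embeddings). The module layout, channel widths, strides, and the additive fusion at the stride-16 scale are all illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of the MSAP cross-scale fusion idea; all names, channel
# widths, and the fusion rule are assumptions for illustration only.
import torch
import torch.nn as nn

class MSAP(nn.Module):
    """Multi-Level Spatial Awareness Pyramid (illustrative sketch).

    Produces CNN features at strides 4/8/16, projects each scale to the
    ViT embedding width, and runs self-attention over the concatenated
    multi-scale tokens to strengthen target-position cues.
    """
    def __init__(self, in_ch=3, embed_dim=768, num_heads=8):
        super().__init__()
        self.stem = nn.Sequential(                      # stride 4
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.down1 = nn.Conv2d(64, 128, 3, stride=2, padding=1)   # stride 8
        self.down2 = nn.Conv2d(128, 256, 3, stride=2, padding=1)  # stride 16
        self.proj = nn.ModuleList(
            nn.Conv2d(c, embed_dim, 1) for c in (64, 128, 256)
        )
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, x):
        c4 = self.stem(x)
        c8 = self.down1(c4)
        c16 = self.down2(c8)
        tokens = torch.cat([
            p(f).flatten(2).transpose(1, 2)   # (B, H*W, D) per scale
            for p, f in zip(self.proj, (c4, c8, c16))
        ], dim=1)
        enhanced, _ = self.attn(tokens, tokens, tokens)
        return enhanced   # multi-scale tokens ready to fuse with F-embeddings

if __name__ == "__main__":
    search = torch.randn(1, 3, 128, 128)        # illustrative search image
    msap_tokens = MSAP()(search)
    patch_embed = nn.Conv2d(3, 768, 16, 16)     # standard ViT patch embedding
    f_embeddings = patch_embed(search).flatten(2).transpose(1, 2)
    # One simple fusion choice (an assumption): add the stride-16 MSAP tokens
    # to the patch embeddings before they enter the ViT encoder.
    fused = f_embeddings + msap_tokens[:, -f_embeddings.shape[1]:, :]
    print(fused.shape)                          # torch.Size([1, 64, 768])
```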
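For the decoder side, the sketch below illustrates the autoregressive localization step in the spirit of Pix2Seq/SeqTrack-style sequence prediction: the decoder consumes the fused encoder tokens together with trajectory prompt tokens built from previous-frame boxes, and emits the target box as four quantized coordinate tokens, one at a time. The bin count, layer sizes, and greedy decoding are assumed choices; this is not the paper's actual decoder.

```python
# A hedged sketch of autoregressive box decoding with trajectory prompts;
# NUM_BINS, the prompt format, and greedy decoding are illustrative assumptions.
import torch
import torch.nn as nn

NUM_BINS = 1000   # quantization bins for (x, y, w, h); an assumed choice

class AutoregressiveBoxDecoder(nn.Module):
    def __init__(self, embed_dim=768, num_heads=8, num_layers=2):
        super().__init__()
        layer = nn.TransformerDecoderLayer(embed_dim, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.coord_embed = nn.Embedding(NUM_BINS + 1, embed_dim)  # +1 for <start>
        self.head = nn.Linear(embed_dim, NUM_BINS)

    @torch.no_grad()
    def forward(self, memory, traj_prompt):
        """memory: fused encoder tokens (B, N, D);
        traj_prompt: embedded previous-frame boxes (B, T, D)."""
        B = memory.size(0)
        seq = torch.full((B, 1), NUM_BINS, dtype=torch.long)  # <start> token
        for _ in range(4):   # predict x, y, w, h one token at a time
            tgt = torch.cat([traj_prompt, self.coord_embed(seq)], dim=1)
            out = self.decoder(tgt, memory)
            next_tok = self.head(out[:, -1]).argmax(-1, keepdim=True)
            seq = torch.cat([seq, next_tok], dim=1)
        return seq[:, 1:]    # four quantized coordinates for the current frame

if __name__ == "__main__":
    dec = AutoregressiveBoxDecoder()
    memory = torch.randn(1, 64, 768)    # fused encoder tokens
    traj = torch.randn(1, 3, 768)       # prompts from three previous frames
    print(dec(memory, traj))            # four quantized box coordinates
```

Conditioning each coordinate token on the trajectory prompts as well as the previously emitted tokens is what lets the decoder exploit motion continuity across frames rather than localizing each frame independently.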