Computer Science ›› 2024, Vol. 51 ›› Issue (6A): 230600186-6. doi: 10.11896/jsjkx.230600186

• Image Processing & Multimedia Technology •

Occluded Video Instance Segmentation Method Based on Feature Fusion of Tracking and Detection in Time Sequence

ZHENG Shenhai1,2, GAO Xi1, LIU Pengwei1, LI Weisheng1,2   

  1 College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
    2 Chongqing Key Laboratory of Image Cognition (Chongqing University of Posts and Telecommunications), Chongqing 400065, China
  • Published: 2024-06-06
  • Corresponding author: ZHENG Shenhai (zhengsh@cqupt.edu.cn)
  • About author: ZHENG Shenhai, born in 1988, Ph.D., associate professor. His main research interests include machine learning, pattern recognition, and medical image computing.
  • Supported by:
    National Natural Science Foundation of China (61902046), Science and Technology Research Program of Chongqing Municipal Education Commission (KJZD-K202200606), and Natural Science Foundation of Chongqing, China (2022NSCQ-MSX3746).

Abstract: Video instance segmentation is a visual task that has emerged in recent years; it extends image instance segmentation with a temporal dimension, aiming to segment the objects in every frame while tracking them across frames. The rapid development of the mobile Internet and artificial intelligence has produced massive amounts of video data, but owing to shooting angles, fast motion, and partial occlusion, objects in videos often appear fragmented or blurred, which makes accurately segmenting, processing, and analyzing targets in video data a significant challenge. A review of the literature and our experiments show that existing video instance segmentation methods perform poorly under occlusion. To address this problem, this paper proposes an improved occluded video instance segmentation algorithm that improves segmentation performance by fusing a Transformer with temporal features from tracking and detection. To strengthen the network's ability to learn spatial position information, the algorithm introduces the time dimension into the Transformer network and, exploiting the interdependence and mutual reinforcement among object detection, tracking, and segmentation in videos, proposes a fusion tracking module and a detection temporal feature module that effectively aggregate the tracking offsets of objects across frames, improving segmentation of occluded objects. Experiments on the OVIS and YouTube-VIS datasets verify the effectiveness of the proposed method: compared with current baseline methods, it achieves better segmentation accuracy, further demonstrating its superiority.
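
To make the temporal-fusion idea above concrete, the following PyTorch sketch shows one way such a step could look: instance queries of the current frame attend over queries tracked from neighbouring frames, so that an instance occluded now can borrow evidence from frames where it is visible. This is a minimal illustration under our own assumptions, not the authors' implementation; the module name TemporalFusion and all names and dimensions (d_model, num_heads, cur_q, mem_q) are hypothetical.

    # Hypothetical sketch of temporal feature fusion for occluded video
    # instance segmentation; not the paper's actual modules.
    import torch
    import torch.nn as nn

    class TemporalFusion(nn.Module):
        """Cross-attention from current-frame instance queries to queries
        aggregated from T previous frames (a stand-in for the paper's
        fusion tracking / detection temporal feature modules)."""

        def __init__(self, d_model: int = 256, num_heads: int = 8):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(d_model)

        def forward(self, cur_q: torch.Tensor, mem_q: torch.Tensor) -> torch.Tensor:
            # cur_q: (B, N, d)   instance queries of the current frame
            # mem_q: (B, T*N, d) queries tracked over T previous frames
            fused, _ = self.attn(cur_q, mem_q, mem_q)  # temporal cross-attention
            return self.norm(cur_q + fused)            # residual keeps per-frame detail

    # Toy check: 2 clips, 10 instance queries, 5-frame memory.
    fusion = TemporalFusion()
    out = fusion(torch.randn(2, 10, 256), torch.randn(2, 50, 256))
    print(out.shape)  # torch.Size([2, 10, 256])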

Key words: Video instance segmentation, Object detection, Object tracking, Temporal features, Occluded instances

CLC Number: TP391