Computer Science (计算机科学), 2024, Vol. 51, Issue (6A): 230600186-6. doi: 10.11896/jsjkx.230600186
ZHENG Shenhai (郑申海)¹˒², GAO Xi (高茜)¹, LIU Pengwei (刘鹏威)¹, LI Weisheng (李伟生)¹˒²
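Abstract: Video instance segmentation is a recently emerged vision task that extends image instance segmentation with temporal information. It aims to segment the objects in every frame while also tracking them across frames. The rapid growth of the mobile Internet and artificial intelligence has produced vast amounts of video data. However, objects in videos often appear fragmented or blurred because of camera viewpoint, fast motion, partial occlusion and similar factors. This makes it a major challenge to segment targets accurately in video data and then process and analyze them. A review of the literature and our own experiments show that existing video instance segmentation methods perform poorly under occlusion. To address this problem, we propose an improved occluded video instance segmentation algorithm. It improves segmentation performance by fusing a Transformer with the temporal features of tracking and detection. To strengthen the network's ability to learn spatial position information, the algorithm introduces the temporal dimension into the Transformer network. Object detection, tracking and segmentation in video are interdependent and mutually reinforcing. Building on this relationship, we propose two modules: a fusion tracking module that effectively aggregates the tracking offsets of objects across the video, and a detection temporal feature module. Together they improve object segmentation in occluded scenes. Experiments on the OVIS and YouTube-VIS datasets verify the effectiveness of the proposed method. Compared with current baseline methods, it achieves higher segmentation accuracy, which further demonstrates its advantage.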
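The abstract does not specify the architecture. As a rough, non-authoritative illustration, the NumPy sketch below shows two generic ideas that the abstract alludes to. The first brings the time dimension into Transformer self-attention: a T-frame clip is flattened into T×H×W tokens, and temporal and spatial sinusoidal encodings are added to them. The second aggregates features across frames using per-pixel tracking offsets. All function names, the single-head attention, the integer (dy, dx) offsets and the fixed blending weight `alpha` are hypothetical assumptions chosen for illustration. They are not the authors' fusion tracking or detection temporal feature modules.

```python
import numpy as np


def sinusoidal_encoding(n_positions: int, dim: int) -> np.ndarray:
    """Sine/cosine positional encoding of shape (n_positions, dim)."""
    pos = np.arange(n_positions)[:, None]
    i = np.arange(dim)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))


def space_time_self_attention(clip, w_q, w_k, w_v):
    """Hypothetical single-head self-attention over all T*H*W tokens of a clip.

    clip: (T, H, W, C) per-frame feature maps (assumed backbone output).
    The temporal encoding makes the same pixel in different frames a distinct
    token, so attention can relate object features across time.
    """
    t, h, w, c = clip.shape
    temporal = sinusoidal_encoding(t, c)[:, None, None, :]       # (T, 1, 1, C)
    spatial = sinusoidal_encoding(h * w, c).reshape(1, h, w, c)  # (1, H, W, C)
    tokens = (clip + temporal + spatial).reshape(t * h * w, c)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)                 # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)                     # each row sums to 1
    return (attn @ v).reshape(t, h, w, -1)


def offset_guided_aggregation(prev_feats, cur_feats, offsets, alpha=0.5):
    """Hypothetical fusion: blend current features with previous-frame
    features fetched along per-pixel tracking offsets.

    prev_feats, cur_feats: (H, W, C); offsets: (H, W, 2) integer (dy, dx)
    pointing from each current pixel to its matched location in the previous
    frame. Matches that fall outside the map are clamped to the border.
    """
    h, w, _ = cur_feats.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_y = np.clip(ys + offsets[..., 0], 0, h - 1)
    src_x = np.clip(xs + offsets[..., 1], 0, w - 1)
    warped = prev_feats[src_y, src_x]                            # (H, W, C)
    return alpha * cur_feats + (1.0 - alpha) * warped


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, H, W, C = 3, 4, 4, 8
    clip = rng.standard_normal((T, H, W, C))
    w_q, w_k, w_v = (0.1 * rng.standard_normal((C, C)) for _ in range(3))
    print(space_time_self_attention(clip, w_q, w_k, w_v).shape)  # (3, 4, 4, 8)

    # Stand-in for predicted tracking offsets: every pixel in frame 1 is
    # matched to the pixel one column to its right in frame 0.
    offsets = np.zeros((H, W, 2), dtype=int)
    offsets[..., 1] = 1
    fused = offset_guided_aggregation(clip[0], clip[1], offsets)
    print(fused.shape)  # (4, 4, 8)
```

Full space-time attention is quadratic in T·H·W. A practical model would therefore usually apply it to downsampled feature maps or replace it with sparse or deformable attention. In a real system the offsets would be predicted by a network and would usually be fractional, so bilinear sampling would replace the integer indexing used here.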