Computer Science ›› 2024, Vol. 51 ›› Issue (5): 108-116. doi: 10.11896/jsjkx.230300232
WANG Ping (王萍), YU Zhenhuang (余圳煌), LU Lei (鲁磊)
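Abstract: Existing partial near-duplicate video detection algorithms suffer from high feature storage cost and low overall query efficiency, and their feature extraction ignores the subtle semantic differences between near-duplicate frames. To address these problems, this paper proposes a partial near-duplicate video detection algorithm based on Transformer compact encoding. First, a Transformer-based feature encoder is proposed. Having learned the subtle semantic differences among a large number of near-duplicate frames, the encoder applies a self-attention mechanism to the regional feature maps when encoding frame features, which reduces the dimensionality of the frame features while improving the representational quality of the encoded features. The encoder is trained with a siamese network that learns the shared semantic information between near-duplicate frames without negative samples, so the laborious and difficult annotation of hard negative samples is unnecessary and training becomes simpler and more efficient. Second, a key frame extraction method based on the video self-similarity matrix is proposed. It extracts rich but non-redundant key frames from a video, so that the key frame feature sequence describes the original video content more comprehensively and improves detection performance, while greatly reducing the overhead of storing and processing redundant key frames. Finally, based on the low-dimensional compact encoded features of the key frames, a graph-network-based temporal alignment algorithm detects and localizes partial near-duplicate video segments. On VCDB, a public dataset for partial near-duplicate video detection, the proposed algorithm outperforms existing algorithms.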
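The key frame extraction step can be illustrated with a minimal sketch. The abstract does not give the paper's actual selection rule, so everything below is an assumption made for illustration: frame features are plain lists of floats, similarity is cosine similarity, and a frame becomes a key frame only when its similarity to the most recently kept key frame falls below a hypothetical `threshold`. The function names `self_similarity_matrix` and `select_key_frames` are likewise invented here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def self_similarity_matrix(features):
    """Pairwise cosine similarity between all frame features of one video."""
    n = len(features)
    return [[cosine(features[i], features[j]) for j in range(n)] for i in range(n)]


def select_key_frames(features, threshold=0.9):
    """Greedy selection (assumed rule, not the paper's): keep frame i only if
    it is less similar than `threshold` to the last key frame already kept."""
    sim = self_similarity_matrix(features)
    keys = []
    for i in range(len(features)):
        if not keys or sim[i][keys[-1]] < threshold:
            keys.append(i)
    return keys


# Toy video: frames 0-1 look alike, frames 2-3 look alike, frame 4 returns to the first scene.
frames = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.05, 1.0], [1.0, 0.0]]
print(select_key_frames(frames, threshold=0.9))  # [0, 2, 4]
```

Comparing each frame only against the last kept key frame means a scene that reappears later still receives its own key frame, which preserves temporal order for a later alignment stage. A stricter rule that compares against all kept key frames would store fewer frames but lose that ordering information.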