Computer Science ›› 2024, Vol. 51 ›› Issue (5): 108-116. doi: 10.11896/jsjkx.230300232

• Computer Graphics & Multimedia •

Partial Near-duplicate Video Detection Algorithm Based on Transformer Low-dimensional Compact Coding

WANG Ping, YU Zhenhuang, LU Lei   

  1. School of Information and Communication Engineering, Xi'an Jiaotong University, Xi'an 710049, China
  • Received: 2023-03-30 Revised: 2023-10-07 Online: 2024-05-15 Published: 2024-05-08
  • Corresponding author: LU Lei (lu.lei@xjtu.edu.cn)
  • About author: WANG Ping, born in 1976, Ph.D, associate professor (ping.fu@xjtu.edu.cn). Her main research interests include image processing and video analysis.
    LU Lei, born in 1988, Ph.D, lecturer, is a member of CCF (No.J5150M). His main research interests include image processing, deep learning, and signal analysis.

Abstract: To address the issues of existing partial near-duplicate video detection algorithms, namely high feature storage consumption, low query efficiency, and feature extraction that ignores the subtle semantic differences between near-duplicate frames, this paper proposes a partial near-duplicate video detection algorithm based on Transformer compact coding. First, a Transformer-based feature encoder is proposed that learns the subtle semantic differences between a large number of near-duplicate frames. It applies self-attention over the region feature maps when encoding a frame, which effectively reduces the feature dimensionality while enhancing the representational capacity of the encoded feature. The encoder is trained with a siamese network that learns the semantic similarities between near-duplicate frames without negative samples, eliminating the heavy and difficult hard-negative annotation work and making training simpler and more efficient. Second, a keyframe extraction method based on the video self-similarity matrix is proposed. It extracts rich but non-redundant keyframes, so that the keyframe feature sequence describes the original video content more comprehensively and detection performance improves, while the overhead of storing and matching redundant keyframes drops sharply. Finally, a graph network-based temporal alignment algorithm detects and localizes partial near-duplicate video clips from the low-dimensional compact keyframe codes. The proposed algorithm outperforms existing algorithms on the public partial near-duplicate video detection dataset VCDB.
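
The abstract describes the feature encoder only at a high level. As a rough illustration of how self-attention over region feature maps can yield a low-dimensional compact frame code, the following PyTorch sketch treats each spatial location of a backbone feature map as a token; the module names, dimensions (2048-D regions, 128-D codes) and the mean-pooling head are illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class RegionTransformerEncoder(nn.Module):
    # Encodes a CNN feature map into one low-dimensional frame code.
    # Each spatial location (region) becomes a token, so self-attention
    # can weigh the subtle regional differences between near-duplicates.
    def __init__(self, in_dim=2048, model_dim=512, code_dim=128,
                 num_heads=8, num_layers=2, num_regions=49):
        super().__init__()
        self.proj = nn.Linear(in_dim, model_dim)      # region -> token
        self.pos = nn.Parameter(torch.zeros(1, num_regions, model_dim))
        layer = nn.TransformerEncoderLayer(model_dim, num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(model_dim, code_dim)    # compact frame code

    def forward(self, fmap):                          # fmap: (B, C, H, W)
        _, _, h, w = fmap.shape
        tokens = fmap.flatten(2).transpose(1, 2)      # (B, H*W, C)
        tokens = self.proj(tokens) + self.pos[:, :h * w]
        tokens = self.encoder(tokens)                 # region self-attention
        code = self.head(tokens.mean(dim=1))          # pool over regions
        return nn.functional.normalize(code, dim=-1)  # unit-norm code

# e.g. four 7x7x2048 backbone maps -> four 128-D compact codes
codes = RegionTransformerEncoder()(torch.randn(4, 2048, 7, 7))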
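
The negative-free siamese training is likewise not spelled out in the abstract. The sketch below follows the stop-gradient recipe of SimSiam-style methods, which matches the description of learning from near-duplicate pairs without negatives; the predictor MLP and the symmetric negative-cosine loss are assumptions for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

def negative_free_loss(encoder, predictor, view_a, view_b):
    # Encode both near-duplicate views with the shared (siamese) encoder.
    za, zb = encoder(view_a), encoder(view_b)
    # Predict each branch from the other; the stop-gradient (detach) on
    # the target branch blocks the collapsed constant solution, so no
    # negative pairs (and no hard-negative mining) are needed.
    pa, pb = predictor(za), predictor(zb)
    loss_ab = -F.cosine_similarity(pa, zb.detach(), dim=-1).mean()
    loss_ba = -F.cosine_similarity(pb, za.detach(), dim=-1).mean()
    return 0.5 * (loss_ab + loss_ba)

# Toy usage with stand-in modules (the real encoder would be the
# Transformer code encoder sketched above):
encoder = nn.Sequential(nn.Flatten(), nn.Linear(32, 128))
predictor = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 128))
a, b = torch.randn(8, 1, 32), torch.randn(8, 1, 32)
loss = negative_free_loss(encoder, predictor, a, b)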
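
For keyframe extraction, the abstract states only that the video self-similarity matrix is used to select rich but non-redundant frames. One plausible greedy reading, sketched below, keeps a frame only if its similarity to every keyframe already kept stays under a redundancy threshold; the threshold value 0.85 is an assumption, not the paper's setting.

import numpy as np

def extract_keyframes(codes, redundancy_thr=0.85):
    # codes: (N, D) array of L2-normalized frame codes, in time order.
    # A frame is kept only if it is sufficiently dissimilar from every
    # frame already selected, so the keyframe set covers the content
    # without near-duplicate repeats.
    sim = codes @ codes.T                    # (N, N) self-similarity matrix
    keep = [0]                               # always keep the first frame
    for i in range(1, len(codes)):
        if sim[i, keep].max() < redundancy_thr:
            keep.append(i)
    return keep

# e.g. 100 frames of 128-D codes
codes = np.random.randn(100, 128)
codes /= np.linalg.norm(codes, axis=1, keepdims=True)
keyframes = extract_keyframes(codes)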
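
Finally, the graph network-based temporal alignment can be pictured as a heaviest-path search over a frame-match graph: matched keyframe pairs are nodes, and edges connect pairs that advance in time in both videos. The dynamic program below is a generic sketch of that idea with assumed thresholds, not the paper's exact algorithm.

import numpy as np

def align_segments(sim, match_thr=0.8, max_gap=3):
    # sim: (Nq, Nr) similarities between query and reference keyframe codes.
    # Nodes are keyframe pairs (i, j) similar enough to count as a match.
    nodes = [(i, j) for i in range(sim.shape[0])
             for j in range(sim.shape[1]) if sim[i, j] > match_thr]
    if not nodes:
        return []
    best = {n: float(sim[n]) for n in nodes}  # best path score ending at n
    prev = {n: None for n in nodes}
    # Lexicographic order is a topological order, since every edge moves
    # forward in time in both videos.
    for n in sorted(nodes):
        i, j = n
        for di in range(1, max_gap + 1):
            for dj in range(1, max_gap + 1):
                m = (i + di, j + dj)
                if m in best and best[n] + sim[m] > best[m]:
                    best[m] = best[n] + float(sim[m])
                    prev[m] = n
    # Backtrack the heaviest path: the aligned near-duplicate segment.
    n = max(best, key=best.get)
    path = []
    while n is not None:
        path.append(n)
        n = prev[n]
    return path[::-1]       # [(query_idx, reference_idx), ...] in time order

# Toy usage: a diagonal band of high similarity simulates a copied clip.
sim = np.random.rand(20, 30) * 0.5
for t in range(8):
    sim[5 + t, 12 + t] = 0.95
segment = align_segments(sim)   # recovers the (5,12)...(12,19) diagonal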

Key words: Partial near-duplicate video detection, Transformer, Video self-similarity matrix, Keyframe extraction

CLC Number: TP391.4