Computer Science ›› 2024, Vol. 51 ›› Issue (5): 108-116. doi: 10.11896/jsjkx.230300232

• Computer Graphics & Multimedia •

Partial Near-duplicate Video Detection Algorithm Based on Transformer Low-dimensional Compact Coding

WANG Ping, YU Zhenhuang, LU Lei   

  1. School of Information and Communication Engineering,Xi'an Jiaotong University,Xi'an 710049,China
  • Received:2023-03-30 Revised:2023-10-07 Online:2024-05-15 Published:2024-05-08
  • About author:WANG Ping,born in 1976,Ph.D,associate professor.Her main research interests include image processing and video analysis.
    LU Lei,born in 1988,Ph.D,lecturer,is a member of CCF(No.J5150M).His main research interests include image processing,deep learning,and signal analysis.

Abstract: Existing partial near-duplicate video detection algorithms suffer from high storage consumption, low query efficiency, and feature extraction modules that ignore the subtle semantic differences between near-duplicate frames. To address these issues, this paper proposes a partial near-duplicate video detection algorithm based on Transformer. First, a Transformer-based feature encoder is proposed that can learn the subtle semantic differences among large numbers of near-duplicate frames. A self-attention mechanism is applied to the feature maps of frame regions during frame feature encoding, which reduces the feature dimensionality while enhancing its representational capacity. The feature encoder is trained with a siamese network that learns the semantic similarity between near-duplicate frames without negative samples, eliminating heavy and difficult negative-sample annotation and making training simpler and more efficient. Second, a key frame extraction method based on the video self-similarity matrix is proposed. It extracts rich, non-redundant key frames that describe the original video content more comprehensively, improving detection performance while significantly reducing the storage and computation overhead caused by redundant key frames. Finally, a graph-network-based temporal alignment algorithm detects and localizes partial near-duplicate video clips from the low-dimensional, compact encoded features of the key frames. The proposed algorithm achieves strong experimental results on the public partial near-duplicate video detection dataset VCDB and outperforms existing algorithms.
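
As a concrete illustration of the key frame selection step described above, the sketch below builds a cosine self-similarity matrix over encoded frame features and greedily keeps only frames that are not near-duplicates of frames already kept. This is a minimal sketch inferred from the abstract, not the authors' implementation; the function names, the greedy selection rule, and the similarity threshold sim_thresh are assumptions.

    # Hypothetical sketch: key frame selection from a video self-similarity matrix.
    # Assumes frame features have already been produced by the Transformer encoder.
    import numpy as np

    def self_similarity_matrix(frame_feats):
        """Cosine self-similarity of L2-normalized frame features: (N, D) -> (N, N)."""
        norms = np.linalg.norm(frame_feats, axis=1, keepdims=True)
        feats = frame_feats / (norms + 1e-12)
        return feats @ feats.T

    def select_key_frames(frame_feats, sim_thresh=0.85):
        """Greedily keep a frame only if its similarity to every already-kept
        frame is below sim_thresh, yielding a non-redundant key frame set."""
        sim = self_similarity_matrix(frame_feats)
        kept = []
        for i in range(sim.shape[0]):
            if all(sim[i, j] < sim_thresh for j in kept):
                kept.append(i)
        return kept

    # Usage example (assumed 300 frames with 512-dimensional compact codes):
    # key_indices = select_key_frames(np.random.randn(300, 512))

Only the retained key frames would then be passed to the graph-network-based temporal alignment stage, which keeps storage and matching cost proportional to the number of non-redundant frames rather than the full frame count.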

Key words: Partial near-duplicate video detection, Transformer, Video self-similarity matrix, Keyframe extraction

CLC Number: TP391.4