Computer Science ›› 2025, Vol. 52 ›› Issue (8): 251-258. doi: 10.11896/jsjkx.240900127

• Computer Graphics & Multimedia •

Few-shot Video Action Recognition Based on Two-stage Spatio-Temporal Alignment

WANG Jia, XIA Ying, FENG Jiangfan   

  1. College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
     Key Laboratory of Tourism Multisource Data Perception and Decision Technology, Ministry of Culture and Tourism, Chongqing 400065, China
  • Received: 2024-09-23  Revised: 2024-11-23  Online: 2025-08-15  Published: 2025-08-08
  • Corresponding author: XIA Ying (xiaying@cqupt.edu.cn)
  • About author: WANG Jia (S220201092@stu.cqupt.edu.cn), born in 1996, postgraduate. His main research interests include deep learning and video action recognition.
    XIA Ying, born in 1972, professor, Ph.D. supervisor. Her main research interests include spatio-temporal big data and cross-media retrieval.
  • Supported by:
    National Natural Science Foundation of China (41971365), Key Cooperation Project of the Chongqing Municipal Education Commission (HZ2021008) and Key Laboratory Project of the Ministry of Culture and Tourism, China (E020H2023005).

Abstract: Few-shot video action recognition aims to build efficient learning models from only a limited number of training samples, thereby reducing the dependence of traditional action recognition on large-scale, finely annotated datasets. Most existing few-shot learning models classify videos according to their mutual similarity. However, because different action instances exhibit different spatio-temporal distributions, temporal misalignment and action-evolution misalignment arise between query and support videos, which degrades recognition performance. To address this issue, a two-stage spatio-temporal alignment network (TSAN) is proposed to improve the alignment accuracy of video data and, in turn, the accuracy of few-shot video action recognition. The network adopts a meta-learning framework. In the first stage, the action temporal alignment module (ATAM) constructs video frame pairs in tuple form, subdividing each video action into a sub-action sequence and exploiting the temporal information in the video data to improve the efficiency of few-shot learning. In the second stage, the action evolution alignment module (AEAM), together with its temporal synchronization submodule (TSM) and spatial coordination submodule (SCM), calibrates the query features to match the spatio-temporal action evolution of the support set, thereby improving the accuracy of few-shot video action recognition. Experimental results on the HMDB51, UCF101, SSV2100 and Kinetics100 datasets show that TSAN achieves higher recognition accuracy than existing few-shot video action recognition methods.

Key words: Action recognition, Video classification, Spatio-temporal alignment, Few-shot learning, Meta-learning

CLC Number: TP391
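
To make the two-stage pipeline described in the abstract more concrete, the following is a minimal, hypothetical PyTorch sketch of this kind of alignment applied to a few-shot episode. The module names ATAM, AEAM, TSM and SCM are taken from the abstract, but the frame-pair (tuple) construction, the cross-attention calibration, the tensor shapes and the prototype-based cosine classifier are illustrative assumptions, not the authors' implementation.

```python
# Minimal, hypothetical sketch of two-stage spatio-temporal alignment for
# few-shot action recognition. Illustrative only; not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ATAM(nn.Module):
    """Stage 1 (assumed): encode ordered frame pairs (tuples) so that a clip is
    represented as a sequence of sub-actions with explicit temporal order."""
    def __init__(self, dim):
        super().__init__()
        self.pair_proj = nn.Linear(2 * dim, dim)

    def forward(self, x):                                   # x: [B, T, D] frame features
        pairs = torch.cat([x[:, :-1], x[:, 1:]], dim=-1)    # [B, T-1, 2D] ordered pairs
        return F.relu(self.pair_proj(pairs))                # [B, T-1, D] sub-action sequence


class AEAM(nn.Module):
    """Stage 2 (assumed): calibrate query features toward the support evolution with
    two cross-attention passes standing in for the TSM and SCM submodules.
    Spatial positions are already pooled in this sketch, so the second pass is
    applied per time step rather than over a spatial grid."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.tsm = nn.MultiheadAttention(dim, heads, batch_first=True)  # temporal synchronization
        self.scm = nn.MultiheadAttention(dim, heads, batch_first=True)  # spatial coordination

    def forward(self, query, support):                      # both [B, T', D]
        q, _ = self.tsm(query, support, support)
        q, _ = self.scm(q, support, support)
        return query + q                                    # calibrated query features


def episode_logits(query, support, atam, aeam):
    """query: [Nq, T, D]; support: [N_way, K_shot, T, D]. Returns cosine logits [Nq, N_way]."""
    protos = atam(support.mean(dim=1))                      # class prototypes as sub-action sequences
    q = atam(query)                                         # [Nq, T-1, D]
    logits = []
    for c in range(protos.size(0)):
        p = protos[c].unsqueeze(0).expand(q.size(0), -1, -1)
        aligned = aeam(q, p)                                # align query toward class-c evolution
        logits.append(F.cosine_similarity(aligned.mean(1), p.mean(1), dim=-1))
    return torch.stack(logits, dim=1)


if __name__ == "__main__":
    D, T = 64, 8
    atam, aeam = ATAM(D), AEAM(D)
    query = torch.randn(5, T, D)                            # 5 query clips (backbone features)
    support = torch.randn(5, 1, T, D)                       # 5-way 1-shot support set
    print(episode_logits(query, support, atam, aeam).shape) # torch.Size([5, 5])
```

In a full system, the frame-level features would come from a shared backbone such as a ResNet, the predicted class would be the one with the highest aligned similarity, and training would optimize a cross-entropy loss over these logits across sampled episodes.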