Computer Science ›› 2025, Vol. 52 ›› Issue (8): 251-258.doi: 10.11896/jsjkx.240900127

• Computer Graphics & Multimedia •

Few-shot Video Action Recognition Based on Two-stage Spatio-Temporal Alignment

WANG Jia, XIA Ying, FENG Jiangfan   

  1. College of Computer Science and Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
  2. Key Laboratory of Tourism Multisource Data Perception and Decision Technology, Ministry of Culture and Tourism, Chongqing 400065, China
  • Received: 2024-09-23  Revised: 2024-11-23  Online: 2025-08-15  Published: 2025-08-08
  • About author: WANG Jia, born in 1996, postgraduate. His main research interests include deep learning and video action recognition.
    XIA Ying, born in 1972, professor, Ph.D. supervisor. Her main research interests include spatio-temporal big data and cross-media retrieval.
  • Supported by:
    National Natural Science Foundation of China (41971365), Chongqing Municipal Education Commission Cooperation Projects (HZ2021008) and Key Laboratory Project from Ministry of Culture and Tourism, China (E020H2023005).

Abstract: Few-shot video action recognition aims to build efficient learning models from limited training samples, thereby reducing the dependence of conventional action recognition on large-scale, finely annotated datasets. Most existing few-shot learning models classify videos by their similarity. However, because action instances differ in their spatio-temporal distributions, temporal and action-evolution mismatches arise between query videos and support videos, which degrades recognition performance. To address this issue, a two-stage spatio-temporal alignment network (TSAN) is proposed to improve the alignment accuracy of video data and thereby raise the accuracy of few-shot video action recognition. The network adopts a meta-learning architecture. In the first stage, an action temporal alignment module (ATAM) constructs video frame pairs in tuple form, subdividing video actions into sub-action sequences and combining them with the temporal information in the video data to improve the efficiency of few-shot learning. In the second stage, an action evolution alignment module (AEAM), together with its temporal synchronization submodule (TSM) and spatial coordination submodule (SCM), calibrates the query features to match the spatio-temporal action evolution of the support set, further improving recognition accuracy. Experimental results on the HMDB51, UCF101, SSV2-100 and Kinetics-100 datasets show that TSAN achieves higher recognition accuracy than existing few-shot video action recognition methods.
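To make the two-stage idea in the abstract concrete, the following is a minimal, illustrative sketch in PyTorch of how a query clip could first be temporally aligned to each support-class prototype and then calibrated toward the support set's action evolution before metric-based scoring. This is not the authors' TSAN implementation: the module names (TemporalAlign, EvolutionAlign), the attention-based alignment, the residual fusion, and all dimensions are assumptions made only for illustration.

```python
# Hedged sketch of a two-stage alignment episode (NOT the published TSAN code).
# Stage 1 mimics ATAM: soft-align query frames to support frames over time.
# Stage 2 mimics AEAM: calibrate query features toward the support evolution.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalAlign(nn.Module):
    """Stage 1: re-express the query clip on the support clip's timeline."""
    def forward(self, query, support):
        # query, support: (T, D) frame-level features
        sim = F.normalize(query, dim=-1) @ F.normalize(support, dim=-1).t()  # (T, T)
        attn = sim.softmax(dim=-1)          # each query frame attends over support frames
        return attn @ support               # temporally aligned support features, (T, D)


class EvolutionAlign(nn.Module):
    """Stage 2: residual calibration of query features toward the support evolution."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, query, aligned_support):
        fused = torch.cat([query, aligned_support], dim=-1)   # (T, 2D)
        return query + self.proj(fused)                        # calibrated query, (T, D)


def episode_logits(query, supports, temporal_align, evolution_align):
    """Score one query clip against N class prototypes of frame features (T, D)."""
    scores = []
    for proto in supports:
        aligned = temporal_align(query, proto)                 # stage 1
        calibrated = evolution_align(query, aligned)           # stage 2
        scores.append(F.cosine_similarity(calibrated.mean(0), proto.mean(0), dim=0))
    return torch.stack(scores)                                 # (N,), higher = more similar


if __name__ == "__main__":
    T, D, N = 8, 64, 5                                         # frames, feature dim, 5-way episode
    ta, ea = TemporalAlign(), EvolutionAlign(D)
    query = torch.randn(T, D)
    supports = [torch.randn(T, D) for _ in range(N)]
    print(episode_logits(query, supports, ta, ea))             # five similarity scores
```

In a real meta-learning setup these scores would feed a softmax over classes and be trained episode by episode; the frame features would come from a pretrained backbone rather than random tensors.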

Key words: Action recognition, Video classification, Spatio-temporal alignment, Few-shot learning, Meta-learning

CLC Number: TP391