Computer Science ›› 2022, Vol. 49 ›› Issue (11): 156-162. doi: 10.11896/jsjkx.220600036

• Computer Graphics & Multimedia •

Temporal Relation Guided Knowledge Distillation for Continuous Sign Language Recognition

XIAO Zheng-ye1, LIN Shi-quan1, WAN Xiu-an1, FANG Yu-chun1, NI Lan2

  1 School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
  2 College of Liberal Arts, Shanghai University, Shanghai 200444, China
  • Received: 2022-06-03 Revised: 2022-08-02 Online: 2022-11-15 Published: 2022-11-03
  • Corresponding author: FANG Yu-chun (ycfang@shu.edu.cn)
  • About author: XIAO Zheng-ye (xiaozy@shu.edu.cn), born in 1996, bachelor's degree. His main research interests include machine learning and computer vision.
    FANG Yu-chun, born in 1975, Ph.D, professor. Her main research interests include machine learning, multimedia, pattern recognition and image processing.
  • Supported by:
    National Natural Science Foundation of China (61976132, 61991411, U1811461), Natural Science Foundation of Shanghai, China (19ZR1419200) and Shanghai Engineering Research Center of Intelligent Computing System (19DZ2252600).

Abstract: Recent research on continuous sign language recognition has focused mainly on the RGB modality and has achieved remarkable performance on both real-world and laboratory-collected datasets. However, processing RGB input demands high computing power from the device. The skeleton keypoint modality has lower input complexity and is therefore faster to process, but its recognition performance is weaker than that of the RGB modality. To combine the advantages of both, this paper proposes a cross-modal knowledge distillation method based on aligning temporal relation information, named temporally related knowledge distillation (TRKD). TRKD uses a network operating on the RGB modality as the teacher to guide a student network operating on the skeleton keypoint modality, so that continuous sign language recognition can be performed quickly and accurately. Because the teacher's understanding of sign language context is well worth learning by the student, a graph convolutional network (GCN) with prior information and an adaptive learning scheme is used to extract temporally related features from both modalities, and teaching is carried out by aligning these features. During feature alignment, introducing learnable parameters into the teacher network causes the supervision provided by the teacher to be lost, so traditional loss functions cannot be applied. To solve this problem, TRKD introduces contrastive learning from self-supervised learning to provide the supervisory signal, which aligns the temporally related features of the teacher and student networks. Multiple distillation tasks and ablation experiments on the Phoenix-2014 sign language dataset demonstrate the effectiveness of the proposed method.

Key words: Knowledge distillation, Graph convolutional network, Sign language recognition
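
The feature-alignment idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the loss or the graph construction, so the code below assumes an InfoNCE-style contrastive objective (a common choice for contrastive distillation) and replaces the GCN with a single fixed averaging step over neighboring time steps. The function names `neighbor_prior`, `temporal_graph_conv`, and `info_nce`, the window size, and the temperature are all hypothetical, and the adaptive (learnable) part of the graph is omitted.

```python
import math


def neighbor_prior(num_steps, window=1):
    """Assumed prior adjacency: each time step is linked to itself and to
    the time steps within `window` positions of it."""
    return [[1.0 if abs(i - j) <= window else 0.0 for j in range(num_steps)]
            for i in range(num_steps)]


def temporal_graph_conv(features, adjacency):
    """One propagation step without learnable weights: every time step
    becomes the adjacency-weighted average of the steps it is linked to."""
    dim = len(features[0])
    out = []
    for row in adjacency:
        total = sum(row) or 1.0
        out.append([sum(w * f[d] for w, f in zip(row, features)) / total
                    for d in range(dim)])
    return out


def normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(student, teacher, temperature=0.1):
    """InfoNCE-style loss: the student feature at time step i should be
    closer (cosine similarity) to the teacher feature at step i than to
    the teacher features at all other time steps."""
    s = [normalize(v) for v in student]
    t = [normalize(v) for v in teacher]
    loss = 0.0
    for i, sv in enumerate(s):
        logits = [sum(a * b for a, b in zip(sv, tv)) / temperature for tv in t]
        peak = max(logits)  # subtract the maximum for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(s)


if __name__ == "__main__":
    adjacency = neighbor_prior(3)
    # Toy per-frame features; in practice these would come from the RGB
    # teacher and the skeleton student networks.
    teacher = temporal_graph_conv([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]], adjacency)
    print("aligned loss:   ", info_nce(teacher, teacher))
    print("misaligned loss:", info_nce(teacher[::-1], teacher))
```

In this toy setting the loss is lower when the student features follow the same temporal order as the teacher features than when the order is reversed, which is the behavior an alignment objective of this kind is meant to reward.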

CLC Number:

  • TP311