Computer Science, 2022, Vol. 49, Issue (11A): 211200198-6. doi: 10.11896/jsjkx.211200198

• Image Processing & Multimedia Technology •

Continuous Sign Language Recognition Method Based on Improved Transformer

WANG Shuai (王帅), ZHANG Shu-jun (张淑军), YE Kang (叶康), GUO Qi (郭淇)

  1. College of Information Science and Technology, Qingdao University of Science and Technology, Qingdao 266061, China
  • Online: 2022-11-10  Published: 2022-11-21
  • Corresponding author: ZHANG Shu-jun (zhangsj@qust.edu.cn)
  • About author: WANG Shuai (1005183361@qq.com), born in 1997, postgraduate. His main research interests include computer vision.
    ZHANG Shu-jun, born in 1980, associate professor. Her main research interests include computer vision.
  • Supported by: Key Research and Development Program of Shandong Province (2017GGX10127).

Abstract: Continuous sign language recognition is a challenging task. Most current models lack the ability to model long sequences as a whole, which lowers recognition and translation accuracy on longer sign language videos. The encoder-decoder structure of the Transformer can be applied to sign language recognition, but its position encoding scheme and multi-head self-attention mechanism still leave room for improvement. This paper therefore proposes a continuous sign language recognition method based on an improved Transformer. A parameterized position encoding, reused at multiple points in the model, is applied repeatedly to each word vector of a continuous sign language sentence, so that the positional relations between words are captured accurately. Learnable memory key-value pairs are added to the attention module to form a persistent memory module, and a linear high-dimensional mapping proportionally enlarges the number of attention heads together with the embedding dimension. These changes maximize the ability of the Transformer's multi-head attention mechanism to model long sign language sequences as a whole and to mine the key information in each frame of the video. The proposed method achieves competitive recognition results on the authoritative continuous sign language datasets PHOENIX-Weather2014 [1] and PHOENIX-Weather2014-T [2].

Key words: Continuous sign language recognition, Transformer, Multi-head attention, Position encoding
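
To make the abstract's modifications concrete, the following is a minimal PyTorch-style sketch of (a) a parameterized position encoding that can be reused (multiplexed) at several points in the model and (b) multi-head self-attention augmented with learnable persistent-memory key-value pairs, whose head count and inner width are enlarged by a linear high-dimensional mapping. It is an illustration assembled from the abstract's description and the persistent-memory idea of Sukhbaatar et al. [24]; the module names, the memory size n_mem and the expansion factor expand are assumptions, not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnablePositionalEncoding(nn.Module):
    # Position embeddings stored as trainable parameters; the same module can
    # be reused in front of several encoder/decoder layers so that positional
    # information is re-injected at each reuse.
    def __init__(self, max_len, d_model):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        nn.init.trunc_normal_(self.pos, std=0.02)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        return x + self.pos[:, :x.size(1)]

class PersistentMemoryAttention(nn.Module):
    # Multi-head self-attention with n_mem learnable key/value slots shared
    # across the batch (persistent memory); the head count and inner width are
    # proportionally enlarged by the qkv linear mapping.
    def __init__(self, d_model, n_heads, n_mem=16, expand=2):
        super().__init__()
        d_inner = d_model * expand             # enlarged embedding width
        self.n_heads = n_heads * expand        # enlarged number of heads
        self.d_head = d_inner // self.n_heads  # per-head width is unchanged
        self.qkv = nn.Linear(d_model, 3 * d_inner)
        self.out = nn.Linear(d_inner, d_model)
        self.mem_k = nn.Parameter(0.02 * torch.randn(self.n_heads, n_mem, self.d_head))
        self.mem_v = nn.Parameter(0.02 * torch.randn(self.n_heads, n_mem, self.d_head))

    def forward(self, x):
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        # Prepend the persistent slots so every query also attends to them.
        k = torch.cat([self.mem_k.unsqueeze(0).expand(b, -1, -1, -1), k], dim=2)
        v = torch.cat([self.mem_v.unsqueeze(0).expand(b, -1, -1, -1), v], dim=2)
        att = F.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        y = (att @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out(y)

With d_model=512, n_heads=8 and expand=2, the module internally runs 16 heads over a 1024-dimensional space and projects back to 512, so it maps a (batch, frames, 512) tensor to the same shape and can stand in for a standard attention layer in this sketch.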

CLC Number: TP391

References
[1]FORSTER J,SCHMIDT C,HOYOUX T,et al.RWTH-PHOENIX-Weather:A Large Vocabulary Sign Language Recognition and Translation Corpus[C]//International Conference on Language Resources and Evaluation(LREC).2012.
[2]FORSTER J,SCHMIDT C,KOLLER O,et al.Extensions of the sign language recognition and translation corpus RWTH-PHOENIX-Weather[C]//International Conference on Language Resources and Evaluation.2014:1911-1916.
[3]LECUN Y,BOTTOU L.Gradient-based learning applied to document recognition[J].Proceedings of the IEEE,1998,86(11):2278-2324.
[4]LIPTON Z C,BERKOWITZ J,ELKAN C.A critical review of recurrent neural networks for sequence learning[J].arXiv:1506.00019,2015.
[5]FORSTER J,KOLLER O,OBERDÖRFER C,et al.Improving Continuous Sign Language Recognition:Speech Recognition Techniques and System Design[C]//Workshop on Speech and Language Processing for Assistive Technologies.2013.
[6]GRAVES A,FERNÁNDEZ S,GOMEZ F,et al.Connectionist temporal classification:labelling unsegmented sequence data with recurrent neural networks[C]//Proceedings of the 23rd International Conference on Machine Learning.2006:369-376.
[7]VASWANI A,SHAZEER N,PARMAR N,et al.Attention is all you need[C]//Advances in Neural Information Processing Systems.2017:5998-6008.
[8]CUI R,LIU H,ZHANG C.Recurrent convolutional neural networks for continuous sign language recognition by staged optimization [C]//IEEE Conference on Computer Vision and Pattern Recognition.2017:7361-7369.
[9]MOLCHANOV P,YANG X,GUPTA S,et al.Online Detection and Classification of Dynamic Hand Gestures with Recurrent 3D Convolutional Neural Networks[C]//IEEE Conference on Computer Vision and Pattern Recognition.2016:4207-4215.
[10]HUANG J,ZHOU W,LI H,et al.Sign Language Recognition using 3D convolutional neural networks[C]//IEEE International Conference on Multimedia and Expo.2015:1-6.
[11]HOCHREITER S,SCHMIDHUBER J.Long short-term memory[J].Neural Computation,1997,9(8):1735-1780.
[12]PIGOU L,HERREWEGHE M V,DAMBRE J.Gesture and Sign Language Recognition with Temporal Residual Networks[C]//IEEE International Conference on Computer Vision Workshop.2017:3086-3093.
[13]CAMGOZ N C,HADFIELD S,KOLLER O,et al.SubUNets:End-to-End Hand Shape and continuous sign language recognition[C]//IEEE International Conference on Computer Vision.2017:3075-3084.
[14]XU B,HUANG S,YE Z.Application of Tensor Train Decomposition in S2VT Model for Sign Language Recognition[J].IEEE Access,2021,9:35646-35653.
[15]BAHDANAU D,CHO K,BENGIO Y.Neural machine translation by jointly learning to align and translate[J].arXiv:1409.0473,2014.
[16]PAPASTRATIS I,DIMITROPOULOS K,KONSTANTINIDIS D,et al.Continuous Sign Language Recognition Through Cross-Modal Alignment of Video and Text Embeddings in a Joint-Latent Space[J].IEEE Access,2020,8:91170-91180.
[17]LIN Z,FENG M,SANTOS C,et al.A Structured Self-attentive Sentence Embedding [J].arXiv:1703.03130,2017.
[18]GEHRING J,AULI M,GRANGIER D,et al.Convolutional Sequence to Sequence Learning[C]//International Conference on Machine Learning.PMLR,2017:1243-1252.
[19]CAMGOZ N C,KOLLER O,HADFIELD S,et al.Sign language transformers:Joint end-to-end sign language recognition and translation[J].arXiv:2003.13830,2020.
[20]NIU Z,MAK B.Stochastic Fine-Grained Labeling of Multi-state Sign Glosses for Continuous Sign Language Recognition[C]//European Conference on Computer Vision.Cham:Springer,2020.
[21]YIN K,READ J.Better Sign Language Translation with STMC-Transformer[C]//Proceedings of the 28th International Conference on Computational Linguistics.2020.
[22]CAMGOZ N C,KOLLER O,HADFIELD S,et al.Multi-channel Transformers for Multi-articulatory Sign Language Translation[C]//European Conference on Computer Vision.Cham:Springer,2020:301-319.
[23]BEN SLIMANE F,BOUGUESSA M.Context Matters:Self-Attention for Sign Language Recognition[J].arXiv:2101.04632,2021.
[24]SUKHBAATAR S,GRAVE E,LAMPLE G,et al.Augmenting Self-attention with Persistent Memory[J].arXiv:1907.01470,2019.
[25]TOUVRON H,CORD M,SABLAYROLLES A,et al.Going deeper with Image Transformers[J].arXiv:2103.17239,2021.
[26]KOLLER O,ZARGARAN S,NEY H,et al.Deep Sign:hybrid CNN-HMM for continuous sign language recognition[C]//British Machine Vision Conference(BMVC).2016:1-12.
[27]KOLLER O,ZARGARAN S,NEY H.Re-sign:Re-aligned end-to-end sequence modelling with deep recurrent CNN-HMMs[C]//IEEE Conference on Computer Vision and Pattern Recognition.2017:4297-4305.
[28]CUI R,LIU H,ZHANG C.Recurrent convolutional neural networks for continuous sign language recognition by staged optimization[C]//IEEE Conference on Computer Vision and Pattern Recognition.2017:7361-7369.
[29]HUANG J,ZHOU W,ZHANG Q,et al.Video-based sign language recognition without temporal segmentation[C]//AAAI Conference on Artificial Intelligence.2018.
[30]PU J,ZHOU W,LI H.Iterative alignment network for continuous sign language recognition[C]//IEEE Conference on Computer Vision and Pattern Recognition(CVPR).2019.
[31]YANG Z,SHI Z,SHEN X,et al.SF-Net:Structured feature network for continuous sign language recognition[J].arXiv:1908.01341,2019.
[32]ZHOU H,ZHOU W,LI H.Dynamic pseudo label decoding for continuous sign language recognition[C]//IEEE International Conference on Multimedia and Expo.2019:1282-1287.
[33]KOLLER O,CAMGOZ C,NEY H,et al.Weakly supervised learning with multi-stream CNN-LSTM-HMMs to discover sequential parallelism in sign language videos[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2019,42(9):2306-2320.