Computer Science, 2022, Vol. 49, Issue (9): 155-161. doi: 10.11896/jsjkx.210800026

• Computer Graphics & Multimedia •

Sequence-to-Sequence Chinese Continuous Sign Language Recognition and Translation with Multi-layer Attention Mechanism Fusion

ZHOU Le-yuan1, ZHANG Jian-hua1, YUAN Tian-tian2, CHEN Sheng-yong1

  1 School of Computer Science and Technology, Tianjin University of Technology, Tianjin 300382, China
  2 Technical College for the Deaf, Tianjin University of Technology, Tianjin 300382, China
  • Received: 2021-08-03 Revised: 2021-12-10 Online: 2022-09-15 Published: 2022-09-09
  • Corresponding author: ZHANG Jian-hua (zjh@email.tjut.edu.cn)
  • Author contact: 870185811@qq.com
  • About author: ZHOU Le-yuan, born in 1996, postgraduate. His main research interests include deep learning and computer vision.
    ZHANG Jian-hua, born in 1981, Ph.D., professor, Ph.D. supervisor. His main research interests include computer vision, digital image processing and robot intelligent technology.
  • Supported by:
    National Natural Science Foundation of China (61876167), Natural Science Foundation of Zhejiang Province (LY20F030017) and Tianjin Intelligent Manufacturing Special Foundation (20201169).

Abstract: Enabling computers to understand signers' expressions remains a challenging task: a model must account for the temporal and spatial information in sign language videos as well as the complexity of sign language grammar. In continuous sign language recognition, the sign language words (glosses) follow the same order as the signed actions. In continuous sign language translation, by contrast, the generated natural-language sentence must read as spoken language, so its word order may differ from the action order. To learn signers' expressions more accurately, this paper proposes a novel deep neural network that performs both sign language recognition and translation. The scheme examines how different classical pre-trained convolutional neural networks and different multi-layer temporal attention score functions affect continuous sign language recognition. The network combines the high-level abstract features and the low-level temporal semantics of a sign language video in a multi-layer temporal attention fusion module, forming a more comprehensive fused sequence-attention feature from which gloss sentences are generated more accurately. A Transformer language model then converts the recognized gloss sentences into continuous natural-language sentences that conform to spoken description. The method is first evaluated on Tslrt, the first large-scale Chinese continuous sign language recognition and translation dataset with complex backgrounds. The network is trained on the complex background environments and rich action expressions of the signers in Tslrt, and a series of comparison experiments yields a set of benchmark results. The best word error rates (WER) are 4.8% on continuous sign language recognition and 5.1% on translation. To further demonstrate the effectiveness of the method, it is also validated on another public Chinese continuous sign language recognition dataset, Chinese-CSL, and compared with 13 other published methods. The proposed method achieves the best recognition result among them, a WER of 1.8%.

Key words: Continuous sign language recognition and translation, Video understanding, Sequence model, Attention mechanism fusion, Convolutional neural network
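
The abstract reports its results as word error rate (WER). As a rough illustration of the metric only, and not the authors' evaluation code, the following Python sketch assumes the common definition: the word-level edit distance (substitutions, deletions, insertions) between a reference and a predicted token sequence, divided by the reference length. The function name `word_error_rate` and the gloss tokens in the example are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / len(reference).

    Both arguments are sequences of tokens, e.g. gloss words.
    This is a sketch of the usual WER definition, assumed here;
    the paper's exact scoring procedure is not given in the abstract.
    """
    ref, hyp = list(reference), list(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one token")

    # prev[j] = edit distance between the reference prefix processed so far
    # and the first j hypothesis tokens (row-by-row dynamic programming).
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference token missing
                curr[j - 1] + 1,     # insertion: extra hypothesis token
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical gloss sequences, for illustration only.
reference = ["I", "GO", "SCHOOL", "TODAY"]
hypothesis = ["I", "GO", "HOME", "TODAY", "NOW"]
print(word_error_rate(reference, hypothesis))  # one substitution + one insertion over 4 tokens -> 0.5
```

Under this definition a WER of 4.8% corresponds to roughly one word-level error per 21 reference words. Because insertions are counted, WER can exceed 100% when the prediction is much longer than the reference.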

CLC Number: TP391