Computer Science ›› 2022, Vol. 49 ›› Issue (9): 155-161. doi: 10.11896/jsjkx.210800026

• Computer Graphics & Multimedia •

Sequence-to-Sequence Chinese Continuous Sign Language Recognition and Translation with Multi-layer Attention Mechanism Fusion

ZHOU Le-yuan1, ZHANG Jian-hua1, YUAN Tian-tian2, CHEN Sheng-yong1   

  1. School of Computer Science and Technology, Tianjin University of Technology, Tianjin 300382, China
  2. Technical College for the Deaf, Tianjin University of Technology, Tianjin 300382, China
  • Received: 2021-08-03 Revised: 2021-12-10 Online: 2022-09-15 Published: 2022-09-09
  • About author: ZHOU Le-yuan, born in 1996, postgraduate. His main research interests include deep learning and computer vision.
    ZHANG Jian-hua, born in 1981, Ph.D, professor, Ph.D supervisor. His main research interests include computer vision, digital image processing and robot intelligent technology.
  • Supported by:
    National Natural Science Foundation of China (61876167), Natural Science Foundation of Zhejiang Province (LY20F030017) and Tianjin Intelligent Manufacturing Special Foundation (20201169).

Abstract: Enabling computers to understand the expressions of signers is a challenging task that requires considering not only the temporal and spatial information in sign language videos but also the complexity of sign language grammar. In continuous sign language recognition, sign language words and sign language actions share a consistent order. In continuous sign language translation, by contrast, the generated natural language sentences must conform to spoken-language usage, and the word order may not coincide with the action order. To learn signers' expressions more accurately, this paper proposes a novel deep neural network for simultaneous sign language recognition and translation. In this scheme, we explore the effectiveness of different classical pre-trained convolutional neural networks and different multi-layer temporal attention score functions for continuous sign language recognition, and combine them with a Transformer language model so that, building on the recognition results, the translation conforms to spoken-language usage. The method is first assessed on Tslrt, the first large-scale, complex-background Chinese continuous sign language recognition and translation dataset. The complex contextual environments and rich action expressions of the signers in Tslrt are used to train the model in a set of comparison experiments, yielding a series of benchmark results. The best word error rates (WER) are 4.8% and 5.1% on continuous sign language recognition and translation, respectively. To further demonstrate the effectiveness of the method, experiments are conducted on another Chinese continuous sign language recognition dataset, Chinese-CSL, and the results are compared with those of 13 other methods. The WER of our method reaches 1.8%, which supports the effectiveness of the proposed method.
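The results above are reported as WER. As a reading aid, the following is a minimal sketch of how a word error rate is commonly computed: the word-level edit distance between a reference and a predicted sequence, divided by the reference length. The function name `word_error_rate` and the example glosses are illustrative assumptions; the abstract does not describe the authors' evaluation code.

```python
from typing import Sequence


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Word-level edit distance divided by reference length.

    Illustrative helper (hypothetical name), not the paper's evaluation script.
    """
    ref = list(reference)
    hyp = list(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up example sequences, for illustration only.
reference = ["I", "go", "school", "tomorrow"]
hypothesis = ["I", "go", "to", "school"]
print(word_error_rate(reference, hypothesis))
```

Under this definition a lower value is better, and the value can exceed 1 when the prediction contains many insertions.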

Key words: Continuous sign language recognition and translation, Video understanding, Sequence model, Attention mechanism fusion, Convolutional neural network

CLC Number: 

  • TP391