Computer Science ›› 2022, Vol. 49 ›› Issue (11A): 211200198-6. doi: 10.11896/jsjkx.211200198

• Image Processing & Multimedia Technology •

Continuous Sign Language Recognition Method Based on Improved Transformer

WANG Shuai, ZHANG Shu-jun, YE Kang, GUO Qi   

  1. College of Information Science and Technology, Qingdao University of Science and Technology, Qingdao 266061, China
  • Online: 2022-11-10  Published: 2022-11-21
  • About author: WANG Shuai, born in 1997, postgraduate. His main research interests include computer vision.
    ZHANG Shu-jun, born in 1980, associate professor. Her main research interests include computer vision.
  • Supported by:
    Key Research and Development Program of Shandong (2017GGX10127).

Abstract: Continuous sign language recognition is a challenging task. Most current models neglect the overall modeling of long sequences, which lowers recognition and translation accuracy on longer sign language videos. The encoder-decoder structure of the Transformer is well suited to sign language recognition, but its position encoding method and multi-head self-attention mechanism still leave room for improvement. This paper therefore proposes a continuous sign language recognition method based on an improved Transformer. First, reusable parameterized position encodings are applied so that each word vector in a continuous sign sentence is encoded several times, capturing the positional relations between words more precisely. Second, learnable memory key-value pairs are added to the attention module to form a persistent memory module. Third, the number of attention heads and the embedding dimension are enlarged through a linear mapping into a higher-dimensional space. Together, these changes exploit the multi-head attention mechanism of the Transformer more fully, strengthen the overall modeling of long sign language sequences, and mine the key information in each video frame more deeply. The proposed method achieves competitive recognition results on the widely used continuous sign language datasets PHOENIX-Weather2014 [1] and PHOENIX-Weather2014-T [2].
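
To make the proposed changes concrete, the following sketch shows how the learnable position codes and the persistent-memory attention could look in PyTorch. It is a minimal illustration reconstructed from the abstract alone, not the authors' released implementation: the class names LearnablePositionalEncoding and PersistentMemoryAttention, the hyperparameters num_mem and expand, and the initialization choices are all assumptions made for illustration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LearnablePositionalEncoding(nn.Module):
        """Parameterized position codes added to the frame/word embeddings;
        the same module can be re-applied before several encoder layers
        (an illustrative reading of the 'multiplexed' position codes)."""
        def __init__(self, max_len, d_model):
            super().__init__()
            self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))
            nn.init.trunc_normal_(self.pos, std=0.02)

        def forward(self, x):
            # x: (batch, seq_len, d_model)
            return x + self.pos[:, : x.size(1)]

    class PersistentMemoryAttention(nn.Module):
        """Multi-head self-attention whose keys/values are augmented with
        learnable 'persistent memory' slots (after Sukhbaatar et al. [24]).
        The input is first mapped to a higher dimension (factor `expand`)
        so that more heads / larger embeddings can be used."""
        def __init__(self, d_model, num_heads, num_mem=16, expand=2):
            super().__init__()
            d_inner = d_model * expand          # linear high-dimensional mapping
            assert d_inner % num_heads == 0
            self.h, self.dh = num_heads, d_inner // num_heads
            self.qkv = nn.Linear(d_model, 3 * d_inner)
            self.out = nn.Linear(d_inner, d_model)
            # input-independent key/value pairs, learned with the model
            self.mem_k = nn.Parameter(torch.randn(self.h, num_mem, self.dh) * 0.02)
            self.mem_v = nn.Parameter(torch.randn(self.h, num_mem, self.dh) * 0.02)

        def forward(self, x):
            b, t, _ = x.shape
            q, k, v = self.qkv(x).chunk(3, dim=-1)
            # reshape to (batch, heads, seq, head_dim)
            q = q.view(b, t, self.h, self.dh).transpose(1, 2)
            k = k.view(b, t, self.h, self.dh).transpose(1, 2)
            v = v.view(b, t, self.h, self.dh).transpose(1, 2)
            # prepend the persistent memory slots to keys and values
            k = torch.cat([self.mem_k.expand(b, -1, -1, -1), k], dim=2)
            v = torch.cat([self.mem_v.expand(b, -1, -1, -1), v], dim=2)
            attn = F.softmax(q @ k.transpose(-2, -1) / self.dh ** 0.5, dim=-1)
            y = (attn @ v).transpose(1, 2).reshape(b, t, self.h * self.dh)
            return self.out(y)

For example, x = torch.randn(2, 300, 512) passed through LearnablePositionalEncoding(1024, 512) and then PersistentMemoryAttention(512, num_heads=16) returns a tensor of the same shape. The memory slots are input-independent and shared across all positions, giving every query a small set of learned global key-value pairs to attend to, which is the persistent-memory idea of [24]; the expand factor realizes the linear high-dimensional mapping that makes room for additional attention heads.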

Key words: Continuous sign language recognition, Transformer, Multi-head attention, Position encoding

CLC Number: TP391
[1]FORSTER J,SCHMIDT C,HOYOUX T,et al.RWTH-PHOENIX-Weather:A Large Vocabulary Sign Language Recognition and Translation Corpus[C]//International Conference on Language Resources and Evaluation(LREC).2012.
[2]FORSTER J,SCHMIDT C,KOLLER O,et al.Extensions of the sign language recognition and translation corpus RWTH-PHOENIX-Weather[C]//International Conference on Language Resources and Evaluation.2014:1911-1916.
[3]LECUN Y,BOTTOU L.Gradient-based learning applied to document recognition[J].Proceedings of the IEEE,1998,86(11):2278-2324.
[4]LIPTON Z C,BERKOWITZ J,ELKAN C.A critical review of recurrent neural networks for sequence learning[J].arXiv:1506.00019,2015.
[5]FORSTER J,KOLLER O,OBERDÖRFER C,et al.Improving Continuous Sign Language Recognition:Speech Recognition Techniques and System Design[C]//Workshop on Speech and Language Processing for Assistive Technologies.2013.
[6]GRAVES A,FERNÁNDEZ S,GOMEZ F,et al.Connectionist temporal classification:labelling unsegmented sequence data with recurrent neural networks[C]//Proceedings of the 23rd International Conference on Machine Learning.2006:369-376.
[7]VASWANI A,SHAZEER N,PARMAR N,et al.Attention is all you need[C]//Advances in Neural Information Processing Systems.2017:5998-6008.
[8]CUI R,LIU H,ZHANG C.Recurrent convolutional neural networks for continuous sign language recognition by staged optimization [C]//IEEE Conference on Computer Vision and Pattern Recognition.2017:7361-7369.
[9]MOLCHANOV P,YANG X,GUPTA S,et al.Online Detection and Classification of Dynamic Hand Gestures with Recurrent 3D Convolutional Neural Networks[C]//IEEE Conference on Computer Vision and Pattern Recognition.2016:4207-4215.
[10]HUANG J,ZHOU W,LI H,et al.Sign Language Recognition using 3D convolutional neural networks[C]//IEEE International Conference on Multimedia and Expo.2015:1-6.
[11]HOCHREITER S,SCHMIDHUBER J.Long short-term memory[J].Neural Computation,1997,9(8):1735-1780.
[12]PIGOU L,VAN HERREWEGHE M,DAMBRE J.Gesture and Sign Language Recognition with Temporal Residual Networks[C]//IEEE International Conference on Computer Vision Workshop.2017:3086-3093.
[13]CAMGOZ N C,HADFIELD S,KOLLER O,et al.SubUNets:End-to-End Hand Shape and continuous sign language recognition[C]//IEEE International Conference on Computer Vision.2017:3075-3084.
[14]XU B,HUANG S,YE Z.Application of Tensor Train Decomposition in S2VT Model for Sign Language Recognition[J].IEEE Access,2021,9:35646-35653.
[15]BAHDANAU D,CHO K,BENGIO Y.Neural machine translation by jointly learning to align and translate[J].arXiv:1409.0473,2014.
[16]PAPASTRATIS I,DIMITROPOULOS K,KONSTANTINIDIS D,et al.Continuous Sign Language Recognition Through Cross-Modal Alignment of Video and Text Embeddings in a Joint-Latent Space[J].IEEE Access,2020,8:91170-91180.
[17]LIN Z,FENG M,SANTOS C,et al.A Structured Self-attentive Sentence Embedding [J].arXiv:1703.03130,2017.
[18]GEHRING J,AULI M,GRANGIER D,et al.Convolutional Sequence to Sequence Learning[C]//International Conference on Machine Learning.PMLR,2017:1243-1252.
[19]CAMGOZ N C,KOLLER O,HADFIELD S,et al.Sign language transformers:Joint end-to-end sign language recognition and translation[J].arXiv:2003.13830,2020.
[20]NIU Z,MAK B.Stochastic Fine-Grained Labeling of Multi-state Sign Glosses for Continuous Sign Language Recognition[C]//European Conference on Computer Vision.Cham:Springer,2020.
[21]YIN K,READ J.Better Sign Language Translation with STMC-Transformer[C]//Proceedings of the 28th International Conference on Computational Linguistics.2020.
[22]CAMGOZ N C,KOLLER O,HADFIELD S,et al.Multi-channel Transformers for Multi-articulatory Sign Language Translation[C]//European Conference on Computer Vision.Cham:Springer,2020:301-319.
[23]BEN SLIMANE F,BOUGUESSA M.Context Matters:Self-Attention for Sign Language Recognition[J].arXiv:2101.04632,2021.
[24]SUKHBAATAR S,GRAVE E,LAMPLE G,et al.Augmenting Self-attention with Persistent Memory[J].arXiv:1907.01470,2019.
[25]TOUVRON H,CORD M,SABLAYROLLES A,et al.Going deeper with Image Transformers[J].arXiv:2103.17239,2021.
[26]KOLLER O,ZARGARAN S,NEY H,et al.Deep sign:hybrid CNN-HMM for continuous sign language recognition[C]//British Machine Vision Conference(BMVC).2016:1-12.
[27]KOLLER O,ZARGARAN S,NEY H.Re-sign:Re-aligned end-to-end sequence modelling with deep recurrent CNN-HMMs[C]//IEEE Conference on Computer Vision and Pattern Recognition.2017:4297-4305.
[28]CUI R,LIU H,ZHANG C.Recurrent convolutional neural networks for continuous sign language recognition by staged optimization[C]//IEEE Conference on Computer Vision and Pattern Recognition.2017:7361-7369.
[29]HUANG J,ZHOU W,ZHANG Q,et al.Video-based sign language recognition without temporal segmentation[C]//AAAI Conference on Artificial Intelligence.2018.
[30]PU J,ZHOU W,LI H.Iterative alignment network for continuous sign language recognition[C]//IEEE Conference on Computer Vision and Pattern Recognition(CVPR).2019.
[31]YANG Z,SHI Z,SHEN X,et al.SF-Net:Structured feature network for continuous sign language recognition[J].arXiv:1908.01341,2019.
[32]ZHOU H,ZHOU W,LI H.Dynamic pseudo label decoding for continuous sign language recognition[C]//IEEE International Conference on Multimedia and Expo.2019:1282-1287.
[33]KOLLER O,CAMGOZ C,NEY H,et al.Weakly supervised learning with multi-stream CNN-LSTM-HMMs to discover sequential parallelism in sign language videos[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2019,42(9):2306-2320.