Computer Science ›› 2024, Vol. 51 ›› Issue (11A): 240300139-6. doi: 10.11896/jsjkx.240300139

• Image Processing & Multimedia Technology •

Partition-Time Masking: A Data Augmentation Method for Lip Reading

HU Yu, YIN Jibin

  1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
  • Online: 2024-11-16 Published: 2024-11-13
  • Corresponding author: YIN Jibin (41868028@qq.com)
  • About author: (1786702137@qq.com) HU Yu, born in 1998, postgraduate. His main research interests include deep learning and lip reading.
    YIN Jibin, born in 1976, Ph.D., associate professor. His main research interests include human-computer interaction and artificial intelligence.

Abstract: This paper proposes Partition-Time Masking, a data augmentation method for lip reading. The method operates directly on the input data: it divides the input into multiple subsequences, applies a masking operation to each subsequence separately, and then concatenates the subsequences in their original order. This makes the model more robust to inputs with partially missing frames and thereby improves generalization. Five augmentation strategies are designed, differing in the number of subsequences and in the source of the mask values, and they are compared with Time Masking, a key data augmentation technique in lip-reading research. Experiments are carried out on the LRW and LRW-1000 datasets. The results indicate that Partition-Time Masking improves model performance more than Time Masking does. The best strategy uses three subsequences and masks with the average frame of each subsequence; it raises the performance of the state-of-the-art lip-reading model DC-TCN from 89.6% to 90.0%.

Key words: Lip reading, Time Masking, Data augmentation, Visual speech recognition, DC-TCN
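
The partition, mask, and reassemble steps described in the abstract can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function name `partition_time_mask`, the parameters `max_mask_len` and `fill`, the near-equal split of the clip, and the uniform sampling of the mask span are all assumptions, since the abstract does not specify them. Only the overall procedure (split into subsequences, mask each one, concatenate in order) and the use of each subsequence's average frame as the mask value come from the text.

```python
import numpy as np


def partition_time_mask(frames, num_partitions=3, max_mask_len=5,
                        fill="subseq_mean", rng=None):
    """Mask one random span of frames inside each temporal partition.

    frames: array of shape (T, H, W) or (T, H, W, C).
    fill:   "subseq_mean" fills with the mean frame of the partition,
            "zero" fills with an all-zero frame (illustrative options).
    """
    rng = np.random.default_rng() if rng is None else rng
    frames = np.asarray(frames, dtype=np.float32)
    # Split along the time axis into nearly equal contiguous subsequences
    # (assumed partitioning rule).
    parts = np.array_split(frames, num_partitions, axis=0)
    masked_parts = []
    for part in parts:
        part = part.copy()  # never modify the caller's array
        length = part.shape[0]
        if length > 0:
            # Assumed sampling rule: span length uniform in [0, max_mask_len],
            # start position uniform among the valid offsets.
            span = int(rng.integers(0, min(max_mask_len, length) + 1))
            start = int(rng.integers(0, length - span + 1))
            if fill == "subseq_mean":
                value = part.mean(axis=0)
            elif fill == "zero":
                value = np.zeros_like(part[0])
            else:
                raise ValueError(f"unknown fill: {fill}")
            part[start:start + span] = value
        masked_parts.append(part)
    # Reassemble the subsequences in their original order.
    return np.concatenate(masked_parts, axis=0)


# Usage on a dummy clip; the 29 x 88 x 88 shape is chosen only for illustration.
clip = np.random.default_rng(0).random((29, 88, 88)).astype(np.float32)
augmented = partition_time_mask(clip, num_partitions=3,
                                rng=np.random.default_rng(1))
print(augmented.shape)
```

Setting `num_partitions=1` in this sketch reduces it to a single masked span over the whole clip, which is the general shape of the Time Masking baseline the abstract compares against.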

CLC Number: 

  • TP391