Computer Science, 2024, Vol. 51, Issue (11A): 240300139-6. doi: 10.11896/jsjkx.240300139

• Image Processing & Multimedia Technology •

Partition-Time Masking: A Data Augmentation Method for Lip Reading

HU Yu, YIN Jibin   

  1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
  • Online: 2024-11-16  Published: 2024-11-13
  • About author: HU Yu, born in 1998, postgraduate. His main research interests include deep learning and lip reading.
    YIN Jibin, born in 1976, Ph.D., associate professor. His main research interests include human-computer interaction and artificial intelligence.

Abstract: This paper proposes Partition-Time Masking, a new data augmentation method for lip reading. The method operates directly on the input data: it divides the input sequence into multiple subsequences, applies a separate masking operation to each, and reassembles them in their original order. This makes the model more robust to inputs with partial frame loss and thereby improves generalization. Five augmentation strategies are designed, differing in the number of subsequences and the source of the mask values. The method is compared with Time Masking, a pivotal data augmentation technique in lip-reading research, in experiments on the LRW and LRW-1000 datasets. The results indicate that Partition-Time Masking improves model performance more than Time Masking does. The optimal strategy masks with the average frame of each subsequence and sets the number of subsequences to three. This strategy improves the performance of the state-of-the-art lip-reading model DC-TCN from 89.6% to 90.0%.
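
The strategy the abstract identifies as optimal (three subsequences, each masked with its own average frame) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `partition_time_masking`, the `max_mask_len` parameter, and the uniform sampling of mask length and position are assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np


def partition_time_masking(frames, num_parts=3, max_mask_len=5, rng=None):
    """Mask one random span in each temporal partition with that partition's mean frame.

    frames: array of shape (T, ...), time on axis 0.
    max_mask_len and the uniform sampling below are illustrative assumptions,
    not values taken from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    masked_parts = []
    # Divide the clip into num_parts contiguous subsequences along time.
    for part in np.array_split(np.asarray(frames), num_parts, axis=0):
        part = part.astype(np.float32, copy=True)
        n = part.shape[0]
        if n > 0:
            # Sample a mask length in [0, min(max_mask_len, n)] and a start index.
            length = int(rng.integers(0, min(max_mask_len, n) + 1))
            start = int(rng.integers(0, n - length + 1))
            # The mean is computed from the unmasked subsequence before assignment.
            part[start:start + length] = part.mean(axis=0)
        masked_parts.append(part)
    # Reassemble the subsequences in their original order.
    return np.concatenate(masked_parts, axis=0)


# Usage on a dummy clip of 29 frames of 4x4 pixels (shape chosen for illustration).
clip = np.arange(29 * 4 * 4).reshape(29, 4, 4)
augmented = partition_time_masking(clip, rng=np.random.default_rng(0))
```

Because each subsequence is masked independently, the missing frames are spread across the clip rather than concentrated in one span, which is the difference from plain Time Masking that the abstract describes.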

Key words: Lip reading recognition, Time Masking, Data enhancement, Visual speech recognition, DC-TCN

CLC Number: TP391