Partition-Time Masking: A Data Augmentation Method for Lip Reading

doi:10.11896/jsjkx.240300139

Abstract

This paper proposes Partition-Time Masking, a new data augmentation method for lip reading. The method operates directly on the input data: it divides the input into multiple subsequences, applies a separate masking operation to each, and reassembles them in their original order. This makes the model more robust to inputs with partial frame loss and thereby improves generalization. Five augmentation strategies are designed, differing in the number of subsequences and the source of the mask values. They are compared with Time Masking, a pivotal data augmentation technique in lip-reading research, on the LRW and LRW1000 datasets. The results indicate that Partition-Time Masking improves model performance more than Time Masking does. The best strategy masks each subsequence with its own average frame, with the number of subsequences set to three. It improves the performance of the state-of-the-art lip-reading model DC-TCN from 89.6% to 90.0%.
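
The best-performing strategy described above (three subsequences, each masked with its own average frame) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `partition_time_masking`, the `max_mask_len` parameter, the uniform sampling of mask length and position, and the even split via `np.array_split` are all assumptions, since the abstract does not specify them.

```python
import numpy as np


def partition_time_masking(frames, num_parts=3, max_mask_len=5, rng=None):
    """Mask one random span per subsequence with that subsequence's mean frame.

    frames: array of shape (T, H, W), one grayscale mouth crop per time step.
    num_parts, max_mask_len: hypothetical hyperparameters for this sketch.
    """
    rng = np.random.default_rng() if rng is None else rng
    masked_parts = []
    # array_split tolerates a T that is not divisible by num_parts.
    for part in np.array_split(frames, num_parts, axis=0):
        part = part.copy()
        n = part.shape[0]
        if n > 0:
            # Assumption: mask length and start are drawn uniformly.
            length = int(rng.integers(0, min(max_mask_len, n) + 1))
            start = int(rng.integers(0, n - length + 1))
            # Mask value: the average frame of this subsequence only.
            mean_frame = part.mean(axis=0).astype(part.dtype)
            part[start:start + length] = mean_frame
        masked_parts.append(part)
    # Reassemble the subsequences in their original order.
    return np.concatenate(masked_parts, axis=0)


if __name__ == "__main__":
    clip = np.random.default_rng(0).random((29, 88, 88), dtype=np.float32)
    augmented = partition_time_masking(clip, rng=np.random.default_rng(1))
    print(augmented.shape)
```

With `num_parts=1` the sketch reduces to a single mean-frame mask over the whole clip, which is one way to read the Time Masking baseline the abstract compares against. The input array is left unmodified because each subsequence is copied before masking.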

Key words: Lip reading, Time Masking, Data augmentation, Visual speech recognition, DC-TCN

CLC Number: TP391

HU Yu, YIN Jibin. Partition-Time Masking: A Data Augmentation Method for Lip Reading[J]. Computer Science, 2024, 51(11A): 240300139-6.


