Computer Science, 2024, Vol. 51, Issue 11A: 240300139-6. DOI: 10.11896/jsjkx.240300139
HU Yu, YIN Jibin
Abstract: This paper proposes Partition-Time Masking, a data augmentation method for lip reading. The method operates directly on the input data: the input sequence is partitioned into several sub-sequences, a masking operation is applied to each sub-sequence separately, and the sub-sequences are then concatenated back in order. This makes the model more robust to inputs with partially missing frames and thereby improves its generalization ability. Five augmentation strategies were designed, differing in the number of sub-sequences and in the source of the mask values, and were compared against Time Masking, the most widely used data augmentation method in lip-reading research. Experiments on the LRW and LRW-1000 datasets show that Partition-Time Masking improves model performance more than Time Masking. The best strategy uses three sub-sequences with each sub-sequence's mean frame as the mask value; it raises the accuracy of DC-TCN, the current state-of-the-art lip-reading model, from 89.6% to 90.0%.
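The partition-then-mask procedure described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, parameter names, the NumPy-array input format, and the choice of a random contiguous mask span per sub-sequence are all assumptions; only the overall scheme (partition, mask each sub-sequence with its mean frame, concatenate in order) comes from the abstract.

```python
import numpy as np

def partition_time_masking(frames, num_partitions=3, max_mask_len=None, rng=None):
    """Sketch of Partition-Time Masking (details assumed, see lead-in).

    frames: array of shape (T, ...) holding T video frames.
    Each of the num_partitions sub-sequences gets one random contiguous
    span of frames replaced by that sub-sequence's mean frame (the
    best-performing mask-value choice reported in the abstract).
    """
    rng = rng if rng is not None else np.random.default_rng()
    frames = np.asarray(frames, dtype=float)
    # Split along the time axis into roughly equal sub-sequences.
    parts = np.array_split(frames, num_partitions, axis=0)
    masked = []
    for part in parts:
        part = part.copy()
        t = len(part)
        if t > 0:
            limit = t if max_mask_len is None else min(max_mask_len, t)
            span = int(rng.integers(0, limit + 1))  # mask length (may be 0)
            if span > 0:
                start = int(rng.integers(0, t - span + 1))
                # Replace the span with this sub-sequence's mean frame.
                part[start:start + span] = part.mean(axis=0)
        masked.append(part)
    # Re-concatenate the sub-sequences in their original order.
    return np.concatenate(masked, axis=0)
```

The output has the same shape as the input, so the augmentation can be dropped into an existing training pipeline without changing the model's input dimensions.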