Computer Science (计算机科学) ›› 2025, Vol. 52 ›› Issue (6A): 240800028-7. doi: 10.11896/jsjkx.240800028
邹领, 朱磊, 邓阳君, 张红燕
ZOU Ling, ZHU Lei, DENG Yangjun, ZHANG Hongyan
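Abstract: Audio editing software and DeepFake techniques have made it easy to tamper with and forge digital audio and speech. Before an audio or speech recording can be admitted as valid judicial evidence, its authenticity and integrity must therefore be verified. Source recording device verification (SRDV) for digital speech is one of the key problems in digital audio source device forensics: given a digital speech recording and a recording device, determine whether the recording was made by that device. In recent years, deep learning has been widely applied in many fields with good results, but existing work on recording device source recognition has focused mainly on source recording device identification (SRDI), and no deep-learning-based SRDV method has yet been reported. This paper proposes a novel recording device source forensics method based on end-to-end (E2E) deep learning. FBank features are extracted from speech recordings to represent the device fingerprint and serve as the input to the deep neural network. The network is a VGG-M network with adjusted parameters, followed by a self-attentive pooling (SAP) layer and a fully connected layer that together extract the recording device embedding (RDE). The whole network is trained with the generalized end-to-end (GE2E) loss. With the equal error rate (EER) as the evaluation criterion, source recording device verification experiments were carried out on partitioned development and test sets, and the results show that the proposed method significantly improves SRDV performance.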
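The abstract names the equal error rate (EER) as the evaluation criterion but does not say how it was computed. The following is a minimal sketch, not the authors' evaluation code: it assumes each verification trial yields a similarity score (for example, a cosine similarity between two recording device embeddings) and a binary label, and it approximates the EER by sweeping the observed scores as thresholds. The function name `equal_error_rate` and the threshold sweep are illustrative assumptions.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER from verification trial scores.

    scores: similarity scores; higher means "more likely the same device".
    labels: 1 for target trials (recording made by the claimed device),
            0 for non-target trials.
    This is an illustrative sketch, not the paper's evaluation procedure.
    """
    targets = [s for s, y in zip(scores, labels) if y == 1]
    nontargets = [s for s, y in zip(scores, labels) if y == 0]
    if not targets or not nontargets:
        raise ValueError("need at least one target and one non-target trial")

    best_gap, eer = None, None
    # Use every observed score as a decision threshold (accept if score >= t),
    # plus +inf so the "reject everything" operating point is also covered.
    for t in sorted(set(scores)) + [float("inf")]:
        far = sum(s >= t for s in nontargets) / len(nontargets)  # false acceptance rate
        frr = sum(s < t for s in targets) / len(targets)         # false rejection rate
        gap = abs(far - frr)
        # Keep the operating point where the two error rates are closest.
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

Called as `equal_error_rate(scores, labels)` on the trial list of a test set, this returns a value between 0 and 1, with lower values indicating better verification performance. With few trials the estimate is coarse, because only observed scores are tried as thresholds.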