Computer Science ›› 2025, Vol. 52 ›› Issue (6A): 240800028-7. doi: 10.11896/jsjkx.240800028

• Information Security •

Source Recording Device Verification Forensics of Digital Speech Based on End-to-End Deep Learning

ZOU Ling, ZHU Lei, DENG Yangjun, ZHANG Hongyan

  1. College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China
  • Online: 2025-06-16 Published: 2025-06-12
  • Corresponding author: ZOU Ling (lzou@hunau.edu.cn)
  • About author: ZOU Ling, born in 1981, Ph.D., lecturer, master’s supervisor, is a member of CCF (No. C3792M). His main interests include digital multimedia forensics, speech/audio signal processing, and deep learning.
  • Supported by:
    Scientific Research Fund of Hunan Provincial Education Department (23A0168), Natural Science Foundation of Hunan Province (2022JJ30308, 2023JJ40333), and National Natural Science Foundation of China (62202163).

Abstract: Audio editing software and DeepFake technology make it easy to tamper with and forge digital audio and speech recordings. The authenticity and integrity of a recording must therefore be established before it can be used as valid judicial evidence. Source recording device verification (SRDV) for digital speech is one of the key problems in source-device forensics of digital audio: given a speech recording and a recording device, SRDV determines whether the recording was made by the claimed device. In recent years, deep learning has been widely applied in many fields with good results. However, existing work on recording device source recognition has focused mainly on source recording device identification (SRDI), and no deep-learning-based SRDV method has been reported. This paper proposes a novel SRDV scheme based on end-to-end (E2E) deep learning. FBank features extracted from speech recordings characterize the device fingerprint and serve as the input to the deep neural network, which is a VGG-M model with adjusted parameters. The recording device embedding (RDE) is extracted through a self-attentive pooling (SAP) layer followed by a fully connected layer. The entire network is trained with the generalized end-to-end (GE2E) loss. The equal error rate (EER) is adopted as the evaluation metric, and verification experiments are conducted on a carefully designed development set and test set. The experimental results show that the proposed method significantly improves SRDV performance.
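
The verification decision and the EER metric described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine scoring against a device centroid, the threshold sweep, and the names `cosine_score` and `equal_error_rate` are assumptions made for illustration only.

```python
import numpy as np

def cosine_score(test_embedding, device_embeddings):
    """Score one trial: cosine similarity between a test embedding and the
    centroid (mean) of the claimed device's enrollment embeddings.
    Centroid scoring is an assumption, not taken from the paper."""
    centroid = np.mean(np.asarray(device_embeddings, dtype=float), axis=0)
    test = np.asarray(test_embedding, dtype=float)
    return float(np.dot(test, centroid) /
                 (np.linalg.norm(test) * np.linalg.norm(centroid)))

def equal_error_rate(target_scores, nontarget_scores):
    """Sweep every observed score as a threshold and return the error rate at
    the point where false acceptance and false rejection are closest."""
    target = np.asarray(target_scores, dtype=float)        # same-device trials
    nontarget = np.asarray(nontarget_scores, dtype=float)  # different-device trials
    best_gap, eer = np.inf, 1.0
    for t in np.sort(np.concatenate([target, nontarget])):
        far = np.mean(nontarget >= t)  # wrong device accepted
        frr = np.mean(target < t)      # right device rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return float(eer)

if __name__ == "__main__":
    # Hypothetical scores, made up for the example.
    same_device = [0.9, 0.8, 0.3, 0.7]
    other_device = [0.1, 0.2, 0.6, 0.75]
    print("EER:", equal_error_rate(same_device, other_device))
```

A lower EER means the accept/reject threshold separates same-device and different-device trials more cleanly, which is why it suits a verification task better than plain classification accuracy.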

Key words: Digital speech forensics, Acquisition device forensics, Source recording device verification, Recording device embedding, End-to-end deep learning
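
The abstract states that the network is trained with the GE2E loss. The sketch below shows the softmax variant of that loss as it is usually formulated for speaker verification, applied here to recording devices instead of speakers. It is a NumPy illustration under stated assumptions, not the paper's training code: the scale `w` and offset `b` are fixed here (they are normally learnable), and the function name and batch layout are hypothetical.

```python
import numpy as np

def ge2e_softmax_loss(embeddings, w=10.0, b=-5.0):
    """Mean GE2E softmax loss for a batch shaped (n_devices, n_utterances, dim).
    Requires n_utterances >= 2. w and b are assumed fixed values."""
    e = np.asarray(embeddings, dtype=float)
    n_dev, n_utt, _ = e.shape

    def unit(x):
        # Scale vectors along the last axis to unit length.
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    e = unit(e)
    centroids = unit(e.mean(axis=1))                       # (n_dev, dim)
    # Leave-one-out centroid: an utterance is excluded from its own device centroid.
    loo = unit((e.sum(axis=1, keepdims=True) - e) / (n_utt - 1))

    # cos[j, i, k]: similarity of utterance i of device j to the centroid of device k.
    cos = np.einsum("jid,kd->jik", e, centroids)
    own = np.einsum("jid,jid->ji", e, loo)
    idx = np.arange(n_dev)
    cos[idx, :, idx] = own                                 # own-device entries use leave-one-out
    sim = w * cos + b

    # Per-utterance loss: -S_own + log(sum_k exp(S_k)), computed stably.
    m = sim.max(axis=-1, keepdims=True)
    logsumexp = m[..., 0] + np.log(np.exp(sim - m).sum(axis=-1))
    return float((logsumexp - sim[idx, :, idx]).mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Three hypothetical devices, four utterances each, clustered around distinct axes.
    clustered = np.eye(3)[:, None, :] + 0.01 * rng.standard_normal((3, 4, 3))
    print("loss for well-separated devices:", ge2e_softmax_loss(clustered))
```

The loss pulls each embedding toward the centroid of its own device and pushes it away from the centroids of the other devices in the batch, so embeddings that already cluster by device yield a loss close to zero.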

CLC Number:

  • TP391