Computer Science, 2022, Vol. 49, Issue (11A): 220100106-9. doi: 10.11896/jsjkx.220100106

• Image Processing & Multimedia Technology •

Detection Method for Deepfake Faces Based on a Dual-stream Network Structure

LI Ying1, BIAN Shan1,2,3, WANG Chun-tao1,2, HUANG Qiong1,2

  1. 1 College of Mathematics and Informatics, South China Agricultural University, Guangzhou 510642, China
    2 Guangzhou Key Laboratory of Intelligent Agriculture, Guangzhou 510642, China
    3 Guangdong Provincial Key Laboratory of Information Security Technology, Guangzhou 510006, China
  • Online: 2022-11-10 Published: 2022-11-21
  • Corresponding author: BIAN Shan (bianshan@scau.edu.cn)
  • About author: (spade@stu.scau.edu.cn)
  • Supported by:
    National Natural Science Foundation of China (61702199, 62172165, 61872152), Major Program of Guangdong Basic and Applied Basic Research (2019B030302008), Opening Project of Guangdong Province Key Laboratory of Information Security Technology (2020B1212060078-07) and Science and Technology Program of Guangzhou (202102020582, 201902010081)

Detection of Deepfakes Based on Dual-stream Network

LI Ying1, BIAN Shan1,2,3, WANG Chun-tao1,2, HUANG Qiong1,2   

  1. 1 College of Mathematics and Informatics, South China Agricultural University, Guangzhou 510642, China
    2 Guangzhou Key Laboratory of Intelligent Agriculture, Guangzhou 510642, China
    3 Guangdong Provincial Key Laboratory of Information Security Technology, Guangzhou 510006, China
  • Online: 2022-11-10 Published: 2022-11-21
  • About author: LI Ying, born in 1998, postgraduate, is a student member of China Computer Federation. Her main research interests include video forensics.
    BIAN Shan, born in 1986, Ph.D, associate professor, is a member of China Computer Federation. Her main research interests include video forensics and tampering detection.
  • Supported by:
    National Natural Science Foundation of China (61702199, 62172165, 61872152), Major Program of Guangdong Basic and Applied Basic Research (2019B030302008), Opening Project of Guangdong Province Key Laboratory of Information Security Technology (2020B1212060078-07) and Science and Technology Program of Guangzhou (202102020582, 201902010081).

Abstract: Deepfake is a deep network model based on generative adversarial networks (GAN) that can use source and target faces to generate highly realistic face videos that are difficult to distinguish. If criminals use this technology to fabricate fake videos and spread rumors on the Internet, it will infringe on personal portrait rights, cause adverse social impact, and may even trigger serious judicial disputes. Facing the serious threat posed by deepfake technology, many research institutions at home and abroad pay close attention to deepfake detection research and have proposed a number of detection methods. Existing detection methods achieve good results on high-quality videos; however, videos in daily applications are usually compressed into low-quality versions by social software, and on such low-quality datasets the accuracy of most existing fake-face detection methods drops markedly. Moreover, the detection performance of existing methods in cross-dataset scenarios is also unsatisfactory. To address these limitations of existing work, this paper proposes a dual-stream network structure based on the Xception model under an attention mechanism. The network consists of an RGB branch that uses a multiple attention mechanism and a frequency-domain branch for capturing artifact effects in low-quality videos. Our study finds that the subtle differences between real and forged images are mostly concentrated in local regions; therefore, the RGB branch under the multiple attention mechanism makes the model focus on different regions of the face and, guided by the attention maps, obtains global features aggregated from low-level texture features and high-level semantic features. The frequency-domain branch introduces the discrete cosine transform (DCT) as the frequency-domain transform, providing a feature representation complementary to the RGB branch; this branch can reflect subtle forgery traces or compression errors. To verify the effectiveness of the proposed network structure, extensive comparative experiments are conducted on three public datasets: FaceForensics++, Celeb-DF and DFDC. Experimental results show that the proposed algorithm outperforms existing detection algorithms on low-quality video sets and that the model achieves better detection performance in cross-dataset scenarios, verifying that the proposed combination of RGB and frequency-domain dual-stream features under the attention mechanism improves the robustness of the detection model on low-quality video sets and in cross-dataset settings.
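As a concrete but purely illustrative reading of the frequency-domain branch described above, the Python/PyTorch sketch below applies a full-image 2D DCT and splits the spectrum into three learnable low/mid/high bands before mapping them back to the spatial domain. The class name FrequencyBranch, the grayscale 299x299 input format, and the band boundaries are assumptions made for illustration, not the authors' released code:

    import math
    import torch
    import torch.nn as nn

    def dct_matrix(n: int) -> torch.Tensor:
        # Orthonormal DCT-II basis matrix (row k, column i).
        i = torch.arange(n, dtype=torch.float32)
        k = torch.arange(n, dtype=torch.float32).unsqueeze(1)
        d = torch.cos(math.pi / n * (i + 0.5) * k)
        d[0] *= 1.0 / math.sqrt(2.0)
        return d * math.sqrt(2.0 / n)

    class FrequencyBranch(nn.Module):
        # Sketch of a frequency-domain front end: full-image 2D DCT followed by
        # three band filters (low/mid/high) that stay learnable, mapped back to
        # the spatial domain so a CNN can consume the three-band decomposition.
        def __init__(self, size: int = 299):
            super().__init__()
            self.register_buffer("dct", dct_matrix(size))
            bands = torch.zeros(3, size, size)
            r = torch.arange(size).view(-1, 1) + torch.arange(size).view(1, -1)
            bands[0][r < size // 8] = 1.0                        # low-frequency band
            bands[1][(r >= size // 8) & (r < size // 2)] = 1.0   # mid-frequency band
            bands[2][r >= size // 2] = 1.0                       # high-frequency band
            self.filters = nn.Parameter(bands)  # learnable refinement of the bands

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (B, 1, size, size) grayscale face crops (assumed input format).
            spec = self.dct @ x @ self.dct.t()          # 2D DCT of every crop
            banded = spec * self.filters.unsqueeze(0)   # (B, 3, size, size)
            return self.dct.t() @ banded @ self.dct     # inverse DCT per band

    faces = torch.rand(2, 1, 299, 299)                  # e.g. Xception-sized crops
    print(FrequencyBranch(299)(faces).shape)            # torch.Size([2, 3, 299, 299])

Keeping the band masks trainable, rather than fixed, is one plausible way for such a branch to stay sensitive to the subtle forgery traces and compression errors the abstract refers to.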

Key words: Deepfake, Video forensics, Dual-stream network, Attention mechanism, RGB branch, Frequency-domain branch

Abstract: Deepfake is a kind of deep network model based on generative adversarial networks (GAN). It uses source and target faces to generate highly realistic face videos that are difficult to identify. If a malicious actor uses this technology to make fake videos and spread rumors on the Internet, it will infringe personal portrait rights, cause adverse social impact, or even lead to serious judicial disputes. In view of the serious threat posed by deepfake technology, many researchers at home and abroad pay close attention to the study of deepfake detection and have put forward several effective detection methods. Existing detection methods achieve good results on high-quality videos, but most videos in daily applications are compressed into low-quality versions by social software, and the accuracy of most existing deepfake detection methods declines significantly on such low-quality videos. Besides, the detection performance of existing methods remains unsatisfactory in cross-dataset settings, limiting their real-world application. To address these issues, this paper proposes a dual-stream network structure based on the Xception model under a multiple attention mechanism. The network consists of an RGB branch using a multiple attention mechanism and a frequency-domain branch for capturing artifacts in low-quality videos. Our study finds that the tiny differences between real and fake images tend to concentrate in local areas. The RGB branch under the multiple attention mechanism makes the model focus on different regions of the face, so it can obtain global features aggregated from low-level texture features and high-level semantic features under the guidance of the attention maps. Complementing the RGB branch, the discrete cosine transform (DCT) is introduced in the frequency-domain branch to provide a complementary feature representation that can reflect subtle forgery traces or compression errors. Specifically, the proposed algorithm first extracts a large number of face frames from videos with a face extraction algorithm and feeds these face frames into the two-branch network. The frequency branch decomposes the spectrum of each image with three combined filters that provide additional learnable components. In the RGB branch, the first three layers of the backbone network extract shallow features, including texture information; the attention module then makes the model attend to this shallow information from different local areas. The shallow information is fed to the attention pooling layer and aggregated with the high-level semantic features from the remaining layers of the backbone. Finally, the network merges the feature vectors from the RGB branch and the frequency branch to obtain the final discriminant result. The combination of the two branches significantly improves the detection performance of the model in cross-dataset scenarios and on low-quality video sets. To verify the effectiveness of the proposed network structure, extensive comparative experiments are conducted on three public datasets: FaceForensics++, Celeb-DF and DFDC. On the low-quality part of the FaceForensics++ dataset, the AUC (area under the curve) reaches 0.9271. At the video level, the detection accuracy on low-quality and high-quality videos reaches 93.84% and 99.69%, respectively. Experimental results show that the proposed algorithm outperforms existing detection algorithms on low-quality video sets as well as in cross-dataset scenarios, verifying that the combination of dual-stream features from the RGB branch and the frequency branch can improve the robustness of the detection method, especially on low-quality video sets and in cross-dataset settings.
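To make the aggregation and fusion steps above concrete, the following PyTorch sketch shows one plausible form of the attention-guided pooling and of the final two-branch fusion. AttentionPooling, DualStreamHead, the number of attention maps and the feature dimensions are hypothetical choices for illustration; the paper's actual modules may differ:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AttentionPooling(nn.Module):
        # Sketch of attention-guided aggregation: M attention maps predicted from
        # shallow (texture) features re-weight the deep (semantic) feature maps,
        # yielding one pooled vector per attention map.
        def __init__(self, shallow_channels: int, num_maps: int = 4):
            super().__init__()
            self.attn = nn.Conv2d(shallow_channels, num_maps, kernel_size=1)

        def forward(self, shallow: torch.Tensor, deep: torch.Tensor) -> torch.Tensor:
            # shallow: (B, C_s, H, W) from the early backbone blocks
            # deep:    (B, C_d, h, w) from the later backbone blocks
            if deep.shape[-2:] != shallow.shape[-2:]:
                deep = F.interpolate(deep, size=shallow.shape[-2:],
                                     mode="bilinear", align_corners=False)
            maps = torch.softmax(self.attn(shallow).flatten(2), dim=-1)   # (B, M, H*W)
            pooled = torch.einsum("bmn,bcn->bmc", maps, deep.flatten(2))  # (B, M, C_d)
            return pooled.flatten(1)                                      # (B, M*C_d)

    class DualStreamHead(nn.Module):
        # Sketch of the final fusion: concatenate the RGB-branch and the
        # frequency-branch feature vectors and classify real vs. fake.
        def __init__(self, rgb_dim: int, freq_dim: int):
            super().__init__()
            self.fc = nn.Linear(rgb_dim + freq_dim, 2)

        def forward(self, rgb_feat: torch.Tensor, freq_feat: torch.Tensor) -> torch.Tensor:
            return self.fc(torch.cat([rgb_feat, freq_feat], dim=1))  # {real, fake} logits

Under this sketch, rgb_feat would be the output of AttentionPooling and freq_feat a globally pooled vector from the frequency branch; keeping the two vectors separate until the last linear layer is one way to let the complementary RGB and frequency cues be weighed jointly at decision time.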

Key words: Deepfake, Video forensics, Dual-stream network, Attention mechanism, RGB branch, Frequency branch

CLC Number:

  • TP391