Computer Science, 2023, Vol. 50, Issue (10): 112-118. doi: 10.11896/jsjkx.220900048

• Computer Graphics & Multimedia •

Forgery Face Detection Based on Multi-scale Transformer Fusing Multi-domain Information

MA Xin1,2, JI Lixin2, LI Shaomei2

  1 School of Cyber Science and Engineering, Zhengzhou University, Zhengzhou 450001, China
  2 Institute of Information Technology, PLA Strategic Support Force Information Engineering University, Zhengzhou 450002, China
  • Received: 2022-09-06 Revised: 2022-12-10 Online: 2023-10-10 Published: 2023-10-10
  • Corresponding author: LI Shaomei (lishaomei_may@126.com)
  • About author: (15543782756@163.com) MA Xin, born in 1997, postgraduate. Her main research interests include deep learning and computer vision. LI Shaomei, born in 1982, Ph.D., associate professor. Her main research interests include computer vision.
  • Supported by:
    Science Fund for Creative Research Groups of the National Natural Science Foundation of China (61521003).

Abstract: “Face-swapping” forged videos generated with deep forgery techniques such as Deepfakes are proliferating and pose a serious threat to citizens' privacy and national political security, so detecting deepfake faces in video is of great importance. Existing forged-face detection methods extract facial features insufficiently and generalize poorly. To address these shortcomings, this paper proposes a forged-face detection method that uses a multi-scale Transformer to fuse multi-domain information. First, following the idea of multi-domain feature fusion, features are extracted from both the frequency domain and the RGB domain of video frames to improve the model's generalization. Second, EfficientNet and a multi-scale Transformer are combined into a multi-level feature extraction network that extracts finer-grained forgery features. Test results on open-source datasets show that the proposed method detects forgeries better than existing methods, and cross-dataset experiments show that the proposed model generalizes well.

Key words: Forgery face detection, Multi-scale Transformer, EfficientNet, Frequency domain features, Feature fusion
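
The abstract describes extracting features from both the frequency domain and the RGB domain of a video frame. As a rough illustration of that idea only, the sketch below builds a frequency-domain view of a tiny frame and stacks it with the RGB channels. The abstract does not say which transform or fusion operator the paper uses, so the plain 2D discrete Fourier transform, the grayscale averaging, the channel-wise concatenation, and all function names (`dft_1d`, `log_magnitude_spectrum`, `fuse_rgb_and_frequency`) are assumptions made for this example. It does not reproduce the EfficientNet or multi-scale Transformer stages.

```python
import cmath
import math


def dft_1d(values):
    """Naive 1D discrete Fourier transform (fine for tiny illustrative inputs)."""
    n = len(values)
    return [
        sum(values[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def log_magnitude_spectrum(gray):
    """2D DFT via row-wise then column-wise 1D DFTs; returns log(1 + |F|)."""
    rows = [dft_1d(row) for row in gray]
    cols = [dft_1d(list(col)) for col in zip(*rows)]
    spectrum = [list(row) for row in zip(*cols)]  # transpose back to H x W
    return [[math.log1p(abs(c)) for c in row] for row in spectrum]


def fuse_rgb_and_frequency(frame):
    """Map an H x W grid of (r, g, b) bytes to an H x W grid of [r, g, b, f].

    r, g, b are scaled to [0, 1]; f is the log-magnitude spectrum of the
    grayscale frame, scaled by its peak. Concatenation is an assumed
    fusion choice, not the paper's stated operator.
    """
    gray = [[sum(px) / (3 * 255.0) for px in row] for row in frame]
    freq = log_magnitude_spectrum(gray)
    peak = max(max(row) for row in freq) or 1.0  # avoid dividing by zero
    return [
        [[c / 255.0 for c in px] + [f / peak] for px, f in zip(prow, frow)]
        for prow, frow in zip(frame, freq)
    ]


# Usage: a constant 4 x 4 white frame has all its energy in the DC term.
white = [[(255, 255, 255)] * 4 for _ in range(4)]
fused = fuse_rgb_and_frequency(white)
print(len(fused), len(fused[0]), len(fused[0][0]))
```

In practice the transform would be computed with an optimized routine such as `numpy.fft.fft2`, and each domain would feed a learned feature extractor rather than being concatenated at the pixel level; the pure-Python version here only keeps the example dependency-free.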

CLC Number:

  • TP391