Computer Science ›› 2025, Vol. 52 ›› Issue (11A): 241200164-10.  DOI: 10.11896/jsjkx.241200164

• Computer Graphics & Multimedia •

Review of Applications of Artificial Intelligence Generated Content in Video Processing

WANG Zhongyuan, WANG Baoshan, WANG Yongjun, YUAN Tianhao

  1. School of Mathematical Sciences, Beihang University, Beijing 102206, China
  • Online: 2025-11-15  Published: 2025-11-10
  • Corresponding author: WANG Baoshan (bwang@buaa.edu.cn)
  • About the author: zywang111@buaa.edu.cn
  • Supported by:
    National Natural Science Foundation of China (12371016, 11871083).

Abstract: Artificial intelligence generated content has become a key research focus in recent years, particularly in the field of video processing. The emergence of new technologies such as Sora has sparked a new wave of research enthusiasm. This paper introduces the development and applications of artificial intelligence generated content in video processing and discusses future research directions and challenges. The paper consists of three parts. Firstly, it reviews the early foundational models of artificial intelligence generated content in video processing, including generative adversarial networks, variational autoencoders, diffusion models and other architectures, and summarizes the models that have made significant innovations or achieved excellent results in video generation tasks. Secondly, it compares the strengths and weaknesses of new video generation models released before and after the introduction of Sora in 2023-2024 along three dimensions: basic properties, video generation quality and human subjective perspective. Finally, based on the analysis of these data, it outlines future development directions and challenges in video generation, offering a reference for researchers in related fields and promoting the widespread adoption of artificial intelligence generated content in video processing.

Key words: Artificial intelligence generated content, Sora model, Video generation, Model comparison, Development and challenges

CLC Number: TP183