Review of Applications of Artificial Intelligence Generated Content in Video Processing

doi:10.11896/jsjkx.241200164

Abstract: Artificial intelligence generated content has become a key research focus in recent years, particularly in video processing, and the emergence of new technologies such as Sora has sparked a fresh wave of research interest. This paper introduces the development and applications of artificial intelligence generated content in video processing and discusses future research directions and challenges. The paper has three parts. First, it introduces the early foundational models of artificial intelligence generated content in video processing, including generative adversarial networks, variational autoencoders, diffusion models, and other models, and summarizes the models that have made significant innovations or achieved excellent results in video generation tasks. Second, it compares the advantages and disadvantages of new video generation models from 2023-2024, before and after the introduction of Sora, along three dimensions: basic properties, video generation quality, and human subjective perspective. Finally, based on data analysis, it outlines future development directions and challenges in video generation, offering insights for researchers in related fields and supporting the wider adoption of generative artificial intelligence in video processing.

Key words: Artificial intelligence generated content, Sora model, Video generation, Model comparison, Development and challenge

CLC Number: TP183

WANG Zhongyuan, WANG Baoshan, WANG Yongjun, YUAN Tianhao. Review of Applications of Artificial Intelligence Generated Content in Video Processing [J]. Computer Science, 2025, 52(11A): 241200164-10.

