Music Emotion Transformation Model Based on CVAE-WGAN

doi:10.11896/jsjkx.241100014

Abstract

Music is an important means of human emotional expression. Music emotion transformation converts an original piece into music that carries a target emotion, meeting users' demand for emotionally diverse music and improving creative efficiency. Existing approaches build deep learning models that perform the transformation end to end. However, the emotion vectors they use to represent music correspond only weakly to actual musical features, so the intermediate layers lack interpretability; this limits the accuracy of emotion transformation to some extent and may give rise to vanishing gradients. To address these problems, a music emotion transformation model based on the CVAE-WGAN (Conditional Variational Autoencoder Wasserstein Generative Adversarial Network) architecture is proposed. The model replaces the traditional GAN with a WGAN-GP network, introducing the Wasserstein distance and a gradient penalty that effectively avoid mode collapse and vanishing gradients and thereby improve training stability and generation quality. To address the lack of interpretability in the intermediate process of the generative model, 64 intermediate perceptual features with clear interpretations, covering melody, harmony, rhythm, dynamics, timbre, expressiveness and form, are incorporated into the model as latent space variables, so that every dimension of the latent space corresponds to a specific musical feature. In addition, a Gaussian mixture model replaces the single Gaussian used in the variational autoencoder to capture and represent the distribution of musical features under different emotion categories. Experimental results show that the model performs well on mutual transformation among six typical emotions (happiness, sadness, tenderness, anger, fear and surprise) and outperforms the comparison models in emotion accuracy, reconstruction error, generation coherence and generation diversity.

Key words: Music emotion transformation, CVAE-WGAN, Swin Transformer, Intermediate perceptual features, Gaussian mixture model
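
The Gaussian mixture prior mentioned in the abstract can be illustrated with a minimal sketch. It assumes one diagonal Gaussian component per emotion over the 64-dimensional latent space, which the abstract does not confirm. The component means and unit standard deviations are arbitrary placeholders (the actual model would presumably learn them), and the names `make_prior`, `sample_latent` and `kl_to_component` are hypothetical.

```python
import math
import random

EMOTIONS = ["happiness", "sadness", "tenderness", "anger", "fear", "surprise"]
LATENT_DIM = 64  # one latent dimension per interpretable perceptual feature


def make_prior(seed=0):
    """Hypothetical prior: one diagonal Gaussian (means, stds) per emotion.

    Means are random placeholders and stds are fixed to 1; a trained model
    would learn both.
    """
    rng = random.Random(seed)
    return {
        emotion: (
            [rng.uniform(-1.0, 1.0) for _ in range(LATENT_DIM)],
            [1.0] * LATENT_DIM,
        )
        for emotion in EMOTIONS
    }


def sample_latent(prior, emotion, rng):
    """Draw z = mu_c + sigma_c * eps from the target emotion's component."""
    mu, sigma = prior[emotion]
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def kl_to_component(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL(N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2)), diagonal case."""
    return sum(
        math.log(sp / sq) + (sq * sq + (mq - mp) ** 2) / (2.0 * sp * sp) - 0.5
        for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p)
    )


if __name__ == "__main__":
    prior = make_prior()
    z = sample_latent(prior, "sadness", random.Random(1))
    mu_sad, sigma_sad = prior["sadness"]
    mu_joy, sigma_joy = prior["happiness"]
    # KL from the "sadness" component to the "happiness" component is positive
    # because the two components have different means.
    print(len(z), kl_to_component(mu_sad, sigma_sad, mu_joy, sigma_joy))
```

Under this assumption, the KL term of a conditional VAE would pull the encoder's posterior toward the component of the labelled emotion rather than toward a single shared Gaussian, so that sampling from a different component at generation time is one plausible way to steer the output toward a target emotion.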

CLC Number: TP181

XU Bei, ZHAO Dan. Music Emotion Transformation Model Based on CVAE-WGAN[J]. Computer Science, 2025, 52(11A): 241100014-13. https://doi.org/10.11896/jsjkx.241100014

