Computer Science ›› 2025, Vol. 52 ›› Issue (11A): 241100014-13. doi: 10.11896/jsjkx.241100014

• Image Processing & Multimedia Technology •

Music Emotion Transformation Model Based on CVAE-WGAN

XU Bei1,2, ZHAO Dan1   

  1 School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
  2 Jiangsu Key Laboratory of Big Data Security & Intelligent Processing, Nanjing 210023, China
  • Online:2025-11-15 Published:2025-11-10
  • Supported by:
    Natural Science Foundation of the Jiangsu Higher Education Institutions of China(21KJB520017).

Abstract: Music is an important means of emotional expression and a powerful tool for conveying feelings. Music emotion transformation converts an original piece of music into music with a target emotion, meeting users’ demand for emotionally diverse music and significantly improving creative efficiency. Existing music emotion transformation methods achieve end-to-end transformation by constructing sophisticated deep learning models. In these methods, however, the emotional vector that represents the music corresponds only weakly to actual musical features. The intermediate network layers therefore lack interpretability, which limits the accuracy of emotion transformation to some extent and may contribute to vanishing gradients. To address these issues, a new music emotion transformation model based on the CVAE-WGAN (Conditional Variational Autoencoder Wasserstein Generative Adversarial Network) architecture is proposed. The model replaces the traditional GAN module with a WGAN-GP network, introducing the Wasserstein distance and a gradient penalty mechanism. This effectively avoids mode collapse and vanishing gradients, improving training stability and the quality of the generated music. To address the lack of interpretability in the intermediate process of the generative model, 64 intermediate perceptual features with clear interpretations are introduced, covering melody, harmony, rhythm, dynamics, timbre, expressiveness and form. These features are incorporated into the model as latent space variables, so that each dimension of the latent space corresponds to a specific and meaningful musical feature. In addition, a Gaussian mixture model replaces the single Gaussian model traditionally used in the variational autoencoder, in order to capture the nuanced distribution of musical features across different emotional categories. Experimental results show that the model performs well in transformations among six distinct emotions: happiness, sadness, tenderness, anger, fear, and surprise. Moreover, the proposed model outperforms the comparison models in emotional accuracy, reconstruction error, generation coherence, and generation diversity.
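
As an illustration of the gradient penalty mechanism mentioned in the abstract, the following is a minimal sketch of the standard WGAN-GP critic loss, E[D(x_fake)] - E[D(x_real)] + lam * E[(||grad D(x_hat)||_2 - 1)^2], where x_hat is a random interpolation between a real and a generated sample. It is not the authors' implementation: the toy `critic` (a tanh of a linear map, chosen so that its input gradient is available in closed form), the function names, and the default `lam=10.0` are assumptions made for this example only. A real model would use a neural critic and automatic differentiation.

```python
import math
import random


def critic(x, w, b):
    """Toy critic D(x) = tanh(w . x + b); a hypothetical stand-in for a neural critic."""
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + b)


def critic_input_grad(x, w, b):
    """Analytic gradient of the toy critic with respect to its input x."""
    d = critic(x, w, b)
    return [(1.0 - d * d) * wi for wi in w]


def gradient_penalty(real, fake, w, b, lam=10.0, rng=None):
    """lam * mean((||grad D(x_hat)||_2 - 1)^2) over random real/fake interpolates."""
    rng = rng or random.Random(0)  # fixed seed by default, for repeatability
    total = 0.0
    for x, x_tilde in zip(real, fake):
        eps = rng.random()  # interpolation weight drawn uniformly from [0, 1)
        x_hat = [eps * xr + (1.0 - eps) * xf for xr, xf in zip(x, x_tilde)]
        norm = math.sqrt(sum(g * g for g in critic_input_grad(x_hat, w, b)))
        total += (norm - 1.0) ** 2
    return lam * total / len(real)


def critic_loss(real, fake, w, b, lam=10.0, rng=None):
    """WGAN-GP critic loss: E[D(fake)] - E[D(real)] + gradient penalty."""
    mean_fake = sum(critic(x, w, b) for x in fake) / len(fake)
    mean_real = sum(critic(x, w, b) for x in real) / len(real)
    return mean_fake - mean_real + gradient_penalty(real, fake, w, b, lam, rng)
```

The penalty pushes the norm of the critic's input gradient toward 1 on points between the real and generated distributions. This is a soft way of enforcing the Lipschitz constraint that the Wasserstein formulation requires, and it is commonly credited with more stable training than weight clipping.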

Key words: Emotion transformation in music, CVAE-WGAN, Swin transformer, Intermediate perceptual features, Gaussian mixture model

CLC Number: 

  • TP181