Computer Science ›› 2024, Vol. 51 ›› Issue (8): 281-296.doi: 10.11896/jsjkx.230500124

• Artificial Intelligence •

Semi-supervised Emotional Music Generation Method Based on Improved Gaussian Mixture Variational Autoencoders

XU Bei1,2, LIU Tong1   

  1 School of Computer Science,Nanjing University of Posts and Telecommunications,Nanjing 210023,China
    2 Jiangsu Key Laboratory of Big Data Security & Intelligent Processing,Nanjing 210023,China
  • Received:2023-05-21 Revised:2023-11-14 Online:2024-08-15 Published:2024-08-13
  • About author:XU Bei,born in 1986,Ph.D,associate professor,is a member of CCF(No.P1014M).His main research interests include affective computing and natural language processing.
    LIU Tong,born in 1997,postgraduate.His main research interests include affective computing and music generation.
  • Supported by:
    Natural Science Foundation of the Jiangsu Higher Education Institutions of China(21KJB520017).

Abstract: Music conveys both audio content and emotion through serialized audio features. Emotion is a key component of the semantic expression of music, so music generation technology should consider not only the structural information of music but also its emotional content. Most existing emotional music generation methods are fully supervised and rely on emotion labels. However, the music domain lacks large-scale datasets with standardized emotion annotations, and discrete emotion labels alone cannot fully express the emotional characteristics of music. To address these problems, this paper proposes a semi-supervised emotional music generation method (Semg-GMVAE) based on an improved Gaussian mixture variational autoencoder (GMVAE). The method links the rhythm and mode features of music to emotion, incorporates a feature disentanglement mechanism into the GMVAE to learn latent variable representations of these two features, and performs semi-supervised clustering inference on them. By manipulating the learned feature representations, the model can generate music and switch its emotion among happy, tense, sad, and calm. This paper also conducts a series of experiments on the difficulty GMVAE has in distinguishing data of different emotional categories. The key cause is that the variance regularization term and the mutual information suppression term in the evidence lower bound of GMVAE make the Gaussian components of the categories less dispersed, which degrades both the learned representations and the quality of generation. Semg-GMVAE therefore penalizes the former term and augments the latter, and uses Transformer-XL as its encoder and decoder to strengthen the modeling of long music sequences. Experimental results on real data show that, compared with existing methods, Semg-GMVAE better separates music with different emotions in the latent space, strengthens the correlation between music and emotion, effectively disentangles different musical features, and ultimately achieves better emotional music generation and emotion switching by changing the feature representations.
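For context on the modification described above, the standard GMVAE evidence lower bound (following Dilokthanakul et al., reference [8] below) can be sketched as

\mathcal{L}_{\mathrm{GMVAE}} = \mathbb{E}_{q_\phi(z\mid x)}\big[\log p_\theta(x\mid z)\big] - \mathbb{E}_{q_\phi(y\mid x)}\big[\mathrm{KL}\big(q_\phi(z\mid x)\,\|\,p_\theta(z\mid y)\big)\big] - \mathrm{KL}\big(q_\phi(y\mid x)\,\|\,p(y)\big),

where y is the discrete cluster (emotion) variable and z the continuous latent. The penalty and augmentation described in the abstract can then be read, schematically, as

\mathcal{L}_{\mathrm{Semg}} \approx \mathcal{L}_{\mathrm{GMVAE}} - \beta\,\mathcal{R}_{\sigma} + \lambda\, I_q(x;y), \qquad \beta,\lambda > 0,

with \mathcal{R}_{\sigma} denoting the variance regularization component of the middle KL term and I_q(x;y) the mutual information between inputs and cluster assignments that the final KL term suppresses. The weights \beta and \lambda here are illustrative placeholders, not values taken from the paper.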

Key words: Emotional music generation, Semi-supervised generative models, Disentangled representation learning, Gaussian mixture variational autoencoders, Transformer-XL

CLC Number: TP181
[1]TIE Y,CHEN H J,JIN C,et al.Research on emotion recognition method based on audio and video feature fusion[J].Journal of Chongqing University of Technology(Natural Science),2022,36(1):120-127.
[2]MA L,ZHONG W,MA X,et al.Learning to generate emotional music correlated with music structure features[J].Cognitive Computation and Systems,2022,4(2):100-107.
[3]SULUN S,DAVIES M E P,VIANA P.Symbolic music generation conditioned on continuous-valued emotions[J].IEEE Access,2022,10:44617-44626.
[4]HUNG H T,CHING J,DOH S,et al.EMOPIA:A Multi-Modal Pop Piano Dataset For Emotion Recognition and Emotion-based Music Generation[C]//Proceedings of the 22nd International Society for Music Information Retrieval Conference(ISMIR).2021:318-325.
[5]KINGMA D P,WELLING M.Auto-encoding variational bayes[J].arXiv:1312.6114,2013.
[6]GREKOW J,DIMITROVA-GREKOW T.Monophonic music generation with a given emotion using conditional variational autoencoder[J].IEEE Access,2021,9:129088-129101.
[7]TAN H H,HERREMANS D.Music FaderNets:Controllable Music Generation Based on High-Level Features via Low-Level Feature Modelling[C]//Proceedings of the 21st International Society for Music Information Retrieval Conference(ISMIR).2020:109-116.
[8]DILOKTHANAKUL N,MEDIANO P A M,GARNELO M,et al.Deep unsupervised clustering with Gaussian mixture variational autoencoders[C]//International Conference on Learning Representations(ICLR).2017.
[9]RUSSELL J A.A circumplex model of affect[J].Journal of Personality and Social Psychology,1980,39(6):1161.
[10]LI Z,ZHAO Y,XU H,et al.Unsupervised clustering through Gaussian mixture variational autoencoder with non-reparameterized variational inference and std annealing[C]//2020 International Joint Conference on Neural Networks(IJCNN).IEEE,2020:1-8.
[11]ROBERTS A,ENGEL J,RAFFEL C,et al.A hierarchical latent vector model for learning long-term structure in music[C]//International Conference on Machine Learning(ICML).PMLR,2018:4364-4373.
[12]BRUNNER G,KONRAD A,WANG Y,et al.MIDI-VAE:Modeling dynamics and instrumentation of music with applications to style transfer[C]//Proceedings of the 19th International Society for Music Information Retrieval Conference(ISMIR).2018:747-754.
[13]DAI Z,YANG Z,YANG Y,et al.Transformer-XL:Attentive Language Models beyond a Fixed-Length Context[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics(ACL).2019:2978-2988.
[14]HEVNER K.Experimental studies of the elements of expression in music[J].The American Journal of Psychology,1936,48(2):246-268.
[15]CHEŁKOWSKA-ZACHAREWICZ M,JANOWSKI M.Polish adaptation of the Geneva Emotional Music Scale:Factor structure and reliability[J].Psychology of Music,2021,49(5):1117-1131.
[16]THAYER R E.The biopsychology of mood and arousal[M].Oxford University Press,1990.
[17]MEHRABIAN A.Silent messages:implicit communication of emotions and attitudes[M].Wadsworth Pub,1981.
[18]KREUTZ G,OTT U,TEICHMANN D,et al.Using music to induce emotions:Influences of musical preference and absorption[J].Psychology of Music,2008,36(1):101-126.
[19]VIEILLARD S,PERETZ I,GOSSELIN N,et al.Happy,sad,scary and peaceful musical excerpts for research on emotions[J].Cognition & Emotion,2008,22(4):720-752.
[20]YANG X,SONG Z,KING I,et al.A survey on deep semi-supervised learning[J].arXiv:2103.00550,2021.
[21]KINGMA D P,REZENDE D J,MOHAMED S,et al.Semi-supervised learning with deep generative models[C]//Proceedings of the 27th International Conference on Neural Information Processing Systems-Volume 2(NIPS).2014:3581-3589.
[22]HABIB R,MARIOORYAD S,SHANNON M,et al.Semi-supervised generative modeling for controllable speech synthesis[C]//International Conference on Learning Representations(ICLR).2019.
[23]CHEUNG V K M,KAO H K,SU L.Semi-supervised violin fingering generation using variational autoencoders[C]//Proceedings of the 22nd International Society for Music Information Retrieval Conference(ISMIR).2021:113-120.
[24]SCHUSTER M,PALIWAL K K.Bidirectional recurrent neural networks[J].IEEE Transactions on Signal Processing,1997,45(11):2673-2681.
[25]LI Y,PAN Q,WANG S,et al.Disentangled variational autoencoder for semi-supervised learning[J].Information Sciences,2019,482:73-85.
[26]JOY T,SCHMON S M,TORR P H S,et al.Capturing label characteristics in VAEs[C]//International Conference on Learning Representations(ICLR).2021.
[27]DEMPSTER A P,LAIRD N M,RUBIN D B.Maximum likelihood from incomplete data via the EM algorithm[J].Journal of the Royal Statistical Society:Series B(Methodological),1977,39(1):1-22.
[28]LUO Y J,AGRES K,HERREMANS D.Learning disentangled representations of timbre and pitch for musical instrument sounds using Gaussian mixture variational autoencoders[C]//Proceedings of 20th International Society for Music Information Retrieval Conference(ISMIR).2019:746-753.
[29]BENGIO Y,COURVILLE A,VINCENT P.Representation learning:A review and new perspectives[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2013,35(8):1798-1828.
[30]WANG X,CHEN H,TANG S,et al.Disentangled Representation Learning[J].arXiv:2211.11695,2022.
[31]HIGGINS I,MATTHEY L,PAL A,et al.beta-VAE:Learning basic visual concepts with a constrained variational framework[C]//International Conference on Learning Representations(ICLR).2017.
[32]CHEN R T Q,LI X,GROSSE R,et al.Isolating sources of disentanglement in VAEs[C]//Proceedings of the 32nd International Conference on Neural Information Processing Systems(NIPS).2018:2615-2625.
[33]KUMAR A,SATTIGERI P,BALAKRISHNAN A.Variational inference of disentangled latent concepts from unlabeled observations[C]//International Conference on Learning Representations(ICLR).2018.
[34]WANG Z,WANG D,ZHANG Y,et al.Learning interpretable representation for controllable polyphonic music generation[C]//Proceedings of the 21st International Society for Music Information Retrieval Conference(ISMIR).2020:662-669.
[35]YANG R,WANG D,WANG Z,et al.Deep music analogy via latent representation disentanglement[C]//Proceedings of the 20th International Society for Music Information Retrieval Conference(ISMIR).2019:596-603.
[36]WU Y,CARSAULT T,NAKAMURA E,et al.Semi-supervised neural chord estimation based on a variational autoencoder with latent chord labels and features[J].IEEE/ACM Transactions on Audio,Speech,and Language Processing,2020,28:2956-2966.
[37]AKAMA T.Controlling Symbolic Music Generation based on Concept Learning from Domain Knowledge[C]//Proceedings of the 20th International Society for Music Information Retrieval Conference(ISMIR).2019:816-823.
[38]CHOI K,CHO K.Deep unsupervised drum transcription[C]//Proceedings of 20th International Society for Music Information Retrieval Conference(ISMIR).2019:183-191.
[39]ZHANG Y.Representation learning for controllable music generation:A survey[C]//Proceedings of the 21st International Society for Music Information Retrieval Conference(ISMIR).2020:1-8.
[40]MI L,HE T,PARK C F,et al.Revisiting Latent Space Interpolation via a Quantitative Evaluation Framework[J].arXiv:2110.06421,2021.
[41]JIANG Z,ZHENG Y,TAN H,et al.Variational deep embedding:an unsupervised and generative approach to clustering[C]//Proceedings of the 26th International Joint Conference on Artificial Intelligence(IJCAI).2017:1965-1972.
[42]ZHAO T,LEE K,ESKENAZI M.Unsupervised Discrete Sentence Representation Learning for Interpretable Neural Dialog Generation[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics(ACL).2018:1098-1107.
[43]REZAABAD A L,VISHWANATH S.Learning representations by maximizing mutual information in variational autoencoders[C]//2020 IEEE International Symposium on Information Theory(ISIT).IEEE,2020:2729-2734.
[44]VASWANI A,SHAZEER N,PARMAR N,et al.Attention is all you need[C]//Advances in Neural Information Processing Systems(NeurIPS).2017:5998-6008.
[45]JIANG J,XIA G G,CARLTON D B,et al.Transformer VAE:A Hierarchical Model for Structure-Aware and Interpretable Music Representation Learning[C]//2020 IEEE International Conference on Acoustics,Speech and Signal Processing(ICASSP 2020).IEEE,2020:516-520.
[46]WU S L,YANG Y H.MuseMorphose:Full-song and fine-grained piano music style transfer with one transformer VAE[J].IEEE/ACM Transactions on Audio,Speech,and Language Processing,2023,31:1953-1967.
[47]DONG H W,HSIAO W Y,YANG L C,et al.MuseGAN:Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment[C]//Proceedings of the AAAI Conference on Artificial Intelligence(AAAI).2018:34-41.
[48]BERTIN-MAHIEUX T,ELLIS D P W,WHITMAN B,et al.The million song dataset[C]//Proceedings of 12th International Society for Music Information Retrieval Conference(ISMIR).2011:591-596.
[49]HUANG Y S,YANG Y H.Pop music transformer:Beat-based modeling and generation of expressive pop piano compositions[C]//Proceedings of the 28th ACM International Conference on Multimedia(ACM Multimedia).2020:1180-1188.
[50]GLOROT X,BENGIO Y.Understanding the difficulty of training deep feedforward neural networks[C]//Proceedings of the 13th International Conference on Artificial Intelligence and Statistics(AISTATS).JMLR Workshop and Conference Proceedings,2010:249-256.
[51]ZHENG K,MENG R,ZHENG C,et al.EmotionBox:A music-element-driven emotional music generation system based on music psychology[J].Frontiers in Psychology,2022,13:5189.
[52]VAN DER MAATEN L,HINTON G.Visualizing Data using t-SNE[J].Journal of Machine Learning Research,2008,9:2579-2605.
[53]KAWAI L,ESLING P,HARADA T.Attributes-Aware Deep Music Transformation[C]//Proceedings of the 21st International Society for Music Information Retrieval Conference(ISMIR).2020:670-677.
[54]DONG H W,HSIAO W Y,YANG Y H.Pypianoroll:Open source Python package for handling multitrack pianorolls[C]//Proceedings of the 19th International Society for Music Information Retrieval Conference(ISMIR).Late-breaking paper,2018.