Computer Science ›› 2024, Vol. 51 ›› Issue (8): 281-296. doi: 10.11896/jsjkx.230500124

• Artificial Intelligence •


Semi-supervised Emotional Music Generation Method Based on Improved Gaussian Mixture Variational Autoencoders

XU Bei1,2, LIU Tong1   

  1 School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
    2 Jiangsu Key Laboratory of Big Data Security & Intelligent Processing, Nanjing 210023, China
  • Received: 2023-05-21 Revised: 2023-11-14 Online: 2024-08-15 Published: 2024-08-13
  • Corresponding author: LIU Tong (liutong9986@163.com)
  • About author: XU Bei (xubei@njupt.edu.cn), born in 1986, Ph.D, associate professor, is a member of CCF (No.P1014M). His main research interests include affective computing and natural language processing.
    LIU Tong, born in 1997, postgraduate. His main research interests include affective computing and music generation.
  • Supported by:
    Natural Science Foundation of the Jiangsu Higher Education Institutions of China (21KJB520017).



Abstract: Music conveys sound content and emotion through serialized audio information. Emotion is an important component of the semantics that music expresses, so music generation technology should not only consider the structural information of music but also incorporate emotional elements. Most existing emotional music generation techniques adopt fully supervised methods based on emotion labels. However, the music field lacks large standard emotion-annotated datasets, and emotion labels alone are insufficient to express the emotional characteristics of music. To address these problems, this paper proposes a semi-supervised emotional music generation method (Semg-GMVAE) based on an improved Gaussian mixture variational autoencoder (GMVAE), which links the rhythm and mode features of music to emotion, introduces a feature disentanglement mechanism into GMVAE to learn separate latent variable representations of these two features, and performs semi-supervised clustering inference on them. Finally, by manipulating the feature representations of music, the model achieves music generation and emotion switching for happy, tense, sad, and calm emotions. Meanwhile, this paper investigates why GMVAE has difficulty distinguishing data of different emotion categories and identifies the key reason: the variance regularization term and the mutual information suppression term in the evidence lower bound of GMVAE leave the Gaussian components of the categories insufficiently dispersed, which degrades the learned representations and the emotional quality of the generated samples. Semg-GMVAE therefore penalizes and enhances these two factors respectively, and uses Transformer-XL as the encoder and decoder to strengthen modeling of long music sequences. Experimental results on real datasets show that, compared with existing methods, Semg-GMVAE better separates music of different emotions in the latent space, strengthens the association between music and emotion, effectively disentangles different musical features, and thus better achieves emotional music generation and emotion switching by changing feature representations.
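
To make the modification above concrete, what follows is an illustrative formalization reconstructed from the abstract alone; the weighting factors \gamma and \beta are hypothetical names introduced here, not notation from the paper. For a GMVAE with a categorical cluster variable y (the emotion category) and a continuous latent variable z, the evidence lower bound is commonly written as

    \mathcal{L} = \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - \mathbb{E}_{q_\phi(y \mid x)}[\mathrm{KL}(q_\phi(z \mid x) \,\|\, p_\theta(z \mid y))] - \mathrm{KL}(q_\phi(y \mid x) \,\|\, p(y))

where the expanded Gaussian KL in the middle term carries the variance regularization, and the final term upper-bounds, and therefore suppresses, the mutual information between x and y. Reweighting the two factors in the spirit of the abstract gives

    \mathcal{L}' = \mathbb{E}_{q_\phi(z \mid x)}[\log p_\theta(x \mid z)] - \gamma \, \mathbb{E}_{q_\phi(y \mid x)}[\mathrm{KL}(q_\phi(z \mid x) \,\|\, p_\theta(z \mid y))] - \beta \, \mathrm{KL}(q_\phi(y \mid x) \,\|\, p(y))

A minimal PyTorch sketch of such a reweighted objective, assuming diagonal Gaussian components and a uniform prior over K emotion clusters, is given below; all names, shapes, and default weights are illustrative assumptions, not the paper's implementation.

import math
import torch
import torch.nn.functional as F

def reweighted_gmvae_loss(x, x_recon, mu_q, logvar_q, q_y, mu_c, logvar_c,
                          gamma=2.0, beta=0.5):
    """Sketch of a reweighted GMVAE ELBO (negated, for minimization).
    mu_q, logvar_q: (B, D) parameters of q(z|x); q_y: (B, K) posterior q(y|x);
    mu_c, logvar_c: (K, D) parameters of the Gaussian components p(z|y)."""
    # Reconstruction term, standing in for -E[log p(x|z)] up to a constant.
    recon = F.mse_loss(x_recon, x, reduction="sum")
    # KL(q(z|x) || p(z|y)) for every component, expanded for diagonal Gaussians;
    # this is the term whose variance part the abstract says is penalized.
    var_q, var_c = logvar_q.exp(), logvar_c.exp()
    kl_z = 0.5 * (logvar_c.unsqueeze(0) - logvar_q.unsqueeze(1)
                  + (var_q.unsqueeze(1) + (mu_q.unsqueeze(1) - mu_c) ** 2)
                  / var_c.unsqueeze(0)
                  - 1.0).sum(-1)                      # shape (B, K)
    kl_z = (q_y * kl_z).sum(-1).sum()                 # expectation over q(y|x)
    # KL(q(y|x) || Uniform(K)): the mutual-information suppression term.
    K = q_y.size(-1)
    kl_y = (q_y * (q_y.clamp_min(1e-8).log() + math.log(K))).sum()
    # gamma and beta rebalance the two factors; gamma > 1 with beta < 1 is
    # one plausible setting that lets the emotion clusters disperse.
    return recon + gamma * kl_z + beta * kl_y

The abstract does not pin down the exact direction or magnitude of each reweighting; the sketch should be read as one plausible instantiation whose stated intent is to keep the per-emotion Gaussian components sufficiently dispersed in the latent space.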

Key words: Emotional music generation, Semi-supervised generative models, Disentangled representation learning, Gaussian mixture variational autoencoders, Transformer-XL

CLC Number: TP181