Computer Science ›› 2022, Vol. 49 ›› Issue (2): 107-115. doi: 10.11896/jsjkx.210600085

• Computer Vision: Theory and Application •


Text-to-Image Generation Technology Based on Transformer Cross Attention

TAN Xin-yue, HE Xiao-hai, WANG Zheng-yong, LUO Xiao-dong, QING Lin-bo   

  1. College of Electronics and Information Engineering,Sichuan University,Chengdu 610065,China
  • Received:2021-06-08 Revised:2021-10-20 Online:2022-02-15 Published:2022-02-23
  • About author:TAN Xin-yue,born in 1997,postgraduate (2019222055241@stu.scu.edu.cn).Her main research interests include image generation.
    HE Xiao-hai,born in 1964,Ph.D,professor,Ph.D supervisor,is the corresponding author (hxh@scu.edu.cn).His main research interests include image processing,pattern recognition and image communication.
  • Supported by:
    National Natural Science Foundation of China(61871278,U1836118),Chengdu Major Technology Application Demonstration Project(2019-YF09-00120-SN) and Sichuan Science and Technology Program(2018HH0143).


Abstract: In recent years,research on text-to-image generation methods based on generative adversarial networks (GAN) has continued to grow in popularity and has made some progress.The key to text-to-image generation technology is to build a bridge between textual information and visual information,prompting the model to generate realistic images consistent with the corresponding text description.At present,the mainstream approach is to encode the input text description with a pre-trained text encoder,but these methods do not consider semantic alignment with the corresponding image inside the text encoder:they encode the input text independently,ignoring the semantic gap between the language space and the image space.To address this problem,this paper proposes a generative adversarial network based on a cross-attention encoder (CAE-GAN).The network uses the cross-attention encoder to translate and align text information with visual information,capturing the cross-modal mapping between text and image information,so as to improve the fidelity of the generated images and their match with the input text description.Experimental results show that,compared with the DM-GAN model,the inception score (IS) of the CAE-GAN model increases by 2.53% and 1.54% on the CUB and COCO datasets,respectively,and the Fréchet inception distance (FID) decreases by 15.10% and 5.54%,respectively,indicating that the images generated by the CAE-GAN model have more complete details and higher quality.
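
To make the encoder idea above concrete, the following is a minimal, hypothetical PyTorch sketch of a cross-attention encoder layer: word-level text features act as queries over image-region features, so the text encoding is aligned with the paired image rather than computed independently. It illustrates the general technique only, not the exact CAE-GAN architecture; all names and dimensions (text_dim, image_dim, the region count, and so on) are illustrative assumptions.

```python
# Illustrative sketch only (assumed shapes and names); not the exact CAE-GAN
# encoder from the paper. Word features query image-region features so that
# the text representation is semantically aligned with the paired image.
import torch
import torch.nn as nn

class CrossAttentionEncoderLayer(nn.Module):
    def __init__(self, text_dim=256, image_dim=512, num_heads=4):
        super().__init__()
        # Project image-region features into the text feature space.
        self.img_proj = nn.Linear(image_dim, text_dim)
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads,
                                                batch_first=True)
        self.norm1 = nn.LayerNorm(text_dim)
        self.norm2 = nn.LayerNorm(text_dim)
        self.ffn = nn.Sequential(
            nn.Linear(text_dim, 4 * text_dim),
            nn.GELU(),
            nn.Linear(4 * text_dim, text_dim),
        )

    def forward(self, words, regions):
        # words:   (batch, num_words, text_dim)     word-level text features
        # regions: (batch, num_regions, image_dim)  image-region features
        kv = self.img_proj(regions)
        # Text queries attend over image keys/values: cross-modal alignment.
        attended, _ = self.cross_attn(query=words, key=kv, value=kv)
        x = self.norm1(words + attended)        # residual connection + norm
        return self.norm2(x + self.ffn(x))      # position-wise feed-forward

# Usage sketch: align a 16-word caption with 36 image regions.
layer = CrossAttentionEncoderLayer()
words, regions = torch.randn(2, 16, 256), torch.randn(2, 36, 512)
aligned = layer(words, regions)  # -> (2, 16, 256)
```

For reference, the two metrics quoted above are the standard ones from [36] and [37]: IS rewards generated images whose Inception-v3 class posterior $p(y\mid x)$ is confident for each image yet diverse across images, while FID measures the distance between Gaussian fits to Inception features of real ($r$) and generated ($g$) images; higher IS and lower FID are better.

$$\mathrm{IS} = \exp\Big(\mathbb{E}_{x}\,D_{\mathrm{KL}}\big(p(y\mid x)\,\Vert\,p(y)\big)\Big), \qquad \mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\Big(\Sigma_r + \Sigma_g - 2\big(\Sigma_r \Sigma_g\big)^{1/2}\Big)$$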

Key words: Computer vision, Cross-attention encoding, Generative adversarial networks, Image generation, Text-to-image generation

CLC Number: TP183
[1]VINYALS O,TOSHEV A,BENGIO S,et al.Show and tell:A neural image caption generator[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2015:3156-3164.
[2]KARPATHY A,LI F F.Deep visual-semantic alignments for generating image descriptions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2015:3128-3137.
[3]ANTOL S,AGRAWAL A,LU J,et al.Vqa:visual question answering[C]//Proceedings of the International Conference on Computer Vision.2015:2425-2433.
[5]JOHNSON J,HARIHARAN B,MAATEN L V D,et al.Clevr:A diagnostic dataset for compositional language and elementary visual reasoning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2017:2901-2910.
[6]XU K,BA J,KIROS R,et al.Show,attend and tell:Neural image caption generation with visual attention[C]//Proceedings of the 32nd International Conference on International Conference on Machine Learning.Lille,France,2015:2048-2057.
[7]WEI Y,ZHAO Y,LU C,et al.Cross-modal retrieval with CNN visual features:A new baseline[J].IEEE Transactions on Cybernetics,2016,47(2):449-460.
[8]BI J Q,LIU M F,HU H J,et al.Image captioning based on dependency syntax[J].Journal of Beijing University of Aeronautics and Astronautics,2021,47(3):431-440.
[9]CHEN M J,LIN G J,HAN Q,et al.Asymmetric Patches Nonlocal Total Variation Model for Image Recovery[J].Journal of Chongqing University of Technology(Natural Science),2020,34(2):127-132,202.
[10]XU F,MA X P,LIU L B.Cross-modal retrieval method for thyroid ultrasound image and text based on generative adversarial network[J].Journal of Biomedical Engineering,2020,37(4):641-651.
[11]REED S,AKATA Z,MOHAN S,et al.Learning what and where to draw[OL].https://arxiv.org/pdf/1610.02454.pdf.
[12]ZHANG H,XU T,LI H S,et al.StackGAN:Text to photo-realistic image synthesis with stacked generative adversarial networks[C]//Proceedings of the 2017 IEEE International Conference on Computer Vision.Venice,Italy,2017:5907-5915.
[13]XU T,ZHANG P,HUANG Q,et al.AttnGAN:Fine-Grained text to image generation with attentional generative adversarial networks[C]//Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition.Salt Lake City,USA,2018:1316-1324.
[14]SUN Y,LI L Y,YE Z H,et al.Text-to-image synthesis method based on multi-level structure generative adversarial networks[J].Journal of Computer Applications,2019,39(11):3204-3209.
[15]XU Y N,HE X H,ZHANG J,et al.Text-to-image synthesis method based on multi-level progressive resolution generative adversarial networks[J].Journal of Computer Applications,2020,40(12):3612-3617.
[16]MO J W,XU K L,LIN L P,et al.Text-to-image generation combined with mutual information maximization[J].Journal of Xidian University,2019,46(5):180-188.
[17]HUANG Y W,ZHOU B,TANG X.Text Image Generation Method with Scene Description[J].Laser & Optoelectronics Progress,2021,58(4):190-198.
[18]WAH C,BRANSON S,WELINDER P,et al.The Caltech-UCSD Birds 200-2011 Dataset[R].Technical Report CNS-TR-2011-001,California Institute of Technology,2011.
[19]LIN T Y,MAIRE M,BELONGIE S,et al.Microsoft COCO:Common objects in context[C]//ECCV.2014:740-755.
[20]GOODFELLOW I,POUGET-ABADIE J,MIRZA M,et al.Generative adversarial networks[C]//Proceedings of the 27th International Conference on Neural Information Processing Systems.Montreal,Canada,2014:2672-2680.
[21]MIRZA M,OSINDERO S.Conditional generative adversarial nets[J].arXiv:1411.1784,2014.
[22]NILSBACK M E,ZISSERMAN A.Automated flower classification over a large number of classes[C]//Proceedings of the 2008 Sixth Indian Conference on Computer Vision,Graphics & Image Processing.Bhubaneshwar,India,2008:722-729.
[23]REED S,AKATA Z,YAN X,et al.Generative adversarial text-to-image synthesis[C]//ICML.2016.
[24]ZHANG H,XU T,LI H S,et al.StackGAN++:Realistic image synthesis with stacked generative adversarial networks[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2017,41(8):1947-1962.
[25]ZHU M F,PAN P B,CHEN W,et al.DM-GAN:Dynamic memory generative adversarial networks for text-to-image synthesis[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.2019:5802-5810.
[26]HUANG H Y,GU Z F.A generative adversarial network based on self-attention mechanism for text-to-image generation[J].Journal of Chongqing University,2020,43(3):55-61.
[27]JU S B,XU J,LI Y F.Text-to-single image method based on self attention[OL].http://kns.cnki.net/kcms/detail/11.2127.TP.20210223.1347.018.html.
[28]VASWANI A,SHAZEER N,PARMAR N,et al.Attention is all you need[J].arXiv:1706.03762,2017.
[29]LI G,DUAN N,FANG Y J,et al.Unicoder-vl:A universal encoder for vision and language by cross-modal pre-training[C]//Proceedings of the AAAI Conference on Artificial Intelligence.2020:11336-11344.
[30]WANG Z H,LIU X H,LI H S,et al.Camp:Cross-modal adaptive message passing for text-image retrieval[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision.2019:5764-5773.
[31]LI L H,YATSKAR M,YIN D,et al.Visualbert:A simple and performant baseline for vision and language[J].arXiv:1908.03557,2019.
[32]LU J,BATRA D,PARIKH D,et al.Vilbert:Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks[J].arXiv:1908.02265,2019.
[33]TAN H,BANSAL M.Lxmert:Learning cross-modality encoder representations from transformers[J].arXiv:1908.07490,2019.
[34]SCHUSTER M,PALIWAL K K.Bidirectional recurrent neural networks[J].IEEE Trans on Signal Processing,1997,45(11):2673-2681.
[35]SZEGEDY C,VANHOUCKE V,IOFFE S,et al.Rethinking the inception architecture for computer vision[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2016:2818-2826.
[36]SALIMANS T,GOODFELLOW I J,ZAREMBA W,et al.Improved techniques for training GANs[C]//NIPS.2016.
[37]HEUSEL M,RAMSAUER H,UNTERTHINER T,et al.GANs trained by a two time-scale update rule converge to a local Nash equilibrium[C]//NIPS.2017:6626-6637.
[38]GOU Y C,WU Q C,LI M H,et al.SegAttnGAN:Text to Image Generation with Segmentation Attention[J].arXiv:2005.12444,2020.
[39]LI W,ZHANG P,ZHANG L,et al.Object-driven text-to-image synthesis via adversarial training[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition(CVPR).2019:12174-12182.
[40]HINZ T,HEINRICH S,WERMTER S.Semantic object accuracy for generative text-to-image synthesis[J].arXiv:1910.13321,2020.