Computer Science ›› 2022, Vol. 49 ›› Issue (2): 107-115. doi: 10.11896/jsjkx.210600085

• Computer Vision: Theory and Application •

Text-to-Image Generation Technology Based on Transformer Cross Attention

TAN Xin-yue, HE Xiao-hai, WANG Zheng-yong, LUO Xiao-dong, QING Lin-bo   

  1. College of Electronics and Information Engineering, Sichuan University, Chengdu 610065, China
  • Received: 2021-06-08 Revised: 2021-10-20 Online: 2022-02-15 Published: 2022-02-23
  • About author: TAN Xin-yue, born in 1997, postgraduate. Her main research interests include image generation.
    HE Xiao-hai, born in 1964, Ph.D., professor, Ph.D. supervisor. His main research interests include image processing, pattern recognition and image communication.
  • Supported by:
    National Natural Science Foundation of China (61871278, U1836118), Chengdu Major Technology Application Demonstration Project (2019-YF09-00120-SN) and Sichuan Science and Technology Program (2018HH0143).

Abstract: In recent years, research on text-to-image methods based on generative adversarial networks (GANs) has continued to grow in popularity and has made some progress. The key to text-to-image generation is building a bridge between textual and visual information so that the model generates realistic images consistent with the corresponding text description. The current mainstream approach encodes the input description with a pre-trained text encoder. However, these methods do not consider semantic alignment with the corresponding image inside the text encoder: they encode the input text independently and ignore the semantic gap between the language space and the image space. To address this problem, this paper proposes a generative adversarial network based on a cross-attention encoder (CAE-GAN). The network uses a cross-attention encoder to translate and align text information with visual information and to capture the cross-modal mapping between text and image, thereby improving the fidelity of the generated images and their match with the input text description. Experimental results show that, compared with the DM-GAN model, CAE-GAN increases the inception score (IS) by 2.53% and 1.54% on the CUB and COCO datasets, respectively, and decreases the Fréchet inception distance (FID) by 15.10% and 5.54%, respectively, indicating that the images generated by CAE-GAN have finer details and higher quality.
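
As a rough illustration of the cross-attention idea mentioned in the abstract, the sketch below lets each word vector attend over a set of image-region vectors using scaled dot-product attention. It is not the CAE-GAN encoder: the function names, the toy 2-dimensional features, and the omission of learned query/key/value projections are all assumptions made for brevity.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(text_feats, image_feats):
    """Each word vector (query) attends over image-region vectors (keys and values).

    Assumption: no learned projections; text and image features share one dimension.
    Returns the image-aware word representations and the attention weights.
    """
    d = len(text_feats[0])
    attended, weights = [], []
    for q in text_feats:
        # Scaled dot-product score between this word and every image region.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in image_feats]
        w = softmax(scores)
        # Weighted sum of region vectors gives the visual context for this word.
        ctx = [sum(wj * v[i] for wj, v in zip(w, image_feats)) for i in range(d)]
        weights.append(w)
        attended.append(ctx)
    return attended, weights

# Hypothetical toy inputs: 2 word vectors and 3 image-region vectors.
words = [[1.0, 0.0], [0.0, 1.0]]
regions = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
context, attn = cross_attention(words, regions)
```

Here `attn[i][j]` is how strongly word `i` attends to region `j`. A practical encoder would add learned projections, multiple heads, and residual layers, as in Transformer-style models.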

Key words: Computer vision, Cross-attention encoding, Generative adversarial networks, Image generation, Text-to-image generation

CLC Number: 

  • TP183