Computer Science ›› 2023, Vol. 50 ›› Issue (2): 123-129. DOI: 10.11896/jsjkx.211200303

• Database & Big Data & Data Science •

Visual Question Answering Model Based on Multi-modal Deep Feature Fusion

ZOU Yunzhu1, DU Shengdong1,2, TENG Fei1, LI Tianrui1,2   

  1 Institute of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu 611756, China
    2 National Engineering Laboratory of Integrated Transportation Big Data Application Technology, Chengdu 611756, China
  • Received: 2021-12-28; Revised: 2022-06-26; Online: 2023-02-15; Published: 2023-02-22
  • Supported by:
    National Science and Technology Major Project of the Ministry of Science and Technology of China (2020AAA0105101)

Abstract: In the era of big data, with the explosive growth of multi-source heterogeneous data, multi-modal data fusion has attracted much attention from researchers, and visual question answering (VQA) has become a hot topic in multi-modal data fusion because it must jointly process images and text. The VQA task centers on the deep fusion, association, and representation of image and text features, followed by inference over the fused features to produce an answer. Traditional VQA models tend to miss key information: most of them focus on learning shallow associations between modal features and pay little attention to deep semantic feature fusion. To address these problems, this paper proposes a visual question answering model based on cross-modal deep interaction of image and text features. The proposed method uses a convolutional neural network and an LSTM network to extract the features of the image and text modalities respectively, and builds a novel deep attention learning network from combinations of meta-attention units to realize interactive learning of attention features within and between the image and text modalities. Finally, the learned features are aggregated to output the answer. The model is evaluated on the VQA-v2.0 dataset. Experimental results show that, compared with traditional baseline models, the performance of the proposed model improves significantly.
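To make the described pipeline concrete, the following is a minimal PyTorch sketch, not the authors' released code: an LSTM question encoder, CNN-style image region features, and stacked meta-attention units, modelled here as intra-modal self-attention plus cross-modal guided attention in the spirit of the co-attention literature. All class names, depths, dimensions, and the simple mean-pool fusion are illustrative assumptions.

import torch
import torch.nn as nn


class MetaAttentionUnit(nn.Module):
    """Multi-head attention + feed-forward block with residual connections
    and layer norm. With kv=None it attends within a modality (self-attention);
    with kv from the other modality it acts as guided attention."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x, kv=None):
        kv = x if kv is None else kv
        x = self.norm1(x + self.attn(x, kv, kv, need_weights=False)[0])
        return self.norm2(x + self.ffn(x))


class VQASketch(nn.Module):
    """Illustrative VQA pipeline: encode each modality, stack attention
    units within and across modalities, fuse, and classify answers."""

    def __init__(self, vocab=10000, dim=512, img_dim=2048,
                 depth=6, n_answers=3129):
        super().__init__()
        self.embed = nn.Embedding(vocab, 300)
        self.lstm = nn.LSTM(300, dim, batch_first=True)  # question encoder
        self.img_proj = nn.Linear(img_dim, dim)          # project CNN region features
        self.txt_sa = nn.ModuleList(MetaAttentionUnit(dim) for _ in range(depth))
        self.img_sa = nn.ModuleList(MetaAttentionUnit(dim) for _ in range(depth))
        self.img_ga = nn.ModuleList(MetaAttentionUnit(dim) for _ in range(depth))
        self.classifier = nn.Linear(2 * dim, n_answers)

    def forward(self, question_ids, region_feats):
        q, _ = self.lstm(self.embed(question_ids))  # (B, T, dim) text features
        v = self.img_proj(region_feats)             # (B, R, dim) image features
        for t_sa, v_sa, v_ga in zip(self.txt_sa, self.img_sa, self.img_ga):
            q = t_sa(q)              # attention within the text modality
            v = v_ga(v_sa(v), kv=q)  # within the image modality, then guided by text
        fused = torch.cat([q.mean(1), v.mean(1)], dim=-1)  # naive fusion
        return self.classifier(fused)                      # answer logits


# Smoke test with random tensors standing in for a real batch.
model = VQASketch()
logits = model(torch.randint(0, 10000, (2, 14)), torch.randn(2, 36, 2048))
print(logits.shape)  # torch.Size([2, 3129])

The paper's fusion and output stages are more elaborate than the mean pooling and single linear layer used above; the sketch only fixes the overall data flow implied by the abstract.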

Key words: Visual question answering, Multi-modal feature fusion, Attention mechanism, Deep learning, Data fusion

CLC Number: TP391.41