Computer Science ›› 2025, Vol. 52 ›› Issue (12): 252-259. doi: 10.11896/jsjkx.241000105

• Artificial Intelligence •

Data and Knowledge Enhanced Medical Visual Question Answer Network

YAN Yujing1, HOU Xia1, GUO Yuting2, ZHANG Mingliang1, SONG Wenfeng1   

  1. School of Computer Science, Beijing Information Science and Technology University, Beijing 102206, China
  2. State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing 100191, China
  • Received: 2024-10-21  Revised: 2025-03-04  Online: 2025-12-15  Published: 2025-12-09
  • About author: YAN Yujing, born in 1999, postgraduate. Her main research interests include computer vision and visual question answering.
    SONG Wenfeng, born in 1987, Ph.D., associate professor, is a member of CCF (No. 71334S). Her main research interests include pattern recognition, computer vision and machine learning.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (62572062, 62525204) and the Beijing Natural Science Foundation (L232102).

Abstract: Medical visual question answering (Med-VQA) aims to answer clinical questions accurately on the basis of a given medical image, and is key to advancing intelligent clinical medicine. Despite progress in this field, two challenges remain: extracting deep multimodal information from both images and questions, and training models effectively on small-scale datasets. To address these issues, this paper proposes a Med-VQA network that combines data enhancement with knowledge enhancement. To cope with small-scale datasets, a multimodal conditional mixing module is designed to augment the input image and question data; it forms linear combinations of input sample pairs, using the question category as a constraint, so that the generated answers are more plausible. For multimodal feature extraction, an image location recognizer based on convolutional neural networks is designed; the image location features it captures are encoded into the fusion of image and question features as a form of knowledge enhancement, which enables effective feature extraction with fewer parameters. Experimental results on the SLAKE and VQA-RAD datasets demonstrate that the proposed model significantly outperforms the baseline models.
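
The conditional mixing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that the question-category constraint means each sample is only mixed with a partner from the same category, that questions are mixed at the embedding level, and that a single mixup coefficient is drawn from a Beta distribution per batch. The function name `conditional_mixup` and all parameter names are hypothetical.

```python
import numpy as np


def conditional_mixup(images, questions, answers, categories, alpha=1.0, rng=None):
    """Mix each sample with a partner drawn from the same question category.

    images:     (N, ...) float array of image tensors
    questions:  (N, T, D) float array of question embeddings
    answers:    (N, C) one-hot (or soft) answer labels
    categories: (N,) integer question-category ids

    Assumption: the category constraint restricts which pairs may be mixed.
    """
    rng = np.random.default_rng() if rng is None else rng
    categories = np.asarray(categories)
    n = len(categories)

    # Choose a partner index for every sample, restricted to its own category.
    # A sample may be paired with itself, which leaves it unchanged.
    partner = np.arange(n)
    for cat in np.unique(categories):
        idx = np.flatnonzero(categories == cat)
        partner[idx] = rng.permutation(idx)

    # One mixing coefficient per batch, as in standard mixup.
    lam = rng.beta(alpha, alpha)

    # Linear combination of images, question embeddings, and answer labels.
    mixed_images = lam * images + (1.0 - lam) * images[partner]
    mixed_questions = lam * questions + (1.0 - lam) * questions[partner]
    mixed_answers = lam * answers + (1.0 - lam) * answers[partner]
    return mixed_images, mixed_questions, mixed_answers, partner, lam
```

Because partners share a question category, a mixed label only blends answers to the same kind of question (for example, two yes/no questions) rather than answers of unrelated types.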

Key words: Visual question answering, Medical visual question answering, Medical images, Data enhancement, Computer vision

CLC Number: TP391