Computer Science, 2025, Vol. 52, Issue (12): 252-259. doi: 10.11896/jsjkx.241000105

• Artificial Intelligence •

Data and Knowledge Enhanced Medical Visual Question Answering Network

YAN Yujing1, HOU Xia1, GUO Yuting2, ZHANG Mingliang1, SONG Wenfeng1   

  1. School of Computer Science, Beijing Information Science and Technology University, Beijing 102206, China
  2. State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing 100191, China
  • Received: 2024-10-21; Revised: 2025-03-04; Published: 2025-12-15; Online: 2025-12-09
  • Corresponding author: SONG Wenfeng (songwenfenga@163.com)
  • About author: YAN Yujing, born in 1999, postgraduate (2022020605@bistu.edu.cn). Her main research interests include computer vision and visual question answering.
    SONG Wenfeng, born in 1987, Ph.D, associate professor, is a member of CCF (No. 71334S). Her main research interests include pattern recognition, computer vision and machine learning.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (62572062, 62525204) and the Beijing Natural Science Foundation (L232102).

Abstract: Medical Visual Question Answering (Med-VQA) aims to correctly answer clinical questions about a given medical image and plays a key role in advancing clinical medical intelligence. Despite progress in this field, challenges remain in the deep extraction of multimodal information from both images and questions, and in effectively training models on small-scale datasets. To address these issues, this paper proposes a Med-VQA network with dual data and knowledge enhancement. For small-scale datasets, a multimodal conditional mixing module is designed to augment the input image and question data: input sample pairs are linearly combined under the constraint of question categories, which improves the plausibility of generated answers. For multimodal feature extraction, an image location recognizer based on a convolutional neural network is designed, and the location features it captures are encoded into the fusion of image and question features for knowledge enhancement, enabling effective feature extraction with fewer parameters. Experimental results on the SLAKE and VQA-RAD datasets demonstrate that the proposed model significantly outperforms the baseline models.
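The implementation is not reproduced on this page, but the two mechanisms the abstract describes, conditional mixing for data augmentation and location-feature gating for knowledge enhancement, can be illustrated with a minimal PyTorch sketch. All names, shapes, and layer choices below (conditional_mixup, LocationConditionedFusion, the squeeze-and-excitation-style gate) are illustrative assumptions rather than the authors' actual architecture.

    import torch
    import torch.nn as nn

    def conditional_mixup(imgs, q_emb, ans, cat_ids, alpha=1.0):
        """Sketch of the multimodal conditional mixing idea: linearly combine
        (image, question-embedding, answer-target) pairs, but only when the
        paired samples share a question category, so the mixed answer target
        stays semantically plausible."""
        lam = torch.distributions.Beta(alpha, alpha).sample().item()
        perm = torch.randperm(imgs.size(0))
        same = (cat_ids == cat_ids[perm]).float()  # (B,) category-constraint mask
        # Where categories differ, fall back to lam = 1 (keep the original sample).
        lam_i = torch.where(same.bool(), torch.full_like(same, lam), torch.ones_like(same))
        mixed_imgs = lam_i.view(-1, 1, 1, 1) * imgs + (1 - lam_i).view(-1, 1, 1, 1) * imgs[perm]
        mixed_q    = lam_i.view(-1, 1) * q_emb + (1 - lam_i).view(-1, 1) * q_emb[perm]
        mixed_ans  = lam_i.view(-1, 1) * ans   + (1 - lam_i).view(-1, 1) * ans[perm]
        return mixed_imgs, mixed_q, mixed_ans

    class LocationConditionedFusion(nn.Module):
        """Hypothetical location recognizer plus fusion: a lightweight CNN
        predicts image-location features from the raw image, and those features
        gate the fused image-question representation (the squeeze-and-excitation
        style gate is an assumption, not the paper's stated fusion)."""
        def __init__(self, img_dim, q_dim, n_locations, hid=512):
            super().__init__()
            self.loc_cnn = nn.Sequential(              # small location recognizer
                nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(32, n_locations),
            )
            self.fuse = nn.Linear(img_dim + q_dim, hid)
            self.gate = nn.Sequential(nn.Linear(n_locations, hid), nn.Sigmoid())

        def forward(self, raw_img, img_feat, q_feat):
            loc = self.loc_cnn(raw_img)                # location logits as knowledge
            fused = torch.relu(self.fuse(torch.cat([img_feat, q_feat], dim=-1)))
            return fused * self.gate(loc)              # knowledge-enhanced fusion

In training, conditional_mixup would be applied to each mini-batch before the forward pass, and a classifier head over the gated representation would predict the answer.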

Key words: Visual question answering, Medical visual question answering, Medical images, Data augmentation, Computer vision

CLC Number: TP391