Computer Science, 2025, Vol. 52, Issue 12: 252-259. doi: 10.11896/jsjkx.241000105
YAN Yujing1, HOU Xia1, GUO Yuting2, ZHANG Mingliang1, SONG Wenfeng1
Abstract: Medical Visual Question Answering (Med-VQA) aims to correctly answer clinical questions about a given medical image, and plays a vital role in intelligent clinical medicine. Although research in this field has made progress, challenges remain in deeply extracting multimodal information from text and image inputs, and in effectively training models on small-scale datasets. To address these issues, a medical visual question answering network with dual data and knowledge enhancement is proposed. For small-scale datasets, a multimodal conditional mixup module is designed to augment the input images and text: using the question category as a constraint, pairs of input samples are linearly combined, which improves the plausibility of the generated answers. For multimodal feature extraction, a convolutional neural network-based image position recognizer is designed; the position features it captures are encoded into the fusion of image and question features for knowledge enhancement, enabling effective feature extraction with relatively few parameters. Experimental results on the SLAKE and VQA-RAD datasets show that the proposed model clearly outperforms the baseline models.
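The category-constrained mixup described above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the function names, the Beta(α, α) sampling of the mixing coefficient (standard in mixup), and the soft answer targets are assumptions; the abstract specifies only that sample pairs are linearly combined under a question-category constraint.

```python
import numpy as np

def conditional_mixup(img_a, img_b, q_a, q_b, ans_a, ans_b,
                      cat_a, cat_b, alpha=0.5, rng=None):
    """Linearly combine two (image, question, answer) samples, but only
    when both share the same question category, so the blended answer
    target remains clinically plausible. Returns the first sample
    unchanged when the categories differ."""
    if cat_a != cat_b:
        # Constraint: never mix across question categories.
        return img_a, q_a, ans_a
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)          # mixing coefficient in (0, 1)
    img = lam * img_a + (1 - lam) * img_b  # blended image
    q   = lam * q_a   + (1 - lam) * q_b    # blended question embedding
    ans = lam * ans_a + (1 - lam) * ans_b  # soft (one-hot) answer target
    return img, q, ans
```

In training, such a module would typically be applied per mini-batch, pairing each sample with another of the same question category drawn at random.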