Computer Science ›› 2026, Vol. 53 ›› Issue (3): 383-391. doi: 10.11896/jsjkx.260200058

• Artificial Intelligence •

Multi-task Learning-based Ophthalmic Video Feature Fusion and Multi-dimensional Profiling

DU Jiantong1, GUAN Zeli2, XUE Zhe2

  1 Department of Ophthalmology, Peking University First Hospital, Beijing 100034, China
  2 School of Computer Science, Beijing University of Posts and Telecommunications, Beijing 100876, China
  • Received: 2025-08-28 Revised: 2025-10-24 Online: 2026-03-12
  • Corresponding author: GUAN Zeli (guanzeli@bupt.edu.cn)
  • About author: DU Jiantong (dujiantong@bjmu.edu.cn), born in 1989, Ph.D. His main research interests include ophthalmology and ophthalmic research.
    GUAN Zeli, born in 1994, Ph.D., postdoctoral researcher. His main research interest is artificial intelligence.

Abstract: To address the challenges of profiling ophthalmic videos on social networks, namely the low discriminability of visual features, highly colloquial text descriptions, and multimodal semantic heterogeneity, this paper proposes OVP (Ophthalmic Video Profiling), a multi-task learning-based method for ophthalmic video feature fusion and multi-dimensional profiling. The method mines multi-dimensional features with medical semantic value from unstructured video and text streams in order to represent ophthalmic videos precisely. A pre-trained deep residual network extracts high-dimensional visual representations from video keyframes, capturing fine-grained features specific to ophthalmic imagery. To compensate for the sparsity of professional semantics in social media text, a text feature extraction method based on an ophthalmic knowledge graph retrieves and fuses external entity annotations and related knowledge, and the enriched text is then encoded with BERT to obtain text features rich in domain knowledge. On this basis, a cross-modal attention fusion mechanism dynamically computes interaction weights between visual and textual features, achieving deep alignment between visual information and medical semantics. Finally, multi-task joint optimization is used to construct the multi-dimensional ophthalmic profile: three sub-tasks (video disease classification, popularity prediction, and content quality assessment) are trained jointly, and the information shared among the tasks improves generalization. Experiments on a real-world ophthalmic video dataset show that OVP significantly outperforms existing baseline methods in disease classification accuracy, popularity prediction, and quality assessment, which validates its effectiveness for feature fusion and multi-dimensional profiling of complex ophthalmic videos.

Key words: Ophthalmic video profiling, Multi-task learning, Multi-modal fusion, Knowledge graph, Deep learning
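The abstract describes the fusion and joint-training steps only at a high level. The following is a minimal pure-Python sketch of how a cross-modal attention step and a three-task joint loss could look. It is not the authors' implementation: the scaled dot-product scoring, the concatenation of text and attended visual features, the squared-error losses for popularity and quality, the equal task weights, and all function names are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_modal_attention(text_vec, frame_vecs):
    """Let one text feature attend over keyframe features.

    Assumption: scaled dot-product scoring; the fused vector is the text
    feature concatenated with the attention-weighted visual feature.
    """
    scale = math.sqrt(len(text_vec))
    weights = softmax([dot(text_vec, f) / scale for f in frame_vecs])
    attended = [
        sum(w * f[i] for w, f in zip(weights, frame_vecs))
        for i in range(len(text_vec))
    ]
    return text_vec + attended, weights


def multitask_loss(disease_probs, disease_label,
                   popularity_pred, popularity_true,
                   quality_pred, quality_true,
                   task_weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three sub-task losses.

    Assumption: cross-entropy for disease classification and squared
    error for popularity and quality; the weights are hypothetical.
    """
    loss_cls = -math.log(disease_probs[disease_label])
    loss_pop = (popularity_pred - popularity_true) ** 2
    loss_quality = (quality_pred - quality_true) ** 2
    w_cls, w_pop, w_quality = task_weights
    return w_cls * loss_cls + w_pop * loss_pop + w_quality * loss_quality


# Toy usage: a 2-d text feature and two 2-d keyframe features.
fused, weights = cross_modal_attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
loss = multitask_loss([0.5, 0.5], 0, 1.0, 1.0, 2.0, 2.0)
```

In this toy call the first keyframe is more similar to the text feature, so it receives the larger attention weight. In a real system the features would come from the residual network and BERT encoders, and the task weights would need tuning.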

CLC Number:

  • TP391