Multi-task Learning-based Ophthalmic Video Feature Fusion and Multi-dimensional Profiling

doi:10.11896/jsjkx.260200058

Abstract

Abstract: To address challenges in profiling ophthalmic videos on social networks, such as the low discriminability of visual features, the colloquial nature of text descriptions, and multimodal semantic heterogeneity, this paper proposes OVP (Ophthalmic Video Profiling), a method based on multi-task learning. The method mines multi-dimensional features with medical semantic value from unstructured video and text streams to support precise video representation. In the OVP framework, a pre-trained deep residual network extracts high-dimensional visual representations from keyframes, capturing fine-grained features specific to ophthalmic imagery. To overcome the sparsity of professional semantics in social media text, a knowledge-graph-based method for extracting textual features from ophthalmic videos is proposed, which retrieves and fuses external entity annotations and related knowledge from an ophthalmic knowledge graph before encoding with BERT. A cross-modal attention fusion mechanism then dynamically computes interaction weights between visual and textual features, achieving deep alignment between visual information and medical semantics. Furthermore, a multi-task joint optimization module for multidimensional ophthalmic profiling jointly trains three sub-tasks (disease classification, popularity prediction, and content quality assessment), using shared information to improve model generalization. Experiments on a real ophthalmic video dataset show that OVP significantly outperforms existing baseline methods in disease classification accuracy, popularity prediction, and quality assessment for ophthalmic videos. These results validate the effectiveness of OVP for feature fusion and multidimensional profiling of complex ophthalmic videos.
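The fusion and joint-training steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the network dimensions, the exact attention formulation, or the loss weighting, so the scaled dot-product attention, the concatenation step, the weighted-sum loss, and all names (`cross_modal_attention`, `joint_loss`, the toy vectors) are assumptions chosen for illustration. In a real system the inputs would be residual-network keyframe features and a BERT text embedding rather than hand-written lists.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_modal_attention(text_vec, frame_vecs):
    """Let a text vector attend over keyframe vectors.

    Assumed formulation: scaled dot-product attention with the text
    vector as query and the keyframe vectors as keys and values.
    Returns the fused vector (text concatenated with the attended
    visual summary) and the per-frame attention weights.
    """
    scale = math.sqrt(len(text_vec))
    weights = softmax([dot(text_vec, f) / scale for f in frame_vecs])
    attended = [
        sum(w * f[i] for w, f in zip(weights, frame_vecs))
        for i in range(len(text_vec))
    ]
    fused = text_vec + attended  # list concatenation, not element-wise addition
    return fused, weights


def joint_loss(task_losses, task_weights):
    """Weighted sum of per-task losses (assumed joint objective)."""
    return sum(task_weights[name] * loss for name, loss in task_losses.items())


# Toy inputs: one 2-d text vector and two 2-d keyframe vectors.
text_vec = [1.0, 0.0]
frame_vecs = [[1.0, 0.0], [0.0, 1.0]]
fused, weights = cross_modal_attention(text_vec, frame_vecs)

# Hypothetical per-task losses for the three sub-tasks named in the abstract.
total = joint_loss(
    {"disease": 0.5, "popularity": 0.2, "quality": 0.3},
    {"disease": 1.0, "popularity": 1.0, "quality": 1.0},
)
```

In this toy case the first keyframe is more similar to the text vector, so it receives the larger attention weight. Equal task weights are used only for simplicity; how the three losses are actually balanced is not stated in the abstract.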

Key words: Ophthalmic video profiling, Multi-task learning, Multi-modal fusion, Knowledge graph, Deep learning

CLC Number: TP391

DU Jiantong, GUAN Zeli, XUE Zhe. Multi-task Learning-based Ophthalmic Video Feature Fusion and Multi-dimensional Profiling[J].Computer Science, 2026, 53(3): 383-391.
