Computer Science ›› 2025, Vol. 52 ›› Issue (6A): 240700182-10. doi: 10.11896/jsjkx.240700182

• Large Language Model Technology and Applications •


Hallucinations Proactive Relief in Diabetes Q&A LLM

ZHANG Le1, CHE Chao1,2, LIANG Yan3   

  1.Key Laboratory of Advanced Design and Intelligent Computing(Dalian University),Ministry of Education,Dalian,Liaoning 116622,China
    2.School of Software Engineering,Dalian University,Dalian,Liaoning 116622,China
    3.College of Mechanical and Electronic Engineering,Shanghai Jianqiao University,Shanghai 201306,China
  • Online:2025-06-16 Published:2025-06-12
  • Corresponding author:LIANG Yan(liangy@gench.edu.cn)
  • About author:ZHANG Le(zhangle20000111@hotmail.com),born in 2000,postgraduate,is a member of CCF(No.T9208G).His main research interests include large language models and natural language processing.
    LIANG Yan,born in 1982,master.Her main research interests include digital signal processing.
  • Supported by:
    National Natural Science Foundation of China (62076045), Liaoning Provincial Department of Education Service Local Program (LJKFZ20220290) and Dalian University Interdisciplinary Program (DLUXK-2023-YB-003).


Abstract: The treatment of diabetes is a long-term and highly personalized endeavor that imposes a significant burden on patients' daily lives. Consulting a medical large language model (LLM) about diabetes can effectively reduce this healthcare burden, but LLMs are especially prone to hallucinations, i.e., outputs that are incorrect, meaningless, or mismatched with the input, when processing text in specialized domains such as medicine. Moreover, existing hallucination mitigation techniques achieve unsatisfactory accuracy in the medical field, which greatly limits the reliability of medical LLMs. To address this problem, this paper proposes a hallucination self-inspection and proactive relief method that combines instruction fine-tuning with retrieval-augmented generation: additional knowledge is formed for the user's question before the generation process, and after the generation process a similarity comparison against that knowledge determines whether a hallucination has occurred. Experiments are conducted on several medical datasets. The method achieves an F1 score of 0.79, a BLEU-4 score of 2.38, and a Rouge-L score of 9.26 on a large-scale diabetes multi-round conversation dataset, outperforming existing hallucination relief techniques for LLMs in both accuracy and generation efficiency.

Key words: Large language model, Retrieval augmented generation, Hallucination relief, Diabetes, Question and answer system
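The overall flow described in the abstract can be pictured with a short sketch. The Python code below is a minimal illustration, not the authors' implementation: retrieve_knowledge, generate_answer, the similarity threshold, and the retry count are hypothetical placeholders standing in for the paper's medical knowledge retrieval, the instruction-fine-tuned LLM, and tuned hyper-parameters; only the control flow (retrieve knowledge before generation, run a similarity self-check after generation, regenerate on a suspected hallucination) follows the method outlined above.

# Minimal sketch of the hallucination self-inspection loop (hypothetical names).
from collections import Counter
from math import sqrt
from typing import List


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between two texts over simple term-frequency vectors."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in set(va) & set(vb))
    norm_a = sqrt(sum(v * v for v in va.values()))
    norm_b = sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_knowledge(question: str, top_k: int = 3) -> List[str]:
    """Placeholder: return top_k passages from a diabetes knowledge base (RAG index)."""
    raise NotImplementedError("plug in the retrieval component here")


def generate_answer(question: str, knowledge: List[str]) -> str:
    """Placeholder: call the instruction-fine-tuned medical LLM with the retrieved knowledge."""
    raise NotImplementedError("plug in the fine-tuned model here")


def answer_with_self_inspection(question: str, threshold: float = 0.3, max_retries: int = 2) -> str:
    # Step 1: form additional knowledge for the question before generation.
    knowledge = retrieve_knowledge(question)
    answer = generate_answer(question, knowledge)
    # Step 2: after generation, compare the answer with the retrieved knowledge;
    # a low similarity score is treated as a suspected hallucination.
    for _ in range(max_retries):
        support = max(cosine_similarity(answer, passage) for passage in knowledge)
        if support >= threshold:
            return answer
        answer = generate_answer(question, knowledge)  # regenerate and re-check
    return answer

In the paper's setting the similarity check and retrieval would rely on the medical knowledge base and learned representations rather than term-frequency vectors; the threshold and retry count here are illustrative only.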

CLC Number: F416
[1]ZENG A,LIU X,DU Z,et al.GLM-130B:An Open Bilingual Pre-Trained Model [C]//The Eleventh International Conference on Learning Representations,ICLR 2023,Kigali,Rwanda,May 1-5,2023.OpenReview.net,2023.
[2]SUN Y,WANG S,FENG S,et al.Ernie 3.0:Large-scale knowledge enhanced pre-training for language understanding and generation [J].arXiv:2107.02137,2021.
[3]BAI J,BAI S,CHU Y,et al.Qwen technical report [J].arXiv:2309.16609,2023.
[4]TOUVRON H,MARTIN L,STONE K,et al.Llama 2:Open Foundation and Fine-Tuned Chat Models [J].arXiv:2307.09288,2023.
[5]VARSHNEY N,YAO W,ZHANG H,et al.A Stitch in Time Saves Nine:Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation [J].arXiv:2307.03987,2023.
[6]LI Y,LI Z,ZHANG K,et al.ChatDoctor:A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI(LLaMA) Using Medical Domain Knowledge [J].Cureus,2023,15(6):1-12.
[7]WANG H,LIU C,XI N,et al.Huatuo:Tuning LLaMA Model with Chinese Medical Knowledge [J].arXiv:2304.06975,2023.
[8]LIAO Y,MENG Y,LIU H,et al.MING:Chinese Medical Consultation Large Model [EB/OL].(2023-01-01) [2024-07-24].https://github.com/MediaBrain-SJTU/MING.
[9]WANG H,ZHAO S,QIANG Z,et al.Knowledge-tuning Large Language Models with Structured Medical Knowledge Bases for Reliable Response Generation in Chinese [J].arXiv:2309.04175,2023.
[10]LEE N,PING W,XU P,et al.Factuality enhanced language models for open-ended text generation [J].Advances in Neural Information Processing Systems,2022,35:34586-34599.
[11]RASHKIN H,REITTER D,TOMAR G S,et al.Increasing Faithfulness in Knowledge-Grounded Dialogue with Controllable Features [C]//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing(Volume 1:Long Papers).Association for Computational Linguistics,2021:704-718.
[12]LI Y,YAO K,QIN L,et al.Slot-consistent NLG for task-oriented dialogue systems with iterative rectification network [C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.2020.
[13]CHEN S,ZHANG F,SONE K,et al.Improving Faithfulness in Abstractive Summarization with Contrast Candidate Generation and Selection [C/OL]//Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies.Association for Computational Linguistics,2021.https://aclanthology.org/2021.naacl-main.475.
[14]PENG B,GALLEY M,HE P,et al.Check your facts and try again:Improving large language models with external knowledge and automated feedback [J].arXiv:2302.12813,2023.
[15]MANAKUL P,LIUSIE A,GALES M.SelfCheckGPT:Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models [C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing.Singapore,2023.Association for Computational Linguistics.
[16]AZARIA A,AZOULAY R,RECHES S.ChatGPT is a remarkable tool-For experts [J].Data Intelligence,2023,5(4):1-49.
[17]LIU X,JI K,FU Y,et al.P-tuning v2:Prompt tuning can be comparable to fine-tuning universally across scales and tasks [J].arXiv:2110.07602,2021.
[18]TAORI R,GULRAJANI I,ZHANG T,et al.Alpaca:A strong,replicable instruction-following model [J].Stanford Center for Research on Foundation Models,2023,3(6):7.
[19]MIKOLOV T,CHEN K,CORRADO G,et al.Efficient Estimation of Word Representations in Vector Space [C]//1st International Conference on Learning Representations(ICLR 2013).Scottsdale,Arizona,USA,May 2-4,2013,Workshop Track Proceedings.Y.Bengio and Y.LeCun(eds.),2013.
[20]DU Z,QIAN Y,LIU X,et al.GLM:General Language Model Pretraining with Autoregressive Blank Infilling [C]//Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics(Volume 1:Long Papers).Dublin,Ireland,2022.Association for Computational Linguistics.
[21]VASWANI A,SHAZEER N,PARMAR N,et al.Attention is all you need [J].Advances in Neural Information Processing Systems,2017,30.
[22]WANG H,MA S,DONG L,et al.DeepNet:Scaling Transformers to 1,000 Layers [J].arXiv:2203.00555,2022.
[23]SU J,AHMED M,LU Y,et al.Roformer:Enhanced Transformer with Rotary Position Embedding [J].Neurocomputing,2024,568:127063.
[24]HENDRYCKS D,GIMPEL K.Gaussian error linear units(gelus) [J].arXiv:1606.08415,2016.
[25]HOULSBY N,GIURGIU A,JASTRZEBSKI S,et al.Parameter-efficient transfer learning for NLP [C]//International Conference on Machine Learning.PMLR,2019.
[26]LIU X,JI K,FU Y,et al.P-tuning:Prompt tuning can be comparable to fine-tuning across scales and tasks [C]//Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics(Volume 2:Short Papers).2022.
[27]ZHANG T,KISHORE V,WU F,et al.BERTScore:Evaluating Text Generation with BERT [C]//8th International Conference on Learning Representations(ICLR 2020).Addis Ababa,Ethiopia,April 26-30,2020.OpenReview.net,2020.
[28]LIN C Y.ROUGE:A Package for Automatic Evaluation of Summaries [C]//Text Summarization Branches Out.Barcelona,Spain,2004.Association for Computational Linguistics.
[29]PAPINENI K,ROUKOS S,WARD T,et al.BLEU:A Method for Automatic Evaluation of Machine Translation [C]//Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.2002.
[30]BROWN T,MANN B,RYDER N,et al.Language models are few-shot learners [J].Advances in Neural Information Processing Systems,2020,33:1877-1901.
[31]LIN S,HILTON J,EVANS O.TruthfulQA:Measuring How Models Mimic Human Falsehoods [C]//Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics(Volume 1:Long Papers).Dublin,Ireland,2022.Association for Computational Linguistics.
[32]HUANG K H,CHAN H P,JI H.Zero-shot Faithful Factual Error Correction [C]//Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics(Volume 1:Long Papers).Toronto,Canada,2023.Association for Computational Linguistics.
[33]CAI Y,WANG L,WANG Y,et al.MedBench:A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models [C]//Proceedings of the AAAI Conference on Artificial Intelligence.2024:17709-17717.