Computer Science ›› 2025, Vol. 52 ›› Issue (10): 217-230. doi: 10.11896/jsjkx.241200055

• Artificial Intelligence •

  • Corresponding author: XIAO Yanghua (shawyh@fudan.edu.cn)
  • First author: CHEN Yuyan (chenyuyan21@m.fudan.edu.cn)

SPEAKSMART: Evaluating Empathetic Persuasive Responses by Large Language Models

CHEN Yuyan1, JIA Jiyuan2, CHANG Jingwen1, ZUO Kaiwen3, XIAO Yanghua1   

  1. 1 School of Computer Science and Technology, Fudan University, Shanghai 200438, China
    2 Department of Electronic and Electrical Engineering, Southern University of Science and Technology, Shenzhen, Guangdong 518055, China
    3 Department of Computer Science, University of Warwick, Coventry CV4 7AL, United Kingdom
  • Received: 2024-12-09 Revised: 2025-02-18 Online: 2025-10-15 Published: 2025-10-14
  • About author: CHEN Yuyan, born in 1996, Ph.D. Her main research interests include natural language processing, knowledge graphs, question-answering systems, dialogue systems and multimodal cognitive intelligence.
    XIAO Yanghua, Ph.D, professor, Ph.D supervisor. His main research interests include knowledge graphs, semantic representation and reasoning; large language models with evaluation, robustness, reliability enhancement, and controllable generation; and socially inspired artificial intelligence, encompassing social computing, causal inference, and interpretable decision-making.


Abstract: In recent years, large language models (LLMs) have shown impressive capabilities in emotional dialogue and strong goal-achievement abilities. However, existing research mainly focuses on providing comfort through empathetic responses, rather than on achieving specific real-world goals with these responses. To address this gap, this paper proposes a benchmark named SPEAKSMART, covering five scenarios, to evaluate LLMs' ability to achieve real-world goals through highly empathetic responses in conversations. A two-dimensional evaluation framework based on provider satisfaction and requester satisfaction is then introduced. Various LLMs are evaluated on SPEAKSMART, and a baseline approach is designed to enhance their ability to generate empathetic and persuasive responses in conversations. Experiments reveal that Claude3 and LLaMA3-70B perform best across the different scenarios, while other LLMs show room for improvement. This research lays a foundation for enhancing LLMs' ability to handle real-world tasks that require highly empathetic responses to achieve goals.
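The two-dimensional framework described above can be pictured in code. The following is a minimal sketch, assuming each axis is normalized to [0, 1] and the two axes are aggregated by a weighted mean; the class name, field names, and weighting scheme are illustrative stand-ins, not SPEAKSMART's actual scoring rubric.

```python
# Illustrative sketch (NOT the paper's actual metric): each generated
# response is scored along two axes -- provider satisfaction (does the
# response advance the responder's real-world goal?) and requester
# satisfaction (does the requester feel understood?).
from dataclasses import dataclass


@dataclass
class DialogueScore:
    provider_satisfaction: float   # in [0, 1]: goal achievement for the provider
    requester_satisfaction: float  # in [0, 1]: perceived empathy for the requester

    def combined(self, weight: float = 0.5) -> float:
        """Weighted mean of the two axes; `weight` is the provider-side weight."""
        if not 0.0 <= weight <= 1.0:
            raise ValueError("weight must lie in [0, 1]")
        return (weight * self.provider_satisfaction
                + (1.0 - weight) * self.requester_satisfaction)


score = DialogueScore(provider_satisfaction=0.8, requester_satisfaction=0.6)
print(round(score.combined(), 2))  # equal weighting of both axes prints 0.7
```

Keeping the two axes separate until the final aggregation step is what lets the benchmark expose trade-offs: a response can comfort the requester while failing the provider's goal, or vice versa.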

Key words: Large language models, Emotional dialogue, Goal achievement, SPEAKSMART benchmark, Empathetic response

CLC Number: 

  • TP391