Computer Science ›› 2025, Vol. 52 ›› Issue (10): 217-230. doi: 10.11896/jsjkx.241200055

• Artificial Intelligence •

SPEAKSMART:Evaluating Empathetic Persuasive Responses by Large Language Models

CHEN Yuyan1, JIA Jiyuan2, CHANG Jingwen1, ZUO Kaiwen3, XIAO Yanghua1   

  1 School of Computer Science and Technology,Fudan University,Shanghai 200438,China
    2 Department of Electronic and Electrical Engineering,Southern University of Science and Technology,Shenzhen,Guangdong 518055,China
    3 Department of Computer Science,University of Warwick,Coventry CV4 7AL,United Kingdom
  • Received:2024-12-09 Revised:2025-02-18 Online:2025-10-15 Published:2025-10-14
  • About author:CHEN Yuyan,born in 1996,Ph.D.Her main research interests include natural language processing,knowledge graphs,question-answering systems,dialogue systems and multimodal cognitive intelligence.
    XIAO Yanghua,Ph.D,professor,Ph.D supervisor.His main research interests include knowledge graphs,semantic representation and reasoning;large language models with evaluation,robustness,reliability enhancement,and controllable ge-neration;socially inspired artificial intelligence,encompassing social computing,causal inference,and interpretable decision-making.

Abstract: In recent years,LLMs have demonstrated remarkable capabilities in emotional dialogue and strong goal-achievement abilities.However,existing research mainly focuses on providing comfort through empathetic responses,rather than on using such responses to achieve specific real-world goals.To address this gap,this paper proposes a benchmark named SPEAKSMART,covering five scenarios,to evaluate LLMs' ability to achieve real-world goals through highly empathetic responses in conversations.A two-dimensional evaluation framework based on provider satisfaction and requester satisfaction is then introduced.Various LLMs are evaluated on SPEAKSMART,and a baseline approach is designed to enhance their ability to generate empathetic and persuasive responses in conversations.Experiments reveal that Claude3 and LLaMA3-70B perform best across the scenarios,while other LLMs leave room for improvement.This research lays the foundation for enhancing LLMs' ability to handle real-world tasks that require highly empathetic responses to achieve goals.
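To make the two-dimensional framework concrete,the following sketch shows one plausible way such scoring could be implemented in the LLM-as-judge style:each dialogue is rated on a 1-5 scale for provider satisfaction (goal achievement) and requester satisfaction (empathy and acceptability),and both dimensions are averaged over the benchmark.The prompt wording,the ask_judge interface,and the mean aggregation are illustrative assumptions,not the paper's released implementation.

# Minimal sketch (assumptions noted above),not the SPEAKSMART release code.
from dataclasses import dataclass
from statistics import mean

JUDGE_PROMPT = (
    "You are an impartial judge. Read the dialogue below and rate, on a "
    "1-5 scale, (a) provider satisfaction: how well the responder's own "
    "goal is advanced, and (b) requester satisfaction: how empathetic and "
    "acceptable the responses feel to the other party.\n"
    "Dialogue:\n{dialogue}\n"
    "Answer with two integers separated by a comma, e.g. '4,3'."
)

@dataclass
class Scores:
    provider: float   # goal-achievement dimension
    requester: float  # empathy/acceptance dimension

def judge_dialogue(dialogue: str, ask_judge) -> Scores:
    # ask_judge is any callable that sends a prompt to a judge LLM and
    # returns its text reply (hypothetical interface).
    reply = ask_judge(JUDGE_PROMPT.format(dialogue=dialogue))
    provider, requester = (int(x) for x in reply.strip().split(","))
    return Scores(provider, requester)

def benchmark_score(dialogues, ask_judge) -> Scores:
    # Average both dimensions over all benchmark dialogues.
    scored = [judge_dialogue(d, ask_judge) for d in dialogues]
    return Scores(mean(s.provider for s in scored),
                  mean(s.requester for s in scored))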

Key words: Large language models,Emotional dialogue,Goal achievement,SPEAKSMART benchmark,Empathetic response

CLC Number: TP391
[1]LUO M,WARREN C J,CHENG L,et al.Assessing empathy in large language models with real-world physician-patient interactions [J].arXiv:2405.16402,2024.
[2]WOODSIDE A G,SOOD S,MILLER K E.When consumers and brands talk:Storytelling theory and research in psychology and marketing [J].Psychology & Marketing,2008,25(2):97-145.
[3]ALMAZROUEI E,ALOBEIDLI H,ALSHAMSI A,et al.The Falcon series of open language models [J].arXiv:2311.16867,2023.
[4]JIANG H,ZHANG X,CAO X,et al.PersonaLLM:Investigating the ability of GPT-3.5 to express personality traits and gender differences [J].arXiv:2305.02547,2023.
[5]LEE Y K,SUH J,ZHAN H,et al.Large language models produce responses perceived to be empathic [J].arXiv:2403.18148,2024.
[6]OpenAI,ACHIAM J,ADLER S,et al.GPT-4 technical report [J].arXiv:2303.08774,2023.
[7]LOH S B,SESAGIRI RAAMKUMAR A.Harnessing large language models' empathetic response generation capabilities for online mental health counselling support [J].arXiv:2310.08017,2023.
[8]ULLMAN T.Large language models fail on trivial alterations to theory-of-mind tasks [J].arXiv:2302.08399,2023.
[9]ZHAO W X,ZHAO Y Y,LU X,et al.Is ChatGPT equipped with emotional dialogue capabilities? [J].arXiv:2304.09582,2023.
[10]ABDELNABI S,GOMAA A,SIVAPRASAD S,et al.LLM-deliberation:Evaluating LLMs with interactive multi-agent negotiation games [J].arXiv:2309.17234,2023.
[11]GRATTAFIORI A,DUBEY A,JAUHRI A,et al.The Llama 3 herd of models [J].arXiv:2407.21783,2024.
[12]BIANCHI F,CHIA P J,YUKSEKGONUL M,et al.How well can LLMs negotiate? NegotiationArena platform and analysis [J].arXiv:2402.05863,2024.
[13]KWON D,WEISS E,KULSHRESTHA T,et al.Are LLMs effective negotiators? Systematic evaluation of the multifaceted capabilities of LLMs in negotiation dialogues [J].arXiv:2402.13550,2024.
[14]LI H,LEUNG J,SHEN Z.Towards goal-oriented large language model prompting:A survey [J].arXiv:2401.14043,2024.
[15]CHEN Z,WHITE M,MOONEY R,et al.When is tree search useful for LLM planning? It depends on the discriminator [J].arXiv:2402.10890,2024.
[16]ZHANG Q,WANG Y,YU T,et al.RevisEval:Improving LLM-as-a-judge via response-adapted references[J].arXiv:2410.05193,2024.
[17]BANDURA A.Self-efficacy:Toward a unifying theory of behavioral change [J].Psychological Review,1977,84(2):191.
[18]PETTY R E,CACIOPPO J T.The elaboration likelihood model of persuasion[J].Advances in Experimental Social Psychology,1986,19:123-205.
[19]BREHM J W.A theory of psychological reactance [M].Academic Press,1966.
[20]DECI E L,RYAN R M.Intrinsic motivation and self-determination in human behavior [M].Springer Science & Business Media,2013.
[21]SKINNER B F.Science and human behavior [M].New York:Simon and Schuster,1953.
[22]SKINNER B F.Science and human behavior (Vol.92904) [M].New York:Simon and Schuster,1965.
[23]SKINNER B F.The behavior of organisms:An experimental analysis [M].BF Skinner Foundation,2019.
[24]TAJFEL H.Experiments in intergroup discrimination [J].Scientific American,1970,223(5):96-103.
[25]HOMANS G C.The human group [M].Routledge,2017.
[26]CIALDINI R B.Influence:The psychology of persuasion [M].New York:Collins,2007.
[27]ZHANG J D,LIU J F,WANG Z Y,et al.AI Question-Answering Driven by Large Models in User-Responsive Scenarios:Taking Medical Triage as an Example[J].Journal of Nanjing University (Information Management Edition),2025,41(1):100-120.
[28]LOEWENSTEIN G.The psychology of curiosity:A review and reinterpretation[J].Psychological Bulletin,1994,116(1):75-98.
[29]CSIKSZENTMIHALYI M.Beyond boredom and anxiety:Experiencing flow in work and play[M].Jossey-Bass,2000.
[30]CHEN Y,YUAN Y,LIU P,et al.Talk funny! A large-scale humor response dataset with chain-of-humor interpretation [C]//Proceedings of the AAAI Conference on Artificial Intelligence.2024:17826-17834.
[31]PAL D,VANIJJA V,THAPLIYAL H,et al.What affects the usage of artificial conversational agents? An agent personality and love theory perspective [J].Computers in Human Behavior,2023,145:107788.
[32]CHEN Y R,XING X F,LIN J K,et al.SoulChat:Improving LLMs' empathy,listening,and comfort abilities through fine-tuning with multi-turn empathy conversations [C]//Findings of the Association for Computational Linguistics:EMNLP 2023.2023:1170-1183.
[33]LE SCAO T,FAN A,AKIKI C,et al.BLOOM:A 176B-parameter open-access multilingual language model [J].arXiv:2211.05100,2022.
[34]BAI Y,KADAVATH S,KUNDU S,et al.Constitutional AI:Harmlessness from AI feedback [J].arXiv:2212.08073,2022.
[35]ANTHROPIC.Claude 3 Haiku:Our fastest model yet [EB/OL].https://www.anthropic.com.
[36]BROWN T B,MANN B,RYDER N,et al.Language models are few-shot learners [C]//Proceedings of the 34th International Conference on Neural Information Processing Systems(NIPS'20).2020:1877-1901.
[37]TOUVRON H,MARTIN L,STONE K,et al.Llama 2:Open foundation and fine-tuned chat models [J].arXiv:2307.09288,2023.
[38]CHIANG W L,LI Z,LIN Z,et al.Vicuna:An open-source chatbot impressing GPT-4 with 90% ChatGPT quality [EB/OL].https://vicuna.lmsys.org.
[39]ZHENG L M,CHIANG W L,SHENG Y,et al.Judging LLM-as-a-judge with MT-Bench and Chatbot Arena [C]//Proceedings of the 37th International Conference on Neural Information Processing Systems.2023:46595-46623.
[40]PRAKASH V,LEE K,BHATTACHARYA A,et al.Assessment of LLM Responses to End-user Security Questions[J].arXiv:2411.14571,2024.
[41]ZOU Z,MUBIN O,ALNAJJAR F,et al.A pilot study of measuring emotional response and perception of LLM-generated questionnaire and human-generated questionnaires[J].Scientific Reports,2024,14(1):2781.
[42]ZENG H,NIU C,WU F,et al.Personalized LLM for Generating Customized Responses to the Same Query from Different Users[J].arXiv:2412.11736,2024.
[43]ZHOU Y,HUANG Z,LU F,et al.Don't Say No:Jailbreaking LLM by Suppressing Refusal[J].arXiv:2404.16369,2024.
[44]YADKORI Y A,KUZBORSKIJ I,GYÖRGY A,et al.To Believe or Not to Believe Your LLM[J].arXiv:2406.02543,2024.
[45]PHUTE M,HELBLING A,HULL M,et al.LLM self defense:By self examination,LLMs know they are being tricked[J].arXiv:2308.07308,2023.
[46]MCKNIGHT P E,NAJAB J.Mann-Whitney U Test[J].The Corsini Encyclopedia of Psychology,2010,84(3):1.
[47]CHEONG I,XIA K,FENG K J K,et al.(A)I Am Not a Lawyer,But…:Engaging Legal Experts towards Responsible LLM Policies for Legal Advice[C]//The 2024 ACM Conference on Fairness,Accountability,and Transparency.2024:2454-2469.
[48]CHEN Y,LIU Y,YAN J,et al.See what LLMs cannot answer:A self-challenge framework for uncovering LLM weaknesses[J].arXiv:2408.08978,2024.
[49]LI M,SU Y S,HUANG H Y,et al.Language-specific representation of emotion-concept knowledge causally supports emotion inference [J].iScience,2024,27(12):11401.
[50]LEE Y J,LIM C G,CHOI H J.Does GPT-3 generate empathetic dialogues? A novel in-context example selection method and automatic evaluation metric for empathetic dialogue generation [C]//Proceedings of the 29th International Conference on Computational Linguistics.2022:669-683.
[51]JIANG S W,ZHANG J W,HUA L S,et al.Implementation of a Meteorological Database Question-Answering Model Based on Large Model Retrieval-Augmented Generation[J].Application Research of Computers,2024,41(2):45-56.
[52]TIAN Y L,SI F D,NIU L,et al.Research on Fault Tree Intelligent Question-Answering Method Based on Large Model Decision-Making[J].Journal of Systems Engineering,2024,42(5):78-89.
[53]ZHANG J Y,WANG T K,MO C Y,et al.Construction and Evaluation of an Electric Power Knowledge Base Intelligent Question-Answering System Based on Large Language Models[J].Computer Science and Applications,2024,41(6):23-34.
[54]TAO X Y.Research on Intelligent Question-Answering System of Large Language Models Based on Hybrid Architecture[J].Posts and Telecommunications Design Technology,2024 (5):48-55.
[55]LI B X.Stable Output Method of Retrieval-Augmented Large Models for Private Question-Answering Systems[J].CAAI Transactions on Intelligent Systems,2024,42(4):67-78.
[56]CHEN J Z,WANG S Y,LUO H R.Knowledge Graph Question-Answering Integrating Large Model Fine-Tuning and Graph Neural Networks[J].Computer Engineering and Applications,2024,60(24):166-175.
[57]HUANG Z,SHAN W Z,GUO Z P,et al.Design and Implementation of a Trustworthy Large Model Government Affairs Question-Answering System[C]//Proceedings of the 2024 World Intelligent Industry Expo on Artificial Intelligence Security Governance Theme Forum.2024:193-197.
[58]CHEN D H,LU X,ZHANG Y F.Research on Question-Answering System in the Bidding Field Based on LangChain+LLM[J].Journal of Hubei University of Economics (Statistics and Mathematics Edition),2024,15(3):45-55.
[59]ZHANG J D,LIU J F,WANG Z Y,et al.AI Question-Answering Driven by Large Models in User-Responsive Scenarios:Taking Medical Triage as an Example[J].Journal of Nanjing University (Information Management Edition),2025,41(1):100-120.
[60]ZHAN H L,WANG Y F,FENG T,et al.Let's negotiate! A survey of negotiation dialogue systems [J].arXiv:2402.01097,2024.
[61]HUA Y,QU L,HAFFARI G.Assistive large language model agents for socially-aware negotiation dialogues [J].arXiv:2402.01737,2024.