Computer Science ›› 2025, Vol. 52 ›› Issue (10): 266-274. DOI: 10.11896/jsjkx.250100023

• Artificial Intelligence •


Summary Faithfulness Evaluation Based on Data Augmentation and Two-stage Training

ZHAO Jinshuang, HUANG Degen   

  1. School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116024, China
  • Received: 2025-01-06 Revised: 2025-04-29 Online: 2025-10-15 Published: 2025-10-14
  • Corresponding author: HUANG Degen (huangdg@dlut.edu.cn)
  • About author: ZHAO Jinshuang (jinshuangz@mail.dlut.edu.cn), born in 2000, postgraduate, is a member of CCF (No. Z2722G). Her main research interests include natural language processing and text summarization.
    HUANG Degen, born in 1965, Ph.D., professor, is a member of CCF (No. 17961S). His main research interests include natural language processing, machine translation and text summarization.
  • Supported by:
    Key R&D Program of Yunnan Province (202203AA080004) and National Natural Science Foundation of China (U1936109).


Abstract: The faithfulness of a text summary, i.e., its factual consistency with the original text, is essential for the practical application of automatic text summarization. Existing faithfulness evaluation methods underuse text summarization datasets, and the unfaithful summaries they construct differ markedly from the original texts, which limits their effectiveness. To address this problem, this paper proposes FaithEval, a summary faithfulness evaluation model based on data augmentation and two-stage training. First, two data augmentation methods are defined, Similarity Search with Same Topic and Insert and Fill External Mask, which generate summaries that are related but not faithful to the original texts; these methods are applied to extract training data from text summarization datasets. Second, to fully exploit the dataset information, the model is trained in two stages on training data constructed from the original texts and from the reference summaries, progressively strengthening its faithfulness evaluation ability. Finally, a test set for summary faithfulness evaluation, SFETS, is manually constructed to provide a benchmark for measuring model performance. Experiments show that FaithEval performs well on both the SFETS and Rank19 datasets and achieves state-of-the-art performance on SFETS.
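The two augmentation methods are only named at a high level in the abstract, so the Python sketch below is a hypothetical reconstruction, not the paper's code. `similar_summary_same_topic` pairs each reference summary with the most similar summary of a different article that falls under the same topic (LDA is used here as one plausible topic model); `insert_and_fill` is one plausible reading of Insert and Fill External Mask, masking a few words of a reference summary and letting an external masked LM, which never sees the source, fill them in. All function names, model choices, and hyperparameters are assumptions.

```python
# Hypothetical sketch of the two augmentation methods; names, models and
# hyperparameters are illustrative assumptions, not the paper's implementation.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline

def similar_summary_same_topic(ref_summaries, n_topics=20):
    """Similarity Search with Same Topic (sketch): for each reference summary,
    retrieve the most similar summary of a *different* article under the same
    topic, to serve as a related-but-unfaithful negative example."""
    counts = CountVectorizer(max_features=5000).fit_transform(ref_summaries)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    topic_of = lda.fit_transform(counts).argmax(axis=1)
    sim = cosine_similarity(TfidfVectorizer().fit_transform(ref_summaries))
    np.fill_diagonal(sim, -1.0)            # a summary must not retrieve itself
    negatives = {}
    for i in range(len(ref_summaries)):
        peers = np.where(topic_of == topic_of[i])[0]
        peers = peers[peers != i]
        if peers.size:                     # skip topics with a single document
            negatives[i] = ref_summaries[peers[np.argmax(sim[i, peers])]]
    return negatives

fill_mask = pipeline("fill-mask", model="bert-base-uncased")  # external filler LM

def insert_and_fill(summary, n_edits=2, seed=0):
    """Insert and Fill External Mask (one plausible reading): mask a few words
    of the reference summary and fill each mask with an external LM that never
    sees the source article, so the output stays fluent and on-topic but is no
    longer guaranteed to be faithful."""
    words = summary.split()
    rng = np.random.default_rng(seed)
    for pos in rng.choice(len(words), size=min(n_edits, len(words)), replace=False):
        words[pos] = fill_mask.tokenizer.mask_token
        words[pos] = fill_mask(" ".join(words), top_k=1)[0]["token_str"]
    return " ".join(words)
```

Pairing each original text with `negatives[i]` or with `insert_and_fill(ref)` would yield labeled unfaithful examples without any manual annotation, which matches the abstract's goal of extracting training data directly from summarization datasets.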

Key words: Text summarization, Faithfulness evaluation, Data augmentation, Two-stage training, Benchmark test set
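The two-stage schedule itself is only outlined in the abstract (stage one on data built from the original texts, stage two on data built from the reference summaries), so the loop below is a minimal schematic sketch of such a schedule on top of a generic BERT-style binary classifier. The model choice, hyperparameters, and the toy stage data are all assumptions for illustration.

```python
# Minimal schematic of a two-stage fine-tuning schedule for a binary
# faithfulness classifier; everything below is an assumption for illustration.
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)   # label 1 = faithful, 0 = unfaithful

def run_stage(pairs, labels, epochs=1, lr=2e-5):
    """One training stage: fine-tune on (source text, candidate summary) pairs."""
    enc = tokenizer([s for s, _ in pairs], [c for _, c in pairs],
                    truncation=True, padding=True, return_tensors="pt")
    loader = DataLoader(
        list(zip(enc["input_ids"], enc["attention_mask"], torch.tensor(labels))),
        batch_size=8, shuffle=True)
    optim = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for input_ids, attention_mask, y in loader:
            loss = model(input_ids=input_ids,
                         attention_mask=attention_mask, labels=y).loss
            loss.backward()
            optim.step()
            optim.zero_grad()

# Toy placeholder data; in the paper's setting, stage-1 pairs would be built
# from the original texts and stage-2 pairs from the reference summaries, each
# with augmented negatives such as those sketched above.
stage1 = ([("the cat sat on the mat", "a cat sat on a mat"),
           ("the cat sat on the mat", "the dog barked all night")], [1, 0])
stage2 = stage1
run_stage(*stage1, epochs=1, lr=2e-5)    # stage 1: source-text based data
run_stage(*stage2, epochs=1, lr=1e-5)    # stage 2: reference-summary based data
```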

CLC Number: TP391