Computer Science ›› 2025, Vol. 52 ›› Issue (10): 266-274.doi: 10.11896/jsjkx.250100023

• Artificial Intelligence •

Summary Faithfulness Evaluation Based on Data Augmentation and Two-stage Training

ZHAO Jinshuang, HUANG Degen   

  1. School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116024, China
  • Received:2025-01-06 Revised:2025-04-29 Online:2025-10-15 Published:2025-10-14
  • About author: ZHAO Jinshuang, born in 2000, postgraduate, is a member of CCF (No. Z2722G). Her main research interests include natural language processing and text summarization.
    HUANG Degen, born in 1965, Ph.D., professor, is a member of CCF (No. 17961S). His main research interests include natural language processing, machine translation and text summarization.
  • Supported by:
    Key R&D Program of Yunnan Province (202203AA080004) and National Natural Science Foundation of China (U1936109).

Abstract: The faithfulness of a text summary, i.e., its factual consistency with the original text, is crucial for the practical application of automatic text summarization. Existing faithfulness evaluation methods make limited use of text summarization datasets, and the unfaithful summaries they construct differ markedly from the original texts, which limits their effectiveness. To address this problem, this paper proposes FaithEval, a summary faithfulness evaluation model based on data augmentation and two-stage training. First, two data augmentation methods are defined, Similarity Search with Same Topic and Insert and Fill External Mask, which generate summaries that are related but not faithful to the original texts; these methods are used to construct training data from a text summarization dataset. Second, to exploit the dataset information fully, the model is trained in two stages on training data built from the original texts and from the reference summaries, progressively strengthening its faithfulness evaluation ability. Finally, SFETS, a test set for summary faithfulness evaluation, is constructed manually to provide a benchmark for measuring model performance. Experiments show that FaithEval performs well on both the SFETS and Rank19 datasets and achieves state-of-the-art performance on SFETS.
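Both augmentation strategies named in the abstract aim to produce summaries that are topically related to a source document yet factually unfaithful, so that a classifier can be trained to tell the two apart. The following is a minimal illustrative sketch only: the paper fills masks with an external model, whereas this toy version swaps an entity from one document's reference summary with an entity drawn from a different document; the function name `make_unfaithful` and the example sentences are hypothetical, not from the paper.

```python
import random

def make_unfaithful(summary, entities, external_entities, rng):
    """Build a negative (unfaithful) training example by replacing one
    entity in the reference summary with an entity taken from another
    document. A stand-in for the paper's mask-and-fill augmentation."""
    # Pick an entity that actually occurs in the summary to corrupt.
    target = rng.choice([e for e in entities if e in summary])
    # Replace it with a foreign entity, yielding a related but
    # factually inconsistent summary.
    replacement = rng.choice([e for e in external_entities if e != target])
    return summary.replace(target, replacement)

rng = random.Random(0)
summary = "Apple opened a new campus in Austin."
negative = make_unfaithful(summary, ["Apple", "Austin"], ["Google", "Berlin"], rng)
print(negative)  # an entity-swapped, unfaithful variant of the summary
```

Pairs of (original text, faithful reference summary) and (original text, corrupted summary) produced this way would then serve as positive and negative examples in the two training stages.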

Key words: Text summarization, Faithfulness evaluation, Data augmentation, Two-stage training, Benchmark test set

CLC Number: TP391
[1]KRYSCINSKI W,KESKAR N S,MCCANN B,et al.Neural Text Summarization:A Critical Evaluation[C]//Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing.2019:540-551.
[2]WU S X,HUANG D G,LI J Y.Abstractive Text Summarization Based on Semantic Alignment Network[J].Acta Scientiarum Naturalium Universitatis Pekinensis,2021,57(1):1-6.
[3]CHEANG C,CHAN H,WONG D,et al.TempoSum:Evaluating the Temporal Generalization of Abstractive Summarization[J].arXiv:2305.01951v1,2023.
[4]SUN K L,LUO X D,LUO Y R.Survey of Applications of Pretrained Language Models[J].Computer Science,2023,50(1):176-184.
[5]CAO Z,WEI F,LI W,et al.Faithful to the Original:Fact Aware Neural Abstractive Summarization[C]//Proceedings of the 32nd AAAI Conference on Artificial Intelligence.AAAI,2018:4784-4791.
[6]PAGNONI A,BALACHANDRAN V,TSVETKOV Y.Understanding Factuality in Abstractive Summarization with FRANK:A Benchmark for Factuality Metrics[C]//Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies.ACL,2021:4812-4829.
[7]KRYSCINSKI W,MCCANN B,XIONG C,et al.Evaluating the Factual Consistency of Abstractive Text Summarization[C]//Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing.ACL,2020:9332-9346.
[8]CAO M,DONG Y,WU J,et al.Factual Error Correction for Abstractive Summarization Models[C]//Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing.ACL,2020:6251-6258.
[9]DEVLIN J,CHANG M W,LEE K,et al.BERT:Pre-training of Deep Bidirectional Transformers for Language Understanding[C]//Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics:Human Language Technologies.2019:4171-4186.
[10]LEE H,YOO K M,PARK J,et al.Masked Summarization to Generate Factually Inconsistent Summaries for Improved Factual Consistency Checking[C]//Proceedings of the Findings of the Association for Computational Linguistics.ACL,2022:1019-1030.
[11]FALKE T,RIBEIRO L F R,UTAMA P A,et al.Ranking Generated Summaries by Correctness:An Interesting but Challenging Application for Natural Language Inference[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.ACL,2020:2214-2220.
[12]HUANG Y,FENG X,FENG X,et al.The Factual Inconsistency Problem in Abstractive Text Summarization:A Survey[J].arXiv:2104.14839,2021.
[13]LUO Z,XIE Q,ANANIADOU S.ChatGPT as a Factual Inconsistency Evaluator for Abstractive Text Summarization[J].arXiv:2303.15621v1,2023.
[14]GOODRICH B,RAO V,LIU P J,et al.Assessing The Factual Accuracy of Generated Text[C]//Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.2019:166-175.
[15]SCIALOM T,DRAY P A,GALLINARI P,et al.QuestEval:Summarization Asks for Fact-based Evaluation[C]//Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.ACL,2021:6594-6604.
[16]DURMUS E,HE H,DIAB M.FEQA:A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics.ACL,2020:5055-5070.
[17]LIN C.ROUGE:A Package for Automatic Evaluation of Summaries[C]//Proceedings of the Meeting of the Association for Computational Linguistics.2004:74-81.
[18]ZHANG T,KISHORE V,WU F,et al.BERTScore:Evaluating Text Generation with BERT[C]//Proceedings of the 8th International Conference on Learning Representations.2020.
[19]KOCMI T,FEDERMANN C.Large Language Models Are State-of-the-Art Evaluators of Translation Quality[C]//Proceedings of the 24th Annual Conference of the European Association for Machine Translation.Tampere,Finland:European Association for Machine Translation,2023:193-203.
[20]WANG J,LIANG Y,MENG F,et al.Is ChatGPT a Good NLG Evaluator? A Preliminary Study[C]//Proceedings of the 4th New Frontiers in Summarization Workshop.ACL,2023:1-11.
[21]LIU Y,ITER D,XU Y,et al.G-EVAL:NLG Evaluation using GPT-4 with Better Human Alignment[C]//Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.ACL,2023:2511-2522.
[22]OPENAI.GPT-4 Technical Report[J].arXiv:2303.08774,2023.
[23]WANG P,LI L,CHEN L,et al.Large Language Models are not Fair Evaluators[J].arXiv:2305.17926v2,2023.
[24]GEKHMAN Z,HERZIG J,AHARONI R,et al.TrueTeacher:Learning Factual Consistency Evaluation with Large Language Models[C]//Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing.ACL,2023:2053-2070.
[25]BLEI D M,NG A Y,JORDAN M I.Latent Dirichlet Allocation[C]//Proceedings of the 15th Annual Neural Information Processing Systems Conference.Vancouver,BC:Neural Information Processing Systems Foundation,2002:601-608.
[26]LEWIS M,LIU Y,GOYAL N,et al.BART:Denoising Sequence-to-Sequence Pre-training for Natural Language Generation,Translation,and Comprehension[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics.ACL,2020:7871-7880.
[27]HU B,CHEN Q,ZHU F.LCSTS:A Large Scale Chinese Short Text Summarization Dataset[C]//Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing.ACL,2015:1967-1972.
[28]HERMANN K M,KOCISKY T,GREFENSTETTE E,et al.Teaching Machines to Read and Comprehend[C]//Proceedings of the 29th Annual Conference on Neural Information Processing Systems.Montreal,QC:Neural Information Processing Systems Foundation,2015:1693-1701.
[29]TOUVRON H,LAVRIL T,IZACARD G,et al.LLaMA:Open and Efficient Foundation Language Models[J].arXiv:2302.13971,2023.
[30]GLM T,ZENG A,XU B,et al.ChatGLM:A Family of Large Language Models from GLM-130B to GLM-4 All Tools[J].arXiv:2406.12793,2024.
[31]CHUNG H W,HOU L,LONGPRE S,et al.Scaling Instruction-Finetuned Language Models[J].arXiv:2210.11416,2022.