Computer Science ›› 2023, Vol. 50 ›› Issue (2): 310-316. doi: 10.11896/jsjkx.211100039

• Artificial Intelligence •


Unsupervised Script Summarization Based on Pre-trained Model

SU Qi, WANG Hongling, WANG Zhongqing   

  1. School of Computer Science and Technology,Soochow University,Suzhou,Jiangsu 215006,China
  • Received:2021-11-03 Revised:2022-06-28 Online:2023-02-15 Published:2023-02-22
  • Corresponding author: WANG Hongling(hlwang@suda.edu.cn)
  • About author: SU Qi(20205227102@stu.suda.edu.cn)
  • Supported by:
    National Natural Science Foundation of China(61976146)


Abstract: A script is a special text structure composed of dialogue between characters and descriptions of scenes. Unsupervised script summarization compresses and distills a long script into a short text that captures the script's key information. This paper proposes an unsupervised script summarization method based on a pre-trained model. An additional text-sequence-processing task is introduced during pre-training, so that the resulting model fully accounts for the scene descriptions surrounding dialogue in the script and the emotional characteristics of the characters' speech. This pre-trained model is then used to compute inter-sentence similarity within the script, and the similarities are combined with the TextRank algorithm to score and rank sentences; the highest-scoring sentences are extracted as the summary. Experimental results show that the proposed method outperforms the baseline models, with significant improvement in system performance under ROUGE evaluation.
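A minimal sketch of the pipeline the abstract describes: encode every sentence of the script with a pre-trained model, build an inter-sentence similarity graph, rank sentences with TextRank (PageRank over that graph), and extract the top-scoring sentences as the summary. The authors' own encoder, with its additional sequence-oriented pre-training task, is not public, so the generic sentence-transformers model below is a stand-in; the model name, helper name, and sample script lines are illustrative assumptions, not the paper's implementation.

import networkx as nx
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

def extract_summary(sentences, top_k=3,
                    model_name="paraphrase-multilingual-MiniLM-L12-v2"):
    """Pick the top_k sentences ranked by TextRank over model-based similarity."""
    model = SentenceTransformer(model_name)        # stand-in for the paper's encoder
    embeddings = model.encode(sentences)           # one vector per sentence
    sim = cosine_similarity(embeddings)            # inter-sentence similarity matrix
    np.fill_diagonal(sim, 0.0)                     # drop self-loops before building the graph
    graph = nx.from_numpy_array(sim)               # weighted, undirected sentence graph
    scores = nx.pagerank(graph, weight="weight")   # TextRank scoring step
    best = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return [sentences[i] for i in sorted(best)]    # restore original script order

if __name__ == "__main__":
    script = [
        "INT. KITCHEN - NIGHT. Rain hammers the window.",
        "ANNA: I never meant for any of this to happen.",
        "She sets the letter on the table and walks out.",
        "TOM: Then why did you keep it all these years?",
    ]
    print(extract_summary(script, top_k=2))

Because TextRank is PageRank run on a sentence-similarity graph (see references [2] and [18] below), networkx's stock pagerank suffices; only the similarity function changes when a different pre-trained encoder is plugged in.

The abstract also reports significant gains under ROUGE. The paper does not say which scoring toolkit it used; the snippet below uses the open-source rouge-score package as one plausible choice, with invented reference and candidate strings (Chinese text would normally be word-segmented before scoring).

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "Anna leaves a letter confessing the truth and walks out on Tom."
candidate = "Anna sets the letter on the table and walks out."
for name, result in scorer.score(reference, candidate).items():
    # each result carries precision, recall, and F1
    print(f"{name}: P={result.precision:.3f} R={result.recall:.3f} F1={result.fmeasure:.3f}")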

Key words: Pre-trained model, Pre-training task, Script summarization, Unsupervised, Inter-sentence similarity, Dialogue

CLC Number: TP391

References:
[1]DEVLIN J,CHANG M W,LEE K,et al.BERT:Pre-training of deep bidirectional transformers for language understanding[J].arXiv:1810.04805,2018.
[2]MIHALCEA R,TARAU P.TextRank:Bringing order into text[C]//Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing.2004:404-411.
[3]LIU Y.Fine-tune BERT for extractive summarization[J].arXiv:1903.10318,2019.
[4]VASWANI A,SHAZEER N,PARMAR N,et al.Attention is all you need[C]//Advances in Neural Information Processing Systems.2017:5998-6008.
[5]KANO R,MIURA Y,TANIGUCHI T,et al.Identifying Implicit Quotes for Unsupervised Extractive Summarization of Conversations[C]//Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing.2020:291-302.
[6]PAPALAMPIDI P,KELLER F,FRERMANN L,et al.Screenplay Summarization Using Latent Narrative Structure[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.2020:1920-1933.
[7]ZHOU Q,WEI F,ZHOU M.At Which Level Should We Extract?An Empirical Analysis on Extractive Document Summarization[C]//Proceedings of the 28th International Conference on Computational Linguistics.2020:5617-5628.
[8]FENG X,FENG X,QIN L,et al.Language model as an annotator:Exploring DialoGPT for dialogue summarization[J].arXiv:2105.12544,2021.
[9]ZOU Y,ZHU B,HU X,et al.Low-Resource Dialogue Summarization with Domain-Agnostic Multi-Source Pretraining[J].arXiv:2109.04080,2021.
[10]CHEN J,YANG D.Simple Conversational Data Augmentation for Semi-supervised Abstractive Dialogue Summarization[C]//Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing.2021:6605-6616.
[11]ZHAO L,ZENG W,XU W,et al.Give the Truth:Incorporate Semantic Slot into Abstractive Dialogue Summarization[C]//Findings of the Association for Computational Linguistics:EMNLP 2021.2021:2435-2446.
[12]ZOU Y,ZHAO L,KANG Y,et al.Topic-oriented spoken dialogue summarization for customer service with saliency-aware topic modeling[J].arXiv:2012.07311,2020.
[13]DAI A M,LE Q V.Semi-supervised sequence learning[J].Advances in Neural Information Processing Systems,2015,28:3079-3087.
[14]ZHANG H,CAI J,XU J,et al.Pretraining-Based Natural Language Generation for Text Summarization[C]//Proceedings of the 23rd Conference on Computational Natural Language Learning(CoNLL).2019:789-797.
[15]LIU Y,LAPATA M.Text Summarization with Pretrained Encoders[C]//Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing(EMNLP-IJCNLP).2019:3730-3740.
[16]LI R N.Research on Semantic-based Text Similarity Calculation Method[D].Beijing:Beijing University of Technology,2018.
[17]HOCHREITER S,SCHMIDHUBER J.Long short-term memory[J].Neural Computation,1997,9(8):1735-1780.
[18]PAGE L,BRIN S,MOTWANI R,et al.The PageRank citation ranking:Bringing order to the web[R].Stanford InfoLab,1999.
[19]LIN C Y.ROUGE:A package for automatic evaluation of summaries[C]//Text Summarization Branches Out.2004:74-81.
[20]DONG L,YANG N,WANG W,et al.Unified language model pre-training for natural language understanding and generation[C]//Proceedings of the 33rd International Conference on Neural Information Processing Systems.2019:13063-13075.
[21]SUTSKEVER I,VINYALS O,LE Q V.Sequence to sequence learning with neural networks[C]//Advances in Neural Information Processing Systems.2014:3104-3112.