计算机科学 ›› 2020, Vol. 47 ›› Issue (11A): 97-105.doi: 10.11896/jsjkx.200300087

• 人工智能 • 上一篇    下一篇

基于稀疏表示的多文档自动摘要

钱玲龙, 武娇, 王人锋, 陆慧娟   

  1. 中国计量大学 杭州 310018
  • 出版日期:2020-11-15 发布日期:2020-11-17
  • 通讯作者: 陆慧娟(hjlu@cjlu.edu.cn)
  • 作者简介:linglong.qian@kcl.uk
  • 基金资助:
    国家自然科学基金(61272315,61602431);浙江省自然科学基金(LQ20F030015);国家级大学生创新创业训练计划-基于自然语言处理的智能阅读模型(201810356020)

Multi-document Automatic Summarization Based on Sparse Representation

QIAN Ling-long, WU Jiao, WANG Ren-feng, LU Hui-juan   

  1. China Jiliang University,Hangzhou 310018,China
  • Online:2020-11-15 Published:2020-11-17
  • About author:QIAN Ling-long,born in 1996,postgraduate,is a member of China Computer Federation.His main research interest include national language processing,knowledge graph and explainable AI.
    LU Hui-juan,born in 1962,professor,is director and an outstanding member of China Computer Federation.Her main research interest include machine learning,deep learning and big data.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61272315,61602431),Natural Science Foundation of Zhejiang province,China (LQ20F030015) and National College Student Innovation and Entrepreneurship Training Program (201810356020).

摘要: 文档自动摘要是自然语言处理领域中的重要任务,受限于难以准确理解文档语义,大多通过词频、关键词等人工特征对文档句子进行重要程度排序,以此提取摘要。受稀疏表示理论启发,提出了一种基于稀疏表示的动态语义空间划分算法。算法对初始划分的语义子空间进行字典学习,利用所得字典对所有句向量进行稀疏重构,从而将各句向量动态调整至重构误差最小的划分,迭代地实现语义空间的重划分。对于划分后语义子空间内摘要句的提取,提出了一种基于稀疏相似度排序的自动摘要提取算法。将各语义子空间的所有句向量作为字典原子,通过稀疏重构,得到能体现句子对其他句子语义表征程度的稀疏相似度,以各句累积稀疏相似度作为衡量句子表征空间语义信息能力的指标,依据其排序来提取摘要句。在猫途鹰网站热门景点旅游评论数据集上进行了实验,结果表明语义空间重构误差快速迭代5次即可稳定收敛且平均有效降低重构误差约17%,且算法对数据维度不敏感,所提摘要避免了重复提取冗余度大、重复性高的文本,是一种有效的自动摘要方法。

关键词: 稀疏重构, 字典学习, 自动摘要

Abstract: Automatic document summary is an important task in the field of natural language processing.Limited by the difficulty of accurately understanding the semantics of documents,most of the documents are sorted by artificial features,such as word frequency and keywords,to extract the abstract.Inspired by the theory of sparse representation,a dynamic semantic space partition algorithm based on sparse representation is proposed.The algorithm performs dictionary learning on the initially divided semantic subspace,uses the obtained dictionary to sparsely reconstruct the sentence vector.Dynamically adjusts it to the division which has the smallest reconstruction error.Iteratively realizes the re-division of the semantic space.For abstracting sentences in the divided semantic subspace,an automatic extraction algorithm based on sparse similarity ranking is proposed.All sentence vectors in each semantic subspace are viewed as dictionary atoms.Through sparse reconstruction,the sparse similarity can be obtained which reflects the degree of semantic representation of one sentences to others.The cumulative sparse similarity of each sentence to other sentences is used as a metric to measure the ability of the sentence to represent the spatial semantic information.Ranking the cumulative sparse similarity,and then extract the required top N sentences.The experimental results on the travel review data set of popular attractions on the TripAdvisor website show that the semantic space reconstruction error can be rapidly reduced after5 iterations,remain stable which shows the convergence.Except for effectively reduce the reconstruction error by nearly 17%,the algorithm is also not sensitive to data dimensions.The proposed summary avoids repeated abstraction of redundant and highly repetitive text,which is an effective multi-document automatic summarization method.

Key words: Automatic summarization, Dictionary learning, Sparse reconstruction

中图分类号: 

  • TP391.1
[1] ZHANG C.Text summary algorithm based on semantic reconstruction[D].Nanjing:Nanjing University,2016.
[2] ALLAHYARI M,POURIYEH S,ASSEFI M,et al.Text Summarization Techniques:A Brief Survey[J].International Journal of Advanced Computer Science & Applications,2017,8(10):397-405.
[3] FERILLI S,PAZIENZA A.An Abstract Argumentation-Based Approach to Automatic Extractive Text Summarization[C]//Italian Research Conference on Digital Libraries.Springer,Cham,2018.
[4] LIU H,YU H,DENG Z H.Multi-document summarizationbased on two-level sparse representation model[C]//AAAI.2015:196-202.
[5] HE R,TANG J,GONG P,et al.Multi-document summarization via group sparse learning[J].Information Sciences,2016,349:12-24.
[6] HE Z,CHEN C,BU J,et al.Document summarization based on data reconstruction[C]//AAAI.2012:620-626.
[7] HE Z,CHEN C,BU J,et al.Unsupervised document summarization from data reconstruction perspective[J].Neurocompu-ting,2015,157:356-366.
[8] JIAO L C.Sparse learning,classification and recognition[M].SCIENCE PRESS,2017.
[9] XIONG X.Research on Extractive Answer Fusion for Q & A Community[D].Harbin:Harbin Institute of Technology,2018.
[10] PETERS M E,NEUMANN M,IYYER M,et al.Deep contextualized word representations[J].arXiv:1802.05365,2018.
[11] DEVLIN J,CHANG M W,LEEK,et al.Bert:Pre-training of deep bidirectional transformers for language understanding[J].arXiv:1810.04805,2018.
[12] CANDES E J,ROMBERG J K.Practical signal recovery from random projection [J].Proc Spie,2005,5674:76-86.
[13] PENG S.Sparse representation coding model and its application in text classification [D].Tianjin:Tianjin University,2015.
[14] DAVENPORT M A,DUARTE M F,ELDAR Y C,et al.Introduction to Compressed Sensing[M]//Compressed Sensing:Theory and Applications.Cambridge:Cambridge University Press,2012.
[15] AHARON M,ELAD M,BRUCKSTEIN A.K-SVD:an algo-rithm for designing overcomplete dictionaries for sparse representation[J].IEEE Transactions on Image Processing,2006,54(11):4311-4322.
[16] PATI Y C,REZAIITAR R,KRISHNAPRASAD P S.Orthogonal matching pursuit:recursive function approximation with applications to wavelet decomposition[C]//Proceeding of the 27thAsilomar Conference on Signals,Systems and Computers.1993:40-44.
[17] DAVIS G,MALLAT S,AVELLANEDAM.Adaptive greedyapproximation[J].Constructive Approximation,1997,13 (1):57-98.
[18] CHENG H,LIU Z,HOU L,et al.Sparsity-Induced Similarity Measure and Its Applications[J].IEEE Transactions on Circuits &Systems for Video Technology,2016,26(4):613-626.
[1] 周蔚, 王兆毓, 魏斌.
面向法律裁判文书的生成式自动摘要模型
Abstractive Automatic Summarizing Model for Legal Judgment Documents
计算机科学, 2021, 48(12): 331-336. https://doi.org/10.11896/jsjkx.210500028
[2] 张帆, 贺文琪, 姬红兵, 李丹萍, 王磊.
基于块对角化表示的多视角字典对学习
Multi-view Dictionary-pair Learning Based on Block-diagonal Representation
计算机科学, 2021, 48(1): 233-240. https://doi.org/10.11896/jsjkx.200800211
[3] 田旭, 常侃, 黄升, 覃团发.
基于残差字典及协作表达的单图像超分辨率算法
Single Image Super-resolution Algorithm Using Residual Dictionary and Collaborative Representation
计算机科学, 2020, 47(9): 135-141. https://doi.org/10.11896/jsjkx.190600146
[4] 李金霞, 赵志刚, 李强, 吕慧显, 李明生.
改进的局部和相似性保持特征选择算法
Improved Locality and Similarity Preserving Feature Selection Algorithm
计算机科学, 2020, 47(6A): 480-484. https://doi.org/10.11896/JsJkx.20190800095
[5] 王军浩, 闫德勤, 刘德山, 邢钰佳.
融合极端学习机的判别性分析字典学习算法
Algorithm with Discriminative Analysis Dictionary Learning by Fusing Extreme Learning Machine
计算机科学, 2020, 47(5): 137-143. https://doi.org/10.11896/jsjkx.190600090
[6] 王丽芳, 史超宇, 蔺素珍, 秦品乐, 高媛.
基于联合图像块聚类自适应字典学习的多模态医学图像融合
Multi-modal Medical Image Fusion Based on Joint Patch Clustering of Adaptive Dictionary Learning
计算机科学, 2019, 46(7): 238-245. https://doi.org/10.11896/j.issn.1002-137X.2019.07.036
[7] 杜秀丽, 左思铭, 邱少明.
基于图像灰度熵的自适应字典学习算法
Adaptive Dictionary Learning Algorithm Based on Image Gray Entropy
计算机科学, 2019, 46(5): 266-271. https://doi.org/10.11896/j.issn.1002-137X.2019.05.041
[8] 李秀琴, 王天荆, 白光伟, 沈航.
基于压缩感知的两阶段多目标定位算法
Two-phase Multi-target Localization Algorithm Based on Compressed Sensing
计算机科学, 2019, 46(5): 50-56. https://doi.org/10.11896/j.issn.1002-137X.2019.05.007
[9] 吴晨, 袁昱纬, 王宏伟, 刘宇, 刘思彤, 全吉成.
基于词向量融合的遥感场景零样本分类算法
Word Vectors Fusion Based Remote Sensing Scenes Zero-shot Classification Algorithm
计算机科学, 2019, 46(12): 286-291. https://doi.org/10.11896/jsjkx.181202257
[10] 张真真,王建林.
结合第二代Bandelet变换分块的字典学习图像去噪算法
Dictionary Learning Image Denoising Algorithm Combining Second Generation Bandelet Transform Block
计算机科学, 2018, 45(7): 264-270. https://doi.org/10.11896/j.issn.1002-137X.2018.07.046
[11] 吕巨建, 赵慧民, 陈荣军, 李键红.
基于自适应稀疏邻域重构的无监督主动学习算法
Unsupervised Active Learning Based on Adaptive Sparse Neighbors Reconstruction
计算机科学, 2018, 45(6): 251-258. https://doi.org/10.11896/j.issn.1002-137X.2018.06.045
[12] 李键红,吴亚榕,吕巨建.
基于组稀疏表示的在线单帧图像超分辨率算法
Online Single Image Super-resolution Algorithm Based on Group Sparse Representation
计算机科学, 2018, 45(4): 312-318. https://doi.org/10.11896/j.issn.1002-137X.2018.04.053
[13] 王铁建,吴飞,荆晓远.
基于多核字典学习的软件缺陷预测
Multiple Kernel Dictionary Learning for Software Defect Prediction
计算机科学, 2017, 44(12): 131-134. https://doi.org/10.11896/j.issn.1002-137X.2017.12.026
[14] 徐煜明,宋佳伟,肖贤建.
基于亚像素块匹配和字典学习的超分辨率算法
Super Resolution Algorithm Based on Sub-pixel Block Matching and Dictionary Learning
计算机科学, 2016, 43(8): 304-308. https://doi.org/10.11896/j.issn.1002-137X.2016.08.062
[15] 施静兰,常侃,张智勇,覃团发.
人脸识别中基于系数相似性的字典学习算法
Coefficient-similarity-based Dictionary Learning Algorithm for Face Recognition
计算机科学, 2016, 43(6): 298-302. https://doi.org/10.11896/j.issn.1002-137X.2016.06.059
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!