计算机科学 ›› 2019, Vol. 46 ›› Issue (6A): 56-59.

• 智能计算 • 上一篇    下一篇

利用整数线性规划自动抽取多样性关键短语

李珊珊, 陈黎, 唐裕婷, 王艺霖, 于中华   

  1. 四川大学计算机学院 成都610065
  • 出版日期:2019-06-14 发布日期:2019-07-02
  • 通讯作者: 于中华(1967-),男,副教授,主要研究方向为自然语言处理,E-mail:yuzhonghua@scu.edu.cn(通信作者)。
  • 作者简介:李珊珊(1989-),女,硕士,主要研究方向为自然语言处理;陈 黎(1977-),女,讲师,主要研究方向为自然语言处理;唐裕婷(1994-),女,硕士生,主要研究方向为自然语言处理;王艺霖(1995-),男,硕士生,主要研究方向为自然语言处理;
  • 基金资助:
    本文受四川省科技支撑项目(2014GZ0063),四川省重点研发项目(2018GZ0182)资助。

Automatic Extraction of Diversity Keyphrase by Utilizing Integer Liner Programming

LI Shan-shan, CHEN Li, TANG Yu-ting, WANG Yi-lin, YU Zhong-hua   

  1. College of Computer Science,Sichuan University,Chengdu 610065,China
  • Online:2019-06-14 Published:2019-07-02

摘要: 关键短语是文本信息的精简概括,能够代表文本的主题和核心观点。而关键短语的自动抽取更是自然语言处理和信息检索的重要任务之一。针对目前无监督方法自动抽取关键短语存在过度生成候选短语语义的问题,提出了一种将整数线性规划和短语语义相似度相结合的自动抽取算法。通过惩罚语义相似度高的候选短语实现目标函数的最大化,以此形成多样性的关键短语。实验利用TextRank和TFIDF算法在两种不同的语料集中分别产生候选短语,并利用提出的优化算法对候选短语的权值得分进行优化。最后将所提算法产生的优化结果与现有多个算法的结果进行了比较。实验结果表明,通过加入相似性度量的惩罚能够有效解决语义过度问题,并获取更多样的关键短语,其优化结果的P,R和F值均高于其他算法。

关键词: 多样性, 关键短语自动抽取, 语义过度生成, 整数线性规划

Abstract: Keyphrases are the concise summary of text information,which can represent the main topics and the core ideas of texts.And the automatic extraction of key phrases is one of the important tasks for natural language processing and information retrieval.Aiming at the existing problem caused by semantic over-generation on candidate phrases with unsupervised method,this paper proposed an algorithm for automaticextraction of keyphrase by using integer linear programming (ILP) and similarity of candidate phrases,in which candidate phrases with high sematic similarity are punished for maximizing the object function to obtain diversified keyphrases.TextRand and TFIDF algorithms are applied in the proposed method to create candidate phrases based on two different corpus sets and the proposedoptimization algorithm is utilized to optimize the weight scores of candidate phrases.Finally,the results of the proposed optimization algorithm is compared with the ones of baseline methods,and the experimental results show that the proposed method can solve the semantic over-generation problem effectively by punishing candidate phrases with high semantic similarity.Moreover,the optimization algorithm can obtain more diverse keyphrases and the optimized results of P,R and F value outperform the ones of baseline methods.

Key words: Automatic keyphrase extraction, Diversity, Integer liner programming, Semantic over-generation

中图分类号: 

  • TP309
[1]BARKER K,CORNACCHIA N.Using Noun Phrase Heads to Extract Document Keyphrases[C]∥Advances in Artificial Intelligence.Springer Berlin Heidelberg,2000:40-52.
[2]EKMAN P.An argument for basic emotions[J].Cogition and emotion,1992,6(3-4):169-200.
[3]CARAGEA C,BULGAROV F A,GODEA A,et al.Citation-Enhanced Keyphrase Extraction from Research Papers:A Supervised Approach[C]∥Conference on Empirical Methods in Natural Language Processing.2014:1435-1446.
[4]MIHALCEA R,TARAU P.TextRank:Bringing Order into Texts[M].Emnlp,2004:404-411.
[5]WAN X,XIAO J.CollabRank:Towards a Collaborative Ap-proach to SingleDocumentKeyphrase Extraction[C]∥Proceedings of the Conference International Conference on Computational Linguistics,COLING 2008.Manchester,Uk.DBLP,2008:969-976.
[6]LIU Z,LIANG C,SUN M.Topical Word Trigger Model for Keyphrase Extraction[C]∥COLING.2012:1715-1730.
[7]NGUYEN T D,KAN M Y.Keyphrase extraction in scientific publications[C]∥International Conference on Asian Digital Libraries:Looking Back 10 Years and Forging New Frontiers.Springer-Verlag,2007:317-326.
[8]TOMOKIYO T,HURST M.A language model approach to keyphrase extraction[C]∥ACL 2003 Workshop on Multiword Expressions:Analysis,Acquisition and Treatment.Association for Computational Linguistics,2003:33-40.
[9]HASAN K S,NG V.Automatic Keyphrase Extraction:A Survey of the State of the Art[C]∥Meeting of the Association for Computational Linguistics.2014:1262-1273.
[10]WITTEN I H,PAYNTER G W,FRANK E,et al.KEA:practical automatic keyphrase extraction[C]∥ACM Conference on Digital Libraries.Berkeley,CA,USA.CiteSeer,1999:254-255.
[11]TURNEY P D.Coherent keyphrase extraction via Web mining [C]∥International Joint Conference on Artificial Intelligence.Morgan Kaufmann Publishers Inc,2003:434-439.
[12]FLORESCU C,CARAGEA C.A Position-Biased PageRank Algorithm for Keyphrase Extraction[C]∥Proceedings of the American Association for Artificial Intelligence.2017.
[13]HASAN K S,NG V.Automatic Keyphrase Extraction:A Survey of the State of the Art[C]∥Meeting of the Association for Computational Linguistics.2014:1262-1273.
[14]BOUDIN F.Reducing Over-generation Errors for Automatic Keyphrase Extraction using Integer Linear Programming[C]∥ACL 2015 Workshop on Novel Computational Approaches to Keyphrase Extraction.China,2015.
[15]HULTH A.Improved automatic keyword extraction given more linguistic knowledge[C]∥Conference on Empirical Methods in Natural Language Processing.Association for Computational Linguistics,2003:216-223.
[16]KIM S N,MEDELYANO,KAN M Y,et al.SemEval-2010 task 5:Automatic keyphrase extraction from scientific articles[C]∥International Workshop on Semantic Evaluation.Association for Computational Linguistics,2010:21-26.
[17]LE T T N,NGUYEN M L,SHIMAZU A.Unsupervised Keyphrase Extraction:Introducing New Kinds of Words to Keyphrases[C]∥Australasian Joint Conference on Artificial Intelligence.Springer International Publishing,2016:665-671.
[1] 王宇飞, 陈文.
基于DECORATE集成学习与置信度评估的Tri-training算法
Tri-training Algorithm Based on DECORATE Ensemble Learning and Credibility Assessment
计算机科学, 2022, 49(6): 127-133. https://doi.org/10.11896/jsjkx.211100043
[2] 陈壮, 邹海涛, 郑尚, 于化龙, 高尚.
基于用户覆盖及评分差异的多样性推荐算法
Diversity Recommendation Algorithm Based on User Coverage and Rating Differences
计算机科学, 2022, 49(5): 159-164. https://doi.org/10.11896/jsjkx.210300263
[3] 刘意, 毛莺池, 程杨堃, 高建, 王龙宝.
基于邻域一致性的异常检测序列集成方法
Locality and Consistency Based Sequential Ensemble Method for Outlier Detection
计算机科学, 2022, 49(1): 146-152. https://doi.org/10.11896/jsjkx.201000156
[4] 周钢, 郭福亮.
基于特征选择的高维数据集成学习方法研究
Research on Ensemble Learning Method Based on Feature Selection for High-dimensional Data
计算机科学, 2021, 48(6A): 250-254. https://doi.org/10.11896/jsjkx.200700102
[5] 朱丽花, 王玲, 唐麒, 魏急波.
一种针对动态部分可重构SoC软硬件划分的高效MILP模型
Efficient MILP Model for HW/SW Partitioning of Dynamic Partial Reconfigurable SoC
计算机科学, 2020, 47(4): 18-24. https://doi.org/10.11896/jsjkx.190300001
[6] 张艳红, 张春光, 周湘贞, 王怡鸥.
项目多属性模糊联合的多样性视频推荐算法
Diverse Video Recommender Algorithm Based on Multi-property Fuzzy Aggregate of Items
计算机科学, 2019, 46(8): 78-83. https://doi.org/10.11896/j.issn.1002-137X.2019.08.012
[7] 张学扶, 曾攀, 金敏.
相关性和相似度联合的癌症分类预测
Cancer Classification Prediction Model Based on Correlation and Similarity
计算机科学, 2019, 46(7): 300-307. https://doi.org/10.11896/j.issn.1002-137X.2019.07.046
[8] 关晓蔷, 庞继芳, 梁吉业.
基于类别随机化的随机森林算法
Randomization of Classes Based Random Forest Algorithm
计算机科学, 2019, 46(2): 196-201. https://doi.org/10.11896/j.issn.1002-137X.2019.02.030
[9] 常啸林, 樊永文, 朱维军, 刘洋.
基于拟态防御的管理信息系统
Management Information System Based on Mimic Defense
计算机科学, 2019, 46(11A): 438-441.
[10] 石进平,李劲,和凤珍.
基于社交关系和用户偏好的多样性图推荐方法
Diversity Recommendation Approach Based on Social Relationship and User Preference
计算机科学, 2018, 45(6A): 423-427.
[11] 庞博,金乾坤,合尼古力·吾买尔,齐兴斌.
软件定义网络中基于网络切片和ILP模型的路由方案
Routing Scheme Based on Network Slicing and ILP Model in SDN
计算机科学, 2018, 45(4): 143-147. https://doi.org/10.11896/j.issn.1002-137X.2018.04.023
[12] 单天羽, 管煜旸.
基于种群多样性的可变种群缩减差分进化算法
Differential Evolution Algorithm with Adaptive Population Size Reduction Based on Population Diversity
计算机科学, 2018, 45(11A): 160-166.
[13] 曹敏姿, 张琳琳, 毕雪华, 赵楷.
个性化(α,l)-多样性k-匿名隐私保护模型
Personalized (α,l)-diversity k-anonymity Model for Privacy Preservation
计算机科学, 2018, 45(11): 180-186. https://doi.org/10.11896/j.issn.1002-137X.2018.11.028
[14] 何旭,景小宁,冯超,程越.
基于种群多样性的FPSO算法在空中加油区域配置中的应用
Diversity-guided FPSO Algorithm for Solving Air Refueling Region Deplaying Problem
计算机科学, 2017, 44(Z11): 547-551. https://doi.org/10.11896/j.issn.1002-137X.2017.11A.116
[15] 焦重阳,周清雷,张文宁.
混合拓扑结构的粒子群算法及其在测试数据生成中的应用研究
MPSO and Its Application in Test Data Automatic Generation
计算机科学, 2017, 44(12): 249-254. https://doi.org/10.11896/j.issn.1002-137X.2017.12.045
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!