Computer Science, 2025, Vol. 52, Issue (6A): 240700105-8. doi: 10.11896/jsjkx.240700105
王宝会1, 许卜仁2, 李长傲1, 叶子豪1
WANG Baohui1, XU Boren2, LI Chang’ao1, YE Zihao1
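Abstract: WeChat groups contain large volumes of conversational text, and extracting keywords from it helps in understanding group dynamics and topic evolution. Because WeChat conversation text is short, interleaves multiple topics, and uses non-standard language, traditional extraction methods perform poorly on it. To address this, a multi-stage keyword extraction algorithm based on conversation topic clustering is proposed. First, a conversation topic clustering algorithm that incorporates pre-trained knowledge (Single Pass Using Thread Segmentation and Pre-training Knowledge, SPTSPK) is proposed. It jointly considers semantic relevance, message activity, and user intimacy, and effectively addresses the problems of interleaved conversation topics and insufficient information. Second, a multi-stage keyword extraction algorithm (Multi-Stage Keyword Extraction, MSKE) is designed. It decomposes the task into unsupervised keyword extraction and supervised keyword generation, effectively extracting keywords that are present in the source text as well as keywords that are absent from it, while reducing the size of the candidate set and its semantic redundancy. Finally, the SPTSPK and MSKE algorithms are combined to extract keywords from WeChat conversation text. On the WeChat dataset, compared with the AutoKeyGen algorithm, F1@5 and F1@O improve by 12.8% and 10.8% on average, respectively, and R@10 reaches on average 2.59 times that of AutoKeyGen. The experimental results show that the algorithm can effectively extract keywords from WeChat conversation text.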
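The metrics reported in the abstract (F1@5, F1@O, R@10) can be illustrated with a minimal sketch. The abstract does not spell out the evaluation protocol, so the following are assumptions: exact string matching after deduplication, precision computed over a fixed cutoff k, and F1@O taken as F1 with the cutoff set to the number of gold keywords. The function names and the toy data are hypothetical and do not come from the paper.

```python
from typing import Sequence, Tuple


def score_at_k(predicted: Sequence[str], gold: Sequence[str], k: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for the top-k predictions.

    Assumptions (not from the paper): exact string match, duplicates
    removed in rank order, precision divided by the fixed cutoff k.
    """
    gold_set = set(gold)
    # Deduplicate while keeping rank order, then keep the first k items.
    top_k = list(dict.fromkeys(predicted))[:k]
    hits = sum(1 for kw in top_k if kw in gold_set)
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def f1_at_k(predicted: Sequence[str], gold: Sequence[str], k: int) -> float:
    return score_at_k(predicted, gold, k)[2]


def f1_at_o(predicted: Sequence[str], gold: Sequence[str]) -> float:
    # Assumed definition: the cutoff O equals the number of gold keywords.
    return f1_at_k(predicted, gold, len(set(gold)))


def recall_at_k(predicted: Sequence[str], gold: Sequence[str], k: int) -> float:
    return score_at_k(predicted, gold, k)[1]


if __name__ == "__main__":
    # Toy ranked predictions and gold keywords for one conversation cluster.
    predicted = ["group buy", "property fee", "parking", "delivery", "repair", "elevator"]
    gold = ["property fee", "parking", "elevator"]
    print("F1@5 :", f1_at_k(predicted, gold, 5))
    print("F1@O :", f1_at_o(predicted, gold))
    print("R@10 :", recall_at_k(predicted, gold, 10))
```

Under these assumptions, recall keeps rising as the cutoff grows while F1 also penalizes incorrect predictions, so a method can differ much more on R@10 than on F1@5.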