Study on Algorithm for Keyword Extraction from WeChat Conversation Text

doi:10.11896/jsjkx.240700105

Abstract

WeChat group chats contain a large volume of conversational text, and extracting keywords from it helps in understanding group dynamics and topic evolution. Because WeChat conversation text is short, interleaves topics, and uses informal language, traditional keyword extraction methods perform poorly. To address this, this paper proposes a multi-stage keyword extraction algorithm based on conversation topic clustering. First, we introduce a conversation topic clustering algorithm that incorporates pre-training knowledge (Single Pass Using Thread Segmentation and Pre-training Knowledge, SP_TSPK). By jointly considering semantic relevance, message activity, and user intimacy, it addresses the problems of topic interleaving and insufficient information. Second, we design a multi-stage keyword extraction algorithm (Multi-Stage Keyword Extraction, MSKE) that decomposes the task into unsupervised keyword extraction and supervised keyword generation, extracting both the keywords present in the original text and those absent from it while reducing the candidate set size and semantic redundancy. Finally, we combine SP_TSPK with MSKE to extract keywords from WeChat conversation text. On the WeChat dataset, compared with AutoKeyGen, F₁@5 and F₁@O improve by 12.8% and 10.8% on average, respectively, and R@10 reaches on average 2.59 times that of AutoKeyGen. The experimental results show that the proposed algorithm can effectively extract keywords from WeChat conversation text.

Key words: Text clustering, Text generation, Conversation topic clustering, Keyword extraction
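The abstract reports results as F₁@5, F₁@O, and R@10. As a reading aid, the following is a minimal sketch of how such top-k keyword metrics are commonly computed; it is not taken from the paper. The function names, the lowercase exact-match normalization, and the choice to divide precision by the number of predictions actually kept (rather than by k) are assumptions made for illustration. F₁@O is taken here to mean that the cutoff equals the number of gold keywords for the document.

```python
from typing import List, Sequence, Tuple


def _normalize(words: Sequence[str]) -> List[str]:
    """Lowercase, strip, and de-duplicate keywords while keeping their order."""
    seen = set()
    out = []
    for word in words:
        key = word.strip().lower()
        if key and key not in seen:
            seen.add(key)
            out.append(key)
    return out


def prf_at_k(predicted: Sequence[str], gold: Sequence[str], k: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 over the top-k predictions (exact match).

    Assumption: precision is divided by the number of predictions kept,
    which may be smaller than k when the model returns fewer keywords.
    """
    top_k = _normalize(predicted)[:k]
    gold_set = set(_normalize(gold))
    if not top_k or not gold_set:
        return 0.0, 0.0, 0.0
    hits = sum(1 for word in top_k if word in gold_set)
    if hits == 0:
        return 0.0, 0.0, 0.0
    precision = hits / len(top_k)
    recall = hits / len(gold_set)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def f1_at_k(predicted: Sequence[str], gold: Sequence[str], k: int) -> float:
    """F1 score over the top-k predictions."""
    return prf_at_k(predicted, gold, k)[2]


def f1_at_o(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """F1 with the cutoff set to the number of gold keywords (assumed meaning of F1@O)."""
    return f1_at_k(predicted, gold, len(set(_normalize(gold))))


def recall_at_k(predicted: Sequence[str], gold: Sequence[str], k: int) -> float:
    """Recall over the top-k predictions."""
    return prf_at_k(predicted, gold, k)[1]


if __name__ == "__main__":
    # Hypothetical ranked predictions and gold keywords for one conversation.
    predicted = ["red packet", "group buying", "delivery", "coupon", "weather", "refund"]
    gold = ["group buying", "coupon", "refund"]
    print("F1@5 :", round(f1_at_k(predicted, gold, 5), 3))
    print("F1@O :", round(f1_at_o(predicted, gold), 3))
    print("R@10 :", round(recall_at_k(predicted, gold, 10), 3))
```

Dataset-level scores are usually obtained by averaging these per-document values. Whether the paper averages this way, or applies stemming or other matching rules, is not stated in the abstract.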

CLC number: TP391

WANG Baohui, XU Boren, LI Chang’ao, YE Zihao. Study on Algorithm for Keyword Extraction from WeChat Conversation Text[J]. Computer Science, 2025, 52(6A): 240700105-8. https://doi.org/10.11896/jsjkx.240700105

