Computer Science, 2025, Vol. 52, Issue (6A): 240700105-8. doi: 10.11896/jsjkx.240700105

• Artificial Intelligence •

Study on Algorithm for Keyword Extraction from WeChat Conversation Text

WANG Baohui¹, XU Boren², LI Chang’ao¹, YE Zihao¹

  ¹ College of Software, Beihang University, Beijing 100191, China
  ² School of Computing, Beihang University, Beijing 100191, China
  • Online: 2025-06-16 Published: 2025-06-12
  • About author: WANG Baohui, born in 1973, professor, master's supervisor. His main research interests include network security, big data, and artificial intelligence.

Abstract: WeChat group chats contain a large volume of conversational text, and extracting keywords from these conversations helps reveal group dynamics and topic evolution. Traditional keyword extraction methods perform poorly on WeChat conversations because the messages are short, topics interleave, and the language is informal. To address these challenges, this paper proposes a multi-stage keyword extraction algorithm based on conversation topic clustering. First, we introduce a conversation topic clustering algorithm (single pass using thread segmentation and pre-training knowledge, SPTSPK), which addresses topic interleaving and insufficient information by jointly considering semantic relevance, message activity, and user intimacy. Second, we propose a multi-stage keyword extraction algorithm (MSKE) that decomposes the task into unsupervised keyword extraction and supervised keyword generation, so that both present and absent keywords are obtained from the original text while the number of candidate words and the semantic redundancy are reduced. Finally, we combine SPTSPK with MSKE to extract keywords from WeChat conversation texts. Compared with AutoKeyGen on the WeChat dataset, average F1@5 and F1@O increase by 12.8% and 10.8% respectively, and average R@10 reaches 2.59 times that of AutoKeyGen. Experimental results show that the proposed algorithm can effectively extract keywords from WeChat conversation texts.
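
The clustering idea in the abstract can be illustrated with a minimal single-pass sketch. It is not the paper's SPTSPK implementation: the weighted-sum score, the weights, the time-decay form of "message activity", the participation-based proxy for "user intimacy", the threshold, and all names (`Message`, `Thread`, `score`, `single_pass`) are assumptions made only for illustration, and the two-dimensional vectors stand in for embeddings from a pre-trained model.

```python
import math
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


@dataclass
class Message:
    user: str
    time: float            # seconds since the start of the chat log
    vector: List[float]    # stand-in for a pre-trained sentence embedding


@dataclass
class Thread:
    messages: List[Message] = field(default_factory=list)

    def centroid(self) -> List[float]:
        # Mean of the message vectors currently in the thread.
        n = len(self.messages)
        dims = len(self.messages[0].vector)
        return [sum(m.vector[i] for m in self.messages) / n for i in range(dims)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def score(msg: Message, thread: Thread,
          weights: Tuple[float, float, float] = (0.6, 0.2, 0.2),
          half_life: float = 300.0) -> float:
    # Assumed scoring: a weighted sum of three signals, each in [0, 1].
    semantic = cosine(msg.vector, thread.centroid())
    # Assumed "message activity": exponential decay with the time gap
    # since the thread's latest message.
    gap = msg.time - thread.messages[-1].time
    activity = 0.5 ** (max(gap, 0.0) / half_life)
    # Assumed "user intimacy" proxy: 1 if the sender already took part.
    intimacy = 1.0 if msg.user in {m.user for m in thread.messages} else 0.0
    w_sem, w_act, w_int = weights
    return w_sem * semantic + w_act * activity + w_int * intimacy


def single_pass(messages: List[Message], threshold: float = 0.5) -> List[Thread]:
    # Single pass: each message is seen once, in time order, and either
    # joins the best-scoring existing thread or opens a new one.
    threads: List[Thread] = []
    for msg in sorted(messages, key=lambda m: m.time):
        best = max(threads, key=lambda t: score(msg, t), default=None)
        if best is not None and score(msg, best) >= threshold:
            best.messages.append(msg)
        else:
            threads.append(Thread([msg]))
    return threads


chat = [
    Message("alice", 0.0, [1.0, 0.0]),
    Message("bob", 30.0, [0.9, 0.1]),
    Message("carol", 40.0, [0.0, 1.0]),   # interleaved message on another topic
    Message("alice", 60.0, [1.0, 0.1]),
]
threads = single_pass(chat)
for i, t in enumerate(threads):
    print(i, [m.user for m in t.messages])
```

In this toy log the interleaved message from the third user lands in its own thread while the other three are grouped together, which is the behaviour a topic-clustering stage is meant to provide before keywords are extracted per thread.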

Key words: Text clustering, Text generation, Conversation topic clustering, Keyword extraction

CLC Number: TP391