Computer Science ›› 2023, Vol. 50 ›› Issue (11): 71-76. doi: 10.11896/jsjkx.220900214

• Database & Big Data & Data Science •

  • Corresponding author: ZHOU Shijie (sjzhou@uestc.edu.cn)
  • About author: HE Wenhao (202022090510@std.uestc.edu.cn)

Study on Short Text Clustering with Unsupervised SimCSE

HE Wenhao, WU Chunjiang, ZHOU Shijie, HE Chaoxin   

  1. School of Information and Software Engineering,University of Electronic Science and Technology of China,Chengdu 610054,China
  • Received:2022-09-23 Revised:2023-02-27 Online:2023-11-15 Published:2023-11-06
  • About author: HE Wenhao, born in 1997, postgraduate. His main research interests include natural language processing and machine learning. ZHOU Shijie, born in 1970, Ph.D, professor, Ph.D supervisor, is a member of China Computer Federation. His main research interests include artificial intelligence and network security.


Abstract: Traditional shallow text clustering methods face challenges such as limited context information, irregular use of words, and few words with actual meaning when clustering short texts, resulting in sparse embedding representations and difficulty in extracting key features. To address these issues, this paper proposes SSKU (SBERT SimCSE K-Means UMAP), a deep clustering model that incorporates a simple data augmentation method. The model uses SBERT to embed short texts and fine-tunes the embedding model with the unsupervised SimCSE method jointly with the deep clustering K-Means algorithm, improving the embedding representations of short texts so that they are better suited to clustering. To mitigate the sparse features of short texts and optimize the embeddings, the UMAP manifold dimensionality reduction method is used to learn the local manifold structure of the embeddings. Finally, the K-Means algorithm clusters the dimensionality-reduced embeddings to obtain the clustering results. Extensive experiments are carried out on four publicly available short text datasets, including StackOverFlow and Biomedical, and the model is compared with the latest deep clustering algorithms. The results show that the proposed model achieves good clustering performance on both the accuracy (ACC) and normalized mutual information (NMI) evaluation metrics.
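The core of the fine-tuning step is the unsupervised SimCSE objective: each sentence is encoded twice with different dropout masks, and the two views of the same sentence form a positive pair in an InfoNCE loss over in-batch negatives. The following is a minimal NumPy sketch of that loss, not the authors' code; the function name and the temperature value are illustrative.

```python
import numpy as np

def simcse_unsup_loss(z1, z2, temperature=0.05):
    """InfoNCE loss over two dropout views z1, z2 of the same batch.

    z1, z2: (batch, dim) arrays; row i of z1 and row i of z2 are two
    encodings of the same sentence (the positive pair). All other rows
    in the batch act as negatives.
    """
    # L2-normalize so dot products are cosine similarities
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / temperature                  # (batch, batch) similarities
    # Cross-entropy with the diagonal (the matching pair) as the target class
    sim = sim - sim.max(axis=1, keepdims=True)     # numerical stability
    log_softmax = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))
```

Minimizing this loss pulls the two dropout views of each sentence together while pushing apart embeddings of different sentences, which is what makes the fine-tuned embeddings more cluster-friendly.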

Key words: Short text, Deep clustering, Pre-training model, Dimension reduction, Natural language processing
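The two evaluation metrics used in the abstract, clustering accuracy (ACC) and normalized mutual information (NMI), are the standard ones in deep clustering work. A generic implementation (not taken from the paper) computes ACC by matching predicted cluster ids to gold labels with the Hungarian algorithm; NMI is available directly in scikit-learn.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """Best-match accuracy: map predicted cluster ids to gold labels
    via the Hungarian algorithm, then score as plain accuracy."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    n = max(y_true.max(), y_pred.max()) + 1
    # cost[p, t] counts samples assigned to cluster p with gold label t
    cost = np.zeros((n, n), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1
    rows, cols = linear_sum_assignment(-cost)      # negate to maximize matches
    return cost[rows, cols].sum() / len(y_true)
```

For example, a prediction that is a pure relabeling of the gold clusters, such as `clustering_accuracy([0, 0, 1, 1], [1, 1, 0, 0])`, scores 1.0 on both ACC and NMI, since both metrics are invariant to cluster id permutations.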

CLC number: TP391

References
[1]DEVLIN J,CHANG M W,LEE K,et al.Bert:Pre-training of deep bidirectional transformers for language understanding[J].arXiv:1810.04805,2018.
[2]GAO T,YAO X,CHEN D.Simcse:Simple contrastive learning of sentence embeddings[J].arXiv:2104.08821,2021.
[3]HU X,ZHANG X,LU C,et al.Exploiting wikipedia as external knowledge for document clustering[C]//Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.2009:389-396.
[4]BANERJEE S,RAMANATHAN K,GUPTA A.Clustering short texts using Wikipedia[C]//Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.2007:787-788.
[5]MIKOLOV T,CHEN K,CORRADO G,et al.Efficient Estimation of Word Representations in Vector Space[J].arXiv:1301.3781,2013.
[6]REIMERS N,GUREVYCH I.Sentence-BERT:Sentence Em-beddings using Siamese BERT-Networks[C]//Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing(EMNLP-IJCNLP).2019:3982-3992.
[7]MACQUEEN J.Some methods for classification and analysis of multivariate observations[C]//5th Berkeley Symposium on Mathematical Statistics and Probability.1967:281-297.
[8]CELEUX G,GOVAERT G.Gaussian parsimonious clustering models[J].Pattern Recognition,1995,28(5):781-793.
[9]XIE J,GIRSHICK R,FARHADI A.Unsupervised deep embedding for clustering analysis[C]//International Conference on Machine Learning.PMLR,2016:478-487.
[10]HADIFAR A,STERCKX L,DEMEESTER T,et al.A self-training approach for short text clustering[C]//Proceedings of the 4th Workshop on Representation Learning for NLP(RepL4NLP-2019).2019:194-199.
[11]ZHANG D,NAN F,WEI X,et al.Supporting Clustering with Contrastive Learning[C]//NAACL-HLT.2021.
[12]WANG D,LI T,DENG P,et al.A Generalized Deep Learning Algorithm based on NMF for Multi-view Clustering[J].IEEE Transactions on Big Data,2022.
[13]PUGACHEV L,BURTSEV M.Short text clustering with transformers[J].arXiv:2102.00541,2021.
[14]MCCONVILLE R,SANTOS-RODRIGUEZ R,PIECHOCKI R J,et al.N2d:(not too) deep clustering via clustering the local manifold of an autoencoded embedding[C]//2020 25th International Conference on Pattern Recognition(ICPR).IEEE,2021:5145-5152.
[15]MCINNES L,HEALY J,MELVILLE J.Umap:Uniform manifold approximation and projection for dimension reduction[J].arXiv:1802.03426,2018.
[16]GUO X F.A Study on Image Clustering Algorithms with Deep Neural Networks[D].Changsha:National University of Defense Technology,2020.
[17]TENENBAUM J B,SILVA V,LANGFORD J C.A global geometric framework for nonlinear dimensionality reduction[J].Science,2000,290(5500):2319-2323.
[18]PHAN X H,NGUYEN L M,HORIGUCHI S.Learning to classify short and sparse text & web with hidden topics from large-scale data collections[C]//Proceedings of the 17th International Conference on World Wide Web.2008:91-100.
[19]XU J,XU B,WANG P,et al.Self-taught convolutional neural networks for short text clustering[J].Neural Networks,2017,88:22-31.
[20]RAKIB M R H,ZEH N,JANKOWSKA M,et al.Enhancement of short text clustering by iterative classification[C]//International Conference on Applications of Natural Language to Information Systems.Cham:Springer,2020:105-117.
[21]ARORA S,LIANG Y,MA T.A simple but tough-to-beat baseline for sentence embeddings[C]//International Conference on Learning Representations.2017.
[22]WU X,GAO C,ZANG L,et al.Esimcse:Enhanced sample building method for contrastive learning of unsupervised sentence embedding[J].arXiv:2109.04380,2021.