Computer Science ›› 2023, Vol. 50 ›› Issue (11): 71-76. doi: 10.11896/jsjkx.220900214
贺文灏, 吴春江, 周世杰, 何朝鑫
HE Wenhao, WU Chunjiang, ZHOU Shijie, HE Chaoxin
Abstract: When clustering short texts, traditional shallow text clustering methods face challenges such as limited contextual information, non-standard wording, and few semantically meaningful words, which lead to sparse embedding representations and difficulty in extracting key features. To address these problems, this paper proposes SSKU (SBERT SimCSE K-means Umap), a deep clustering model that incorporates a simple data augmentation method. The model uses SBERT to embed short texts, and fine-tunes the embedding model with the unsupervised SimCSE method jointly with the deep-clustering K-Means algorithm, improving the short-text embeddings so that they are better suited to clustering. The UMAP manifold dimensionality-reduction method is then used to learn the local manifold structure of the embeddings, mitigating the feature-sparsity problem of short texts and refining the embedding results. Finally, the K-Means algorithm clusters the reduced embeddings to obtain the clustering result. Extensive experiments on four public short-text datasets, including StackOverflow and Biomedical, with comparisons against recent deep clustering algorithms, show that the proposed model achieves good clustering performance on both evaluation metrics: accuracy and normalized mutual information.
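The pipeline described above follows an embed → reduce → cluster → evaluate shape. The following is a minimal, self-contained sketch of that shape, not the paper's implementation: the SBERT embeddings (normally produced with the sentence-transformers package) are replaced by TF-IDF vectors, UMAP (normally the umap-learn package) is replaced by TruncatedSVD so the example runs with scikit-learn alone, and the SimCSE fine-tuning step is omitted entirely. The toy texts and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

# Toy short-text corpus with two ground-truth topics (hypothetical data).
texts = [
    "how to sort a list in python",
    "python list sorting example",
    "sql inner join two tables",
    "how to join tables in sql",
]
labels = [0, 0, 1, 1]

# Step 1: embed the short texts (stand-in for SBERT sentence embeddings).
X = TfidfVectorizer().fit_transform(texts)

# Step 2: reduce to a low-dimensional space (stand-in for UMAP manifold reduction).
X_low = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

# Step 3: cluster the reduced embeddings with K-Means.
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_low)

# Step 4: evaluate with normalized mutual information, one of the paper's metrics.
nmi = normalized_mutual_info_score(labels, pred)
```

Swapping the stand-ins for real SBERT embeddings and UMAP changes only steps 1 and 2; the clustering and evaluation code is unchanged, which is what makes the pipeline easy to ablate component by component.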