Computer Science ›› 2024, Vol. 51 ›› Issue (11A): 240400011-6. doi: 10.11896/jsjkx.240400011

• Big Data & Data Science •

  • Corresponding author: ZHANG Xi (efdcad@163.com)
  • Author email: liujinxia@tyust.edu.cn

STK:Clustering Method Based on Contrastive Learning Embedding

LIU Jinxia, ZHANG Xi   

  1. School of Economics and Management,Taiyuan University of Science and Technology,Taiyuan 030024,China
  • Online:2024-11-16 Published:2024-11-13
  • About author: LIU Jinxia, born in 1973, Ph.D., associate professor, master's supervisor. Her main research interests include intelligent decision-making, data analysis, and innovative management.
    ZHANG Xi, born in 2000, postgraduate. His main research interests include big data-driven management and decision-making.
  • Supported by:
    Education Reform Project (JG2023092).


Abstract: SimCSE, as a contrastive learning method, has shown good performance in text embedding and clustering. The aim of this paper is to optimize the sentence embeddings generated by SimCSE-trained models to make them suitable for clustering tasks. By combining multiple algorithms and adjusting training parameters, the problems of clustering-algorithm selection and the influence of noise and outliers can be addressed. This paper proposes an unsupervised clustering model, SimCSE t-SNE KMeans (STK), that combines KL divergence with the KMeans algorithm. SimCSE is used to encode the text, and the t-SNE algorithm is then used to reduce the dimensionality of the high-dimensional embeddings. By minimizing the KL divergence, the similarity relationships between high-dimensional data points are preserved in the low-dimensional space, improving the text-embedding representation while reducing dimensionality. Finally, the KMeans algorithm is used to cluster the reduced embeddings and obtain the clustering results. Comparing the clustering results of this study with those obtained by algorithms such as BERT, UMAP, and HDBSCAN shows that the proposed model achieves better clustering performance on patent and paper datasets in the hydrogen production field, especially on the silhouette coefficient evaluation metric.
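The three-stage pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: random vectors stand in for the SimCSE sentence embeddings (in practice these would come from a SimCSE checkpoint, e.g. princeton-nlp/sup-simcse-bert-base-uncased), and the cluster count and t-SNE perplexity are assumed values.

```python
# Sketch of the STK pipeline: (stand-in) SimCSE embeddings -> t-SNE
# reduction (which minimizes KL divergence between pairwise-similarity
# distributions in the high- and low-dimensional spaces) -> KMeans
# clustering, evaluated with the silhouette coefficient.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Stand-in for 768-dimensional SimCSE embeddings of 120 documents,
# drawn from three synthetic groups so there are clusters to recover.
centers = rng.normal(size=(3, 768))
embeddings = np.vstack(
    [c + 0.05 * rng.normal(size=(40, 768)) for c in centers]
)

# Step 2: t-SNE reduces the embeddings to 2-D while preserving
# neighborhood similarity structure via KL-divergence minimization.
low_dim = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)

# Step 3: cluster the reduced embeddings and score the result.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(low_dim)
score = silhouette_score(low_dim, labels)
print(score)
```

Replacing the random stand-in with real SimCSE encodings only changes how `embeddings` is produced; the reduction and clustering steps are unchanged.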

Key words: SimCSE, Sentence embedding, KL divergence, Clustering, Silhouette coefficient

CLC number: TP391