Computer Science, 2024, Vol. 51, Issue (11A): 240400011-6. doi: 10.11896/jsjkx.240400011

• Big Data & Data Science •

STK: Clustering Method Based on Contrastive Learning Embedding

LIU Jinxia, ZHANG Xi   

  1. School of Economics and Management, Taiyuan University of Science and Technology, Taiyuan 030024, China
  • Online: 2024-11-16 Published: 2024-11-13
  • About author: LIU Jinxia, born in 1973, Ph.D., associate professor, master's supervisor. Her main research interests include intelligent decision-making, data analysis, and innovative management.
    ZHANG Xi, born in 2000, postgraduate. His main research interests include big data-driven management and decision-making.
  • Supported by:
    Education Reform Projects (JG2023092).

Abstract: SimCSE, a contrastive learning method, performs well in text embedding and clustering. This paper aims to optimize the sentence embeddings generated by SimCSE-trained models so that they are better suited to clustering tasks. By combining multiple algorithms and adjusting training parameters, it addresses the problems of clustering algorithm selection, noise, and outliers. The paper proposes an unsupervised clustering model, SimCSE t-SNE KMeans (STK), that combines KL divergence with the K-Means algorithm. SimCSE first encodes the text, and the t-SNE algorithm then reduces the dimensionality of the high-dimensional embeddings. By minimizing the KL divergence, t-SNE preserves the similarity relationships between high-dimensional data points in the low-dimensional space, so the dimensionality is reduced while the text embedding representation improves. Finally, the K-Means algorithm clusters the reduced embeddings to obtain the clustering results. Compared with the clustering results of algorithms such as BERT, UMAP, and HDBSCAN, the proposed model shows better clustering performance on hydrogen production patent and paper datasets, especially on the Silhouette coefficient evaluation index.
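
The reduction and clustering stages described in the abstract can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' implementation. The function name `stk_cluster`, the perplexity value, the two-dimensional target space, and the choice to compute the Silhouette coefficient on the reduced points are all assumptions. The SimCSE encoding step is replaced here by synthetic Gaussian vectors so that the example runs without downloading a model.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score


def stk_cluster(embeddings, n_clusters, perplexity=30.0, seed=0):
    """Reduce embeddings with t-SNE, cluster with K-Means, score with Silhouette.

    Hypothetical helper: parameter values are illustrative assumptions,
    not settings reported in the paper.
    """
    # t-SNE minimizes the KL divergence between the pairwise similarity
    # distributions of the high- and low-dimensional points.
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca",
                random_state=seed)
    reduced = tsne.fit_transform(embeddings)

    # K-Means runs on the reduced embeddings.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(reduced)

    # Silhouette coefficient, here computed in the reduced space (assumption).
    score = silhouette_score(reduced, labels)
    return reduced, labels, score


# Synthetic stand-in for SimCSE sentence embeddings (assumption):
# three Gaussian blobs in 64 dimensions, 50 points each.
rng = np.random.default_rng(0)
centers = rng.normal(scale=10.0, size=(3, 64))
embeddings = np.vstack([c + rng.normal(size=(50, 64)) for c in centers])

reduced, labels, score = stk_cluster(embeddings, n_clusters=3)
print(reduced.shape, len(set(labels.tolist())), float(score))
```

In practice the `embeddings` array would come from a SimCSE encoder applied to the patent and paper texts, and the number of clusters and the perplexity would need to be tuned for the dataset.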

Key words: SimCSE, Sentence embedding, KL divergence, Clustering, Silhouette coefficient

CLC Number: 

  • TP391