Computer Science, 2024, Vol. 51, Issue (11A): 240400011-6. doi: 10.11896/jsjkx.240400011
刘晋霞, 张曦
LIU Jinxia, ZHANG Xi
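Abstract: SimCSE, a contrastive learning method, has shown good performance in text embedding and clustering. This paper aims to optimize the sentence embeddings generated by SimCSE-trained models so that they are suitable for clustering tasks. By combining several algorithms and tuning training parameters, it addresses problems such as the choice of clustering algorithm and the influence of noise and outliers. The paper proposes STK (SimCSE t-SNE KMeans), an unsupervised clustering model that combines KL divergence with the KMeans algorithm. SimCSE first encodes the text. The t-SNE algorithm then reduces the dimensionality of the high-dimensional embeddings; by minimizing the KL divergence, it preserves in the low-dimensional space the similarity relationships among the high-dimensional data points, improving the text embedding representation while reducing its dimensionality. Finally, the KMeans algorithm clusters the reduced embeddings to produce the clustering result. Compared with results obtained using BERT, UMAP, HDBSCAN, and other algorithms, the proposed model achieves better clustering on datasets of patents and papers from the hydrogen production domain, particularly as measured by the silhouette coefficient.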
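The three STK stages described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes scikit-learn's `TSNE`, `KMeans`, and `silhouette_score`, the function name `stk_cluster` is hypothetical, and synthetic Gaussian vectors stand in for real SimCSE sentence embeddings. The number of clusters, the perplexity, and the two-dimensional target space are illustrative choices that do not come from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score


def stk_cluster(embeddings, n_clusters, perplexity=30.0, seed=0):
    """Reduce sentence embeddings with t-SNE, then cluster them with KMeans.

    Hypothetical helper: parameter names and defaults are assumptions.
    """
    # t-SNE fits a low-dimensional layout by minimizing the KL divergence
    # between high- and low-dimensional pairwise similarity distributions.
    reduced = TSNE(
        n_components=2,
        perplexity=perplexity,
        init="pca",
        random_state=seed,
    ).fit_transform(embeddings)

    # KMeans runs on the reduced embeddings, not on the original vectors.
    labels = KMeans(
        n_clusters=n_clusters, n_init=10, random_state=seed
    ).fit_predict(reduced)

    # Silhouette coefficient in [-1, 1]; higher means tighter, better
    # separated clusters.
    score = silhouette_score(reduced, labels)
    return reduced, labels, score


# Stand-in for SimCSE output (assumption): 3 groups of 40 vectors, 64-dim.
# In practice these rows would be sentence embeddings from a SimCSE encoder.
rng = np.random.default_rng(0)
centers = rng.normal(scale=10.0, size=(3, 64))
embeddings = np.vstack([c + rng.normal(size=(40, 64)) for c in centers])

reduced, labels, score = stk_cluster(embeddings, n_clusters=3, perplexity=15.0)
print(reduced.shape, sorted(set(labels.tolist())), float(score))
```

Computing the silhouette coefficient on the reduced embeddings is one possible choice here; the abstract does not say in which space the metric is evaluated.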