Computer Science ›› 2023, Vol. 50 ›› Issue (11): 71-76. doi: 10.11896/jsjkx.220900214

• Database & Big Data & Data Science •

Study on Short Text Clustering with Unsupervised SimCSE

HE Wenhao, WU Chunjiang, ZHOU Shijie, HE Chaoxin   

  1. School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu 610054, China
  • Received: 2022-09-23  Revised: 2023-02-27  Online: 2023-11-15  Published: 2023-11-06
  • About author: HE Wenhao, born in 1997, postgraduate. His main research interests include natural language processing and machine learning. ZHOU Shijie, born in 1970, Ph.D., professor, Ph.D. supervisor, is a member of China Computer Federation. His main research interests include artificial intelligence and network security.

Abstract: When clustering short texts, traditional shallow text clustering methods face challenges such as limited contextual information, irregular word use, and few semantically meaningful words, which lead to sparse embedding representations and make key features difficult to extract. To address these issues, this paper proposes SSKU (SBERT SimCSE Kmeans Umap), a deep clustering model that incorporates simple data augmentation. The model embeds short texts with SBERT and fine-tunes the embedding model using the unsupervised SimCSE method in conjunction with the deep clustering K-Means algorithm, improving the embedding representations so that they are suitable for clustering. To mitigate the sparse features of short texts and optimize the embedding results, the UMAP manifold dimension reduction method is used to learn the local manifold structure. The dimensionality-reduced embeddings are then clustered with the K-Means algorithm to obtain the final clustering results. Extensive experiments are carried out on four publicly available short text datasets, including StackOverFlow and Biomedical, comparing the proposed model with recent deep clustering algorithms. The results show that it achieves good clustering performance on both accuracy and normalized mutual information evaluation metrics.
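
To make the pipeline concrete, below is a minimal Python sketch of the SBERT embedding, UMAP reduction, and K-Means clustering stages described in the abstract, using the sentence-transformers, umap-learn, and scikit-learn libraries. The encoder checkpoint, hyperparameters, and toy texts are illustrative assumptions rather than the authors' published configuration, and the unsupervised SimCSE fine-tuning of the encoder is omitted here.

    # Minimal sketch of the SSKU pipeline stages (assumptions noted above):
    # SBERT embedding -> UMAP dimension reduction -> K-Means clustering.
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans
    import umap

    # Toy short texts standing in for StackOverFlow/Biomedical-style data.
    texts = [
        "how do i sort a list in python",
        "sorting a python list in place",
        "fastest way to reverse a string",
        "gene expression in breast cancer cells",
        "tumor suppressor gene mutation analysis",
        "protein folding and enzyme activity",
    ]

    # 1. Embed the short texts with a pre-trained SBERT encoder (assumed
    #    checkpoint; the paper further fine-tunes it with unsupervised SimCSE).
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = encoder.encode(texts)

    # 2. Learn the local manifold structure with UMAP to densify sparse features.
    reducer = umap.UMAP(n_components=2, n_neighbors=3, metric="cosine", random_state=0)
    reduced = reducer.fit_transform(embeddings)

    # 3. Cluster the dimensionality-reduced embeddings with K-Means.
    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
    labels = kmeans.fit_predict(reduced)
    print(labels)

    # Given gold labels, clustering quality can be scored with accuracy and
    # sklearn.metrics.normalized_mutual_info_score, the metrics used in the paper.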

Key words: Short text, Deep clustering, Pre-training model, Dimension reduction, Natural language processing

CLC Number: TP391
[1] DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding [J]. arXiv:1810.04805, 2018.
[2] GAO T, YAO X, CHEN D. SimCSE: Simple contrastive learning of sentence embeddings [J]. arXiv:2104.08821, 2021.
[3] HU X, ZHANG X, LU C, et al. Exploiting Wikipedia as external knowledge for document clustering [C]//Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2009: 389-396.
[4] BANERJEE S, RAMANATHAN K, GUPTA A. Clustering short texts using Wikipedia [C]//Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 2007: 787-788.
[5] MIKOLOV T, CHEN K, CORRADO G, et al. Efficient estimation of word representations in vector space [J]. arXiv:1301.3781, 2013.
[6] REIMERS N, GUREVYCH I. Sentence-BERT: Sentence embeddings using Siamese BERT-networks [C]//Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019: 3982-3992.
[7] MACQUEEN J. Some methods for classification and analysis of multivariate observations [C]//Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. 1967: 281-297.
[8] CELEUX G, GOVAERT G. Gaussian parsimonious clustering models [J]. Pattern Recognition, 1995, 28(5): 781-793.
[9] XIE J, GIRSHICK R, FARHADI A. Unsupervised deep embedding for clustering analysis [C]//International Conference on Machine Learning. PMLR, 2016: 478-487.
[10] HADIFAR A, STERCKX L, DEMEESTER T, et al. A self-training approach for short text clustering [C]//Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019). 2019: 194-199.
[11] ZHANG D, NAN F, WEI X, et al. Supporting clustering with contrastive learning [C]//Proceedings of NAACL-HLT. 2021.
[12] WANG D, LI T, DENG P, et al. A generalized deep learning algorithm based on NMF for multi-view clustering [J]. IEEE Transactions on Big Data, 2022.
[13] PUGACHEV L, BURTSEV M. Short text clustering with transformers [J]. arXiv:2102.00541, 2021.
[14] MCCONVILLE R, SANTOS-RODRIGUEZ R, PIECHOCKI R J, et al. N2D: (Not too) deep clustering via clustering the local manifold of an autoencoded embedding [C]//2020 25th International Conference on Pattern Recognition (ICPR). IEEE, 2021: 5145-5152.
[15] MCINNES L, HEALY J, MELVILLE J. UMAP: Uniform manifold approximation and projection for dimension reduction [J]. arXiv:1802.03426, 2018.
[16] GUO X F. A study on image clustering algorithms with deep neural networks [D]. Changsha: National University of Defense Technology, 2020.
[17] TENENBAUM J B, SILVA V, LANGFORD J C. A global geometric framework for nonlinear dimensionality reduction [J]. Science, 2000, 290(5500): 2319-2323.
[18] PHAN X H, NGUYEN L M, HORIGUCHI S. Learning to classify short and sparse text & web with hidden topics from large-scale data collections [C]//Proceedings of the 17th International Conference on World Wide Web. 2008: 91-100.
[19] XU J, XU B, WANG P, et al. Self-taught convolutional neural networks for short text clustering [J]. Neural Networks, 2017, 88: 22-31.
[20] RAKIB M R H, ZEH N, JANKOWSKA M, et al. Enhancement of short text clustering by iterative classification [C]//International Conference on Applications of Natural Language to Information Systems. Cham: Springer, 2020: 105-117.
[21] ARORA S, LIANG Y, MA T. A simple but tough-to-beat baseline for sentence embeddings [C]//International Conference on Learning Representations. 2017.
[22] WU X, GAO C, ZANG L, et al. ESimCSE: Enhanced sample building method for contrastive learning of unsupervised sentence embedding [J]. arXiv:2109.04380, 2021.