Computer Science, 2025, Vol. 52, Issue (8): 171-179. doi: 10.11896/jsjkx.240700008

• Database & Big Data & Data Science

Text Clustering Approach Based on Key Semantic Driven and Contrastive Learning

ZHANG Shiju1, GUO Chaoyang2, WU Chengliang2, WU Lingjun2, YANG Fengyu1,2   

  1 College of Software, Nanchang Hangkong University, Nanchang 330000, China
  2 Jiangxi Province Aviation Manufacturing Digital Simulation Engineering Research Center, Nanchang 330000, China
  • Received:2024-07-01 Revised:2024-09-25 Online:2025-08-15 Published:2025-08-08
  • About author: ZHANG Shiju, born in 1997, postgraduate, is a member of CCF (No. I8208G). His main research interest is natural language processing.
    YANG Fengyu, born in 1980, associate professor, is a member of CCF (No. 37982S). His main research interest is analysis and mining of physical quality data for aviation products.
  • Supported by:
    Key Research and Development Program of Jiangxi Province(20202BBEL53002).

Abstract: Text clustering groups large volumes of text by similarity. It helps reveal the structure and content of text data and uncover patterns and trends in it, and is widely used in information retrieval and document management. Existing text clustering models rely too heavily on the quality of the original data and extract key information insufficiently, and data of different categories overlap in the representation space. To address these problems, a text clustering method based on key semantic-driven and contrastive learning (KSD-CLTC) is proposed. During data processing, a data augmentation module enriches the original data to improve generalization, and a key semantic-driven module extracts keywords from the text to compensate for the loss of key semantic information. During feature extraction, a pre-trained model and an autoencoder produce high-quality representations of the data. During cluster learning, the clustering loss is combined with the reconstruction loss of the key semantic-driven module to learn feature representations better suited to clustering, and a contrastive learning module is used to achieve better classification results. KSD-CLTC outperforms the compared clustering algorithms on both public and industrial datasets, improving ACC by an average of 2.92% and NMI by an average of 1.99% across all datasets relative to the state-of-the-art SCCL method. The clustering results also demonstrate the importance of key semantic driving for text clustering.
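The ACC and NMI figures quoted in the abstract are the two standard external clustering metrics. The abstract does not describe the evaluation protocol, so the following is only a minimal sketch of the textbook definitions, not the authors' evaluation code. The function names `clustering_accuracy` and `normalized_mutual_info` are hypothetical, the arithmetic-mean normalization of NMI is an assumption (other normalizations exist), and the best cluster-to-label mapping is found by brute force rather than by the Hungarian algorithm that is normally used for larger numbers of clusters.

```python
from collections import Counter
from itertools import permutations
from math import log


def clustering_accuracy(y_true, y_pred):
    """Accuracy under the best one-to-one mapping of cluster ids to labels."""
    pair_counts = Counter(zip(y_pred, y_true))
    clusters = sorted(set(y_pred))
    classes = sorted(set(y_true))
    # Pad the shorter side with placeholders so the mapping stays one-to-one.
    size = max(len(clusters), len(classes))
    clusters += [None] * (size - len(clusters))
    classes += [None] * (size - len(classes))
    # Brute force over all mappings: only practical for a handful of clusters.
    best = max(
        sum(pair_counts[(c, k)] for c, k in zip(clusters, perm))
        for perm in permutations(classes)
    )
    return best / len(y_true)


def normalized_mutual_info(y_true, y_pred):
    """Mutual information divided by the mean of the two label entropies."""
    n = len(y_true)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    joint_counts = Counter(zip(y_true, y_pred))
    mutual_info = sum(
        (c / n) * log(c * n / (true_counts[a] * pred_counts[b]))
        for (a, b), c in joint_counts.items()
    )
    h_true = -sum((c / n) * log(c / n) for c in true_counts.values())
    h_pred = -sum((c / n) * log(c / n) for c in pred_counts.values())
    denom = (h_true + h_pred) / 2  # assumed normalization: arithmetic mean
    # Both partitions trivial (one group each): treat as perfect agreement.
    return 1.0 if denom == 0 else mutual_info / denom


# Toy labels (made up for illustration): same grouping, different cluster ids.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]
print(clustering_accuracy(y_true, y_pred))  # 1.0
print(normalized_mutual_info(y_true, y_pred))
```

Both metrics ignore how the cluster ids are numbered, which is why the relabeled toy prediction still scores as a perfect match on accuracy.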

Key words: Information extraction, Representation space, Text clustering, Key semantic-driven, Contrastive learning

CLC Number: 

  • TP391.9