Computer Science ›› 2025, Vol. 52 ›› Issue (8): 171-179. doi: 10.11896/jsjkx.240700008

• Database & Big Data & Data Science •

  • Corresponding author: YANG Fengyu (99770277@qq.com)
  • About author: (2804286469@qq.com)

Text Clustering Approach Based on Key Semantic Driven and Contrastive Learning

ZHANG Shiju1, GUO Chaoyang2, WU Chengliang2, WU Lingjun2, YANG Fengyu1,2   

  1 College of Software, Nanchang Hangkong University, Nanchang 330000, China
    2 Jiangxi Province Aviation Manufacturing Digital Simulation Engineering Research Center, Nanchang 330000, China
  • Received: 2024-07-01 Revised: 2024-09-25 Online: 2025-08-15 Published: 2025-08-08
  • About author: ZHANG Shiju, born in 1997, postgraduate, is a member of CCF (No.I8208G). His main research interest is natural language processing.
    YANG Fengyu, born in 1980, associate professor, is a member of CCF (No.37982S). His main research interest is analysis and mining of physical quality data for aviation products.
  • Supported by:
    Key Research and Development Program of Jiangxi Province(20202BBEL53002).



Abstract: Text clustering is the process of grouping a large amount of text data according to similarity. It helps to reveal the structure and content of text data and to discover patterns and trends in it, and is widely used in information retrieval and document management. Existing text clustering models over-rely on the quality of the original data and often extract key information insufficiently, and data of different categories overlap in the representation space. To address these problems, a text clustering method based on key semantic driving and contrastive learning (KSD-CLTC) is proposed. In the data processing stage, a data augmentation module enriches the original data to improve generalization, and a key semantic-driven module is designed to extract keywords from the text and compensate for the loss of key semantic information. In the feature extraction stage, a pre-trained model and an autoencoder produce high-quality representations of the data. In the clustering stage, a clustering module combines the clustering loss with the reconstruction loss of the key semantic-driven module to learn feature representations better suited to clustering, and a contrastive learning module is used to achieve a cleaner separation between categories. Experimental results show that KSD-CLTC outperforms the compared clustering algorithms on both public and industrial datasets, improving ACC by an average of 2.92% and NMI by an average of 1.99% across all datasets relative to the state-of-the-art SCCL method. The clustering results also demonstrate the importance of the key semantic-driven module for text clustering.
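The three losses combined in the clustering stage (autoencoder reconstruction, clustering assignment, and contrastive separation) can be sketched roughly as follows. This is an illustrative NumPy toy under common formulations (MSE reconstruction, a DEC-style KL clustering loss, an NT-Xent contrastive loss over augmented pairs), not the paper's exact objective; all names, shapes, and weights are assumptions.

```python
# Hypothetical sketch of a three-part clustering objective: reconstruction
# loss + DEC-style clustering loss + NT-Xent contrastive loss. Illustrative
# only; not the KSD-CLTC formulation.
import numpy as np

def reconstruction_loss(x, x_hat):
    """Mean squared error between inputs and autoencoder reconstructions."""
    return float(np.mean((x - x_hat) ** 2))

def clustering_loss(z, centroids, alpha=1.0):
    """DEC-style KL(P || Q): soft assignments Q from a Student's t kernel,
    sharpened target distribution P."""
    d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1) / 2)
    q /= q.sum(1, keepdims=True)
    p = q ** 2 / q.sum(0)            # sharpen assignments
    p /= p.sum(1, keepdims=True)
    return float(np.sum(p * np.log(p / q)))

def nt_xent(z1, z2, tau=0.5):
    """Contrastive NT-Xent loss: two augmented views of the same text are
    positives; all other texts in the batch act as negatives."""
    z = np.concatenate([z1, z2])
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)   # exclude self-similarity
    n = len(z1)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    logsumexp = np.log(np.exp(sim).sum(1))
    return float(np.mean(logsumexp - sim[np.arange(2 * n), pos]))

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))             # toy batch of text embeddings
x_hat = x + 0.1 * rng.normal(size=x.shape)
z1, z2 = x[:, :4], x[:, :4] + 0.05       # two augmented "views" (toy)
centroids = rng.normal(size=(3, 4))
total = reconstruction_loss(x, x_hat) + clustering_loss(x[:, :4], centroids) \
        + 0.5 * nt_xent(z1, z2)
print(round(total, 4))
```

The 0.5 weight on the contrastive term is an arbitrary placeholder; in practice such weights are tuned per dataset.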

Key words: Information extraction, Representation space, Text clustering, Key semantic-driven, Contrastive learning
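The ACC and NMI figures quoted in the abstract are the two standard unsupervised clustering metrics; a minimal NumPy sketch of how they are typically computed (brute-force permutation matching for ACC, which is only feasible for a small number of clusters; practical evaluation code uses the Hungarian algorithm instead):

```python
# Clustering accuracy (ACC) and normalized mutual information (NMI),
# the two metrics reported in the abstract. Minimal illustrative sketch.
import numpy as np
from itertools import permutations

def acc(y_true, y_pred):
    """Best accuracy over all one-to-one relabelings of predicted clusters."""
    labels = np.unique(y_true)
    best = 0.0
    for perm in permutations(labels):
        mapping = dict(zip(labels, perm))
        mapped = np.array([mapping[c] for c in y_pred])
        best = max(best, float(np.mean(mapped == y_true)))
    return best

def nmi(y_true, y_pred):
    """NMI = 2 * I(T; P) / (H(T) + H(P)), estimated from label counts."""
    n = len(y_true)
    ts, ps = np.unique(y_true), np.unique(y_pred)
    joint = np.array([[np.sum((y_true == t) & (y_pred == p)) for p in ps]
                      for t in ts]) / n
    pt, pp = joint.sum(1), joint.sum(0)
    mask = joint > 0
    mi = np.sum(joint[mask] * np.log(joint[mask] / np.outer(pt, pp)[mask]))
    h = lambda d: -np.sum(d[d > 0] * np.log(d[d > 0]))
    return float(2 * mi / (h(pt) + h(pp)))

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([1, 1, 0, 0, 2, 2])   # same partition, permuted labels
print(acc(y_true, y_pred), nmi(y_true, y_pred))
```

Because both metrics are invariant to how cluster indices are named, the permuted-label example above scores perfectly on both.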

CLC number: TP391.9