Computer Science ›› 2025, Vol. 52 ›› Issue (8): 171-179. doi: 10.11896/jsjkx.240700008
ZHANG Shiju1, GUO Chaoyang2, WU Chengliang2, WU Lingjun2, YANG Fengyu1,2
Abstract: Text clustering groups large volumes of text data by their similarity. It helps reveal the structure and content of the data and uncover patterns and trends, and is widely used in information retrieval, document management, and similar applications. Existing text clustering models over-rely on the quality of the raw data during information extraction and tend to extract key information insufficiently; moreover, data from different categories can overlap in the representation space. To address these problems, this paper proposes a text clustering method based on key-semantics driving and contrastive learning (KSD-CLTC). In the data processing stage, a data augmentation module enriches the raw data to improve generalization, and a key-semantics-driven module extracts keywords from the text to compensate for the loss of key semantic information. In the feature extraction stage, a pretrained model and an autoencoder produce high-quality representations of the data. In the clustering stage, a clustering module combines the clustering loss with the reconstruction loss of the key-semantics-driven module to learn feature representations better suited to clustering, and a contrastive learning module is used to achieve better category separation. Experimental results show that KSD-CLTC outperforms the compared clustering algorithms on both public and industrial datasets; relative to the state-of-the-art SCCL method, it improves ACC by an average of 2.92% and NMI by an average of 1.99% across all datasets. The clustering results also demonstrate the importance of the key-semantics-driven module for text clustering.
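The abstract states that KSD-CLTC jointly optimizes a clustering loss, the reconstruction loss of the key-semantics-driven autoencoder, and a contrastive loss. The paper's exact formulation is not given here, so the following is only an illustrative sketch: it assumes a DEC-style Student's-t soft assignment for the clustering term, mean-squared-error reconstruction, an NT-Xent contrastive term over two augmented views, and hypothetical weights `lam` and `mu` for combining the three.

```python
# Illustrative sketch only -- NOT the paper's exact loss. The DEC-style soft
# assignment, MSE reconstruction, NT-Xent term, and the weights lam/mu are
# assumptions made for illustration.
import numpy as np

def soft_assignment(z, centroids, alpha=1.0):
    """Student's-t soft cluster assignment (DEC-style); returns shape (n, k)."""
    d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def clustering_loss(q):
    """KL(P || Q) against the sharpened target distribution P."""
    p = (q ** 2) / q.sum(axis=0)
    p = p / p.sum(axis=1, keepdims=True)
    return float((p * np.log(p / q)).sum() / q.shape[0])

def reconstruction_loss(x, x_hat):
    """Mean squared error of the autoencoder reconstruction."""
    return float(((x - x_hat) ** 2).mean())

def contrastive_loss(z1, z2, tau=0.5):
    """NT-Xent over two augmented views; positives sit on the diagonal."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = (z1 @ z2.T) / tau                       # (n, n) cosine similarities
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))     # cross-entropy on positives

def total_loss(x, x_hat, z1, z2, centroids, lam=1.0, mu=1.0):
    """Combined objective: clustering + lam * reconstruction + mu * contrastive."""
    q = soft_assignment(z1, centroids)
    return (clustering_loss(q)
            + lam * reconstruction_loss(x, x_hat)
            + mu * contrastive_loss(z1, z2))
```

In this sketch the clustering and contrastive terms both act on the embedding space while the reconstruction term anchors the embeddings to the input, which matches the abstract's description of combining the clustering loss with the key-semantics module's reconstruction loss.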
CLC Number: