计算机科学 ›› 2021, Vol. 48 ›› Issue (7): 137-144.doi: 10.11896/jsjkx.200800190

• 数据库&大数据&数据科学 • 上一篇    下一篇

基于高斯场和自适应图正则的半监督聚类

赵敏, 刘惊雷   

  1. 烟台大学计算机与控制工程学院 山东 烟台264005
  • 收稿日期:2020-08-28 修回日期:2020-10-19 出版日期:2021-07-15 发布日期:2021-07-02
  • 通讯作者: 刘惊雷(jinglei_liu@sina.com)
  • 基金资助:
    国家自然科学基金(61572419,61773331,61801414,62072391)

Semi-supervised Clustering Based on Gaussian Fields and Adaptive Graph Regularization

ZHAO Min, LIU Jing-lei   

  1. School of Computer and Control Engineering,Yantai University,Yantai,Shandong 264005,China
  • Received:2020-08-28 Revised:2020-10-19 Online:2021-07-15 Published:2021-07-02
  • About author:ZHAO Min,born in 1995,postgraduate.Her main research interests include semi-supervised clustering and so on.(ytdxzhaomin@163.com)
    LIU Jing-lei,born in 1970,Ph.D,professor,master supervisor.His main research interests include artificial intelligent and theoretical computer science.
  • Supported by:
    National Natural Science Foundation of China(61572419,61773331,61801414,62072391).

摘要: 聚类是将给定的样本分成几个不同的簇,它在机器学习、数据挖掘等领域得到了广泛应用,并受到研究人员的广泛关注。但是,传统的聚类方法仍然存在3个方面的不足。首先,由于一些数据中存在噪声和异常值,传统的聚类方法容易产生误差较大的目标函数。其次,传统的聚类方法没有使用监督信息来指导构建相似矩阵。最后,加入图正则的聚类方法在计算相似度矩阵时,邻居关系都是确定的,一旦计算错误就会导致构造图的质量低,进而影响聚类性能。因此,提出了一种基于高斯场和自适应图正则化的半监督聚类(SCGFAG)模型。该模型通过高斯场及谐波函数法引入监督信息,来指导构建相似度矩阵,实现半监督学习,还引入稀疏误差矩阵来表示稀疏噪声,如脉冲噪声、死线和条纹,并且使用l1范数来缓解稀疏噪声。此外,所提模型还引入l2,1范数来处理异常值的影响。因此,SCGFAG对数据噪声和异常值不敏感。更重要的是,SCGFAG通过引入自适应图的正则化提高了聚类性能。为了实现优化聚类的目标,提出了一种迭代更新算法—增广拉格朗日法(Augmented Lagrangian Method,ALM),分别对优化变量进行更新。在4个数据集上进行的实验表明,所提方法优于相比较的8种经典聚类方法获得了更好的聚类性能。

关键词: l2,1的旋转不变性, 半监督聚类, 噪声和异常值, 增广拉格朗日法, 自适应图正则

Abstract: Clustering is to divide a given sample into several different clusters,which is a widely used tool,has been applied in machine learning,data mining and so on,and has received extensive concern by researchers.However,there are still three main limitations.Firstly,usually there are noises and outliers in the data,which will bring about significant errors in the clustering results.Secondly,traditional clustering methods do not use supervision information to guide the construction of similarity matrices.Finally,in the graph-based clustering method,when constructing graphs,the neighbor relationship is determined.Once the calculation is wrong, it will result in poor quality of the constructed graph,which will affect the clustering performance.Therefore,a semi-supervised clustering model based on Gaussian field and adaptive graph regularization (SCGFAG) is proposed in this paper. In this model,supervised information is introduced by gaussian field and harmonic function to guide the construction of similarity matrix to realize semi-supervised learning.Sparse error matrix is introduced to represent sparse noise,such as impulse noise,dead line,stripes,and l1 norm is introduced to alleviate the sparse noise.In addition,the l2,1 norm is also introduced by the proposed model to mitigate the effects of outliers.Therefore,our SCGFAG is insensitive to data noise and outliers.More importantly,the regularization of adaptive graph is introduced into SCGFAG to improve the clustering performance.In order to realize the goal of optimization clustering,an iterative updating algorithm-Augmented Lagrangian Method (ALM) is proposed to update the optimization variables respectively.Experimental results on four datasets show that the proposed method is superior to the eight classical clustering methods,and has better clustering performance.

Key words: Adaptive graph regularization, Augmented Lagrangian method, Noise and outliers, Rotation invariance property of l2,1, Semi-supervised clustering

中图分类号: 

  • TP311
[1]RAO S R,TRON R,VIDAL R,et al.Motion segmentation in the presence of outlying,incomplete,or corrupted trajectories[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2010,32(10):1832-1845.
[2]SHI J,MALIK J.Normalized cuts and image segmentation[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2000,22(8):888-905.
[3]LIU G,LIN Z,YAN S,et al.Robust Recovery of SubspaceStructures by Low-Rank Representation[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2013,35(1):171-184.
[4]HARTIGAN J A,WONG M A.A K-means Clustering Algorithm:Algorithm AS 136[J].Applied Statistics,1979,28(1):100-108.
[5]VON LUXBURG U.A tutorial on spectral clustering[J].Statistics and Computing,2007,17(4):395-416.
[6]ZHOU S H,ZHU E,LIU X W,et al.Subspace segmentation-based robust multiple kernel clustering[J].Information Fusion,2020,53:145-154
[7]ZHOU S,LIU X,ZHU C,et al.Spectral clustering-based local and global structure preservation for feature selection[C]//International Joint Conference on Neural Networks.IEEE,2014.
[8]DING C,LI T,JORDAN M.Convex and semi-nonnegative matrix factorizations[J].IEEE Transactions on Software Enginee-ring,2010,32(1):45-55.
[9]ZHANG Y,KONG X W,WANG Z F,et al.Cluster analysis based on multi-view matrix Decomposition[J].Acta Automata Sinica,2018,44(12):2160-2169.
[10]DING Y,LI Y Z.Intrusion detection Algorithm based on PCA and Semi-supervised clustering[J].Journal of Shandong University (Engineering Science),2012,42(5):41-46.
[11]BASU S,BILENKO M,MOONEY R J,et al.A probabilisticframework for semi-supervised clustering[C]//Knowledge Discovery and Data Mining.2004:59-68.
[12]LIU H.Constrained nonnegative matrix factorization for image representation[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2012,34(7):1299-1311.
[13]CAI D,HE X,HAN J,et al.Graph regularized non-negative matrix factorization for data representation[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2011,33(8):1548-1560.
[14]HUANG J, NIE F P,HUANG H,et al.Robust Manifold Nonnegative Matrix Factorization[J].ACM Transactions on Know-ledge Discovery from Data,2014,8(3):1-21.
[15]ZENG K,YU J,LI C,et al.Image clustering by hyper-graphregularized non-negative matrix factorization[J].Neurocompu-ting,2014,138:209-217.
[16]ZHANG X.Non-negative low rank and sparse graph for semi-supervised learning[C]//Computer Vision & Pattern Recognition.IEEE,2012.
[17]NIE F,XU D,TSANG W H,et al.Flexible manifold embedding:a framework for semi-Supervised and unsupervised dimension reduction[J].IEEE Transactions on Image Processing,2010,19(7):1921-1932.
[18]ZHU X,GHAHRAMANI Z,LAFFERTY J D.Semi-supervised learning using Gaussian fields and Harmonic functions[C]//Machine Learning,Proceedings of the Twentieth International Conference (ICML 2003).Washington DC,USA,2003.
[19]LU G,WANG Y,ZOU J.Low-rank matrix factorization withadaptive graph regularizer [J].IEEE Transactions on Image Processing,2016,25(5):2196-2205.
[20]ZHANG L,ZHANG Q,DU B,et al.Adaptive manifold regulari-zed matrix factorization for data clustering[C]//Twenty-sixth International Joint Conference on Artificial Intelligence.2017.
[21]HE F,NIE F,WANG R,et al.Fast Semi-supervised learning with bipartite graph for large-scale data[J].IEEE Transactionson Neural Networks,2020,31(2):626-638.
[22]LI Z,LIU J,TANG J.Robust Structured Subspace Learning for Data Representation [J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2015,37(10):2085-2098.
[23]BHARDWAJ A,RAMAN S.Robust PCA-based solution to ima-ge composition using augmented Lagrange multiplier (ALM)[J].Visual Computer,2016,32(5):591-600.
[24]TOH K C,YUN S.An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems[J].Pacific Journal of Optimization,2010,6(3):615-640.
[25]YANG J,YUAN X.Linearized augmented Lagrangian and alternating direction methods for nuclear norm minimization[J].Mathematics of Computation,2012,82(281):301-329.
[26]LEE D D,SEUNG H S.Algorithms for Non-negative MatrixFactorization[C]//Neural Information Processing Systems.2000:556-562.
[27]PENG C,KANG Z,HU Y,et al.Robust graph regularized nonnegative matrix factorization for clustering[J].ACM Transactions on Knowledge Discovery from Data,2017,11(3):33.
[28]HE R,ZHENG W S,HU B G,et al.Nonnegative sparse coding for discriminative semi-supervised learning[C]//IEEE Confe-rence on Computer Vision & Pattern Recognition.IEEE,2011.
[29]JIA Y H,KWONG S,HOU J H,et al.Semi-Supervised Non-Negative Matrix Factorization with Dissimilarity and Similarity Regularization [J].IEEE Transaction on Neural Networks and Learning Systems,2020,31(7):2510-2521.
[1] 杨帆, 王俊斌, 白亮.
基于安全性的成对约束扩充算法
Extended Algorithm of Pairwise Constraints Based on Security
计算机科学, 2020, 47(9): 324-329. https://doi.org/10.11896/jsjkx.200700092
[2] 秦悦, 丁世飞.
半监督聚类综述
Survey of Semi-supervised Clustering
计算机科学, 2019, 46(9): 15-21. https://doi.org/10.11896/j.issn.1002-137X.2019.09.002
[3] 王楠, 孙善武.
基于半监督聚类分析的无人机故障识别
UAV Fault Recognition Based on Semi-supervised Clustering
计算机科学, 2019, 46(6A): 192-195.
[4] 贾洪杰, 王良君, 宋和平.
HMRF半监督近似核k-means算法
HMRF Semi-supervised Approximate Kernel k-means Algorithm
计算机科学, 2019, 46(12): 31-37. https://doi.org/10.11896/jsjkx.190600159
[5] 程雪梅,杨秋辉,翟宇鹏,陈伟.
基于半监督聚类方法的测试用例选择技术
Test Case Selection Technique Based on Semi-supervised Clustering Method
计算机科学, 2018, 45(1): 249-254. https://doi.org/10.11896/j.issn.1002-137X.2018.01.044
[6] 梁辰,李成海.
一种新的半监督入侵检测方法
Novel Intrusion Detection Method Based on Semi-supervised Clustering
计算机科学, 2016, 43(5): 87-90. https://doi.org/10.11896/j.issn.1002-137X.2016.05.016
[7] 冯晨菲,杨燕,王红军,徐英歌,王韬.
一种基于数据相关性的半监督模糊聚类集成方法
Semi-supervised Fuzzy Clustering Ensemble Approach with Data Correlation
计算机科学, 2015, 42(6): 41-45. https://doi.org/10.11896/j.issn.1002-137X.2015.06.009
[8] 苏赢彬,杜学绘,夏春涛,曹利峰,陈华成.
基于半监督聚类的文档敏感信息推导方法
Sensitive Information Inference Method Based on Semi-supervised Document Clustering
计算机科学, 2015, 42(10): 132-137.
[9] 邱 烨,何振峰.
面向限制K-means算法的迭代学习分配次序策略
Iterative Learning Assignment Order for Constrained K-means Algorithm
计算机科学, 2012, 39(8): 196-198.
[10] 兰远东,邓辉舫,陈 涛.
基于图收缩的半监督聚类算法
Semi-supervised Clustering Algorithm Based on Graph Contraction
计算机科学, 2012, 39(4): 236-239.
[11] 李岩波.宋琼,郭新辰.
基于流形距离的人工免疫半监督聚类算法
Artificial Immune Clustering Semi-supervised Algorithm Based on Manifold Distance
计算机科学, 2012, 39(11): 204-207.
[12] 崔鹏,张汝波.
一种用于处理高维稀疏数据的半监督聚类算法
Novel Semi-supervised Clustering for High Dimensional Data
计算机科学, 2010, 37(7): 205-207.
[13] 李琳娜,陈海蕊,王映龙.
基于高阶逻辑的复杂结构数据半监督聚类
Semi-supervised Clustering of Complex Structured Data Based on Higher-order Logic
计算机科学, 2009, 36(9): 196-200.
[14] 陆伟宙 余顺争.
基于半监督聚类的Web流量分类

计算机科学, 2009, 36(2): 90-94.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!