Computer Science (计算机科学) ›› 2019, Vol. 46 ›› Issue (12): 31-37. doi: 10.11896/jsjkx.190600159

• Big Data & Data Science •

  • Corresponding author: JIA Hong-jie, born in 1988, Ph.D, lecturer, CCF member; his main research interests include machine learning and data mining. E-mail: jiahj@ujs.edu.cn.
  • About the authors: WANG Liang-jun, born in 1982, Ph.D, lecturer; his main research interests include image coding and compressed sensing. SONG He-ping, born in 1983, Ph.D, associate professor; his main research interest is intelligent information processing.
  • Supported by:
    This work was supported by the Natural Science Research Project of Jiangsu Higher Education Institutions (18KJB520009, 16KJB520008), the National Natural Science Foundation of China for Young Scientists (61906077, 61601202) and the Natural Science Foundation of Jiangsu Province for Young Scientists (BK20190838, BK20170558).

HMRF Semi-supervised Approximate Kernel k-means Algorithm

JIA Hong-jie, WANG Liang-jun, SONG He-ping   

  1. (School of Computer Science and Communication Engineering,Jiangsu University,Zhenjiang,Jiangsu 212013,China)
  • Received:2019-04-15 Online:2019-12-15 Published:2019-12-17



Abstract: Massive data are produced with the development of information technology. Clustering helps to discover the intrinsic links in data and to extract valuable information from them. In data analysis, it is easy to obtain some background knowledge about the data, and using this limited prior information to guide clustering can significantly improve the clustering results. Semi-supervised clustering based on Hidden Markov Random Fields (HMRF) uses pairwise constraints as supervision information. Although it achieves good clustering results in many applications, its time and space complexity are very high, which cannot meet the needs of large-scale data processing. To solve this problem, this paper first analyzes the mathematical relationship between HMRF semi-supervised clustering and kernel k-means, using the matrix trace to unify the objective functions of the two methods. Then, to reduce the complexity of HMRF semi-supervised clustering, it proposes the HMRF semi-supervised approximate kernel k-means algorithm (HMRF-AKKM), which constructs an approximate kernel matrix by sampling and uses approximate kernel k-means to optimize the clustering objective function. Finally, HMRF-AKKM is compared with related clustering algorithms on several benchmark datasets, and the clustering performance of the different algorithms is analyzed. The experimental results show that HMRF-AKKM achieves clustering quality similar to that of the original HMRF semi-supervised clustering on the same tasks but with a shorter clustering time, indicating that it inherits the advantages of both HMRF semi-supervised clustering and approximate kernel k-means. On the one hand, HMRF-AKKM makes full use of pairwise constraints to achieve high clustering quality; on the other hand, it improves clustering efficiency through sampling and matrix approximation. Moreover, clustering quality and efficiency can be balanced by adjusting the sampling ratio and the number of pairwise constraints. Therefore, the proposed HMRF-AKKM algorithm has good scalability and is suitable for clustering large-scale nonlinear data.
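The trace connection mentioned in the abstract follows the construction of Kulis et al. [14]: kernel k-means maximizes a trace objective, and HMRF pairwise penalties can be folded into the kernel matrix. A sketch of this equivalence, where $Y$ denotes the normalized cluster-indicator matrix ($Y^{\top}Y = I$), $K$ the kernel matrix, and $W$ a constraint matrix with positive entries for must-link pairs and negative entries for cannot-link pairs (the exact weighting conventions here follow [14]; the paper's own derivation may differ in detail):

```latex
% Kernel k-means as trace maximization
\max_{Y^{\top}Y = I} \; \operatorname{tr}\!\left(Y^{\top} K Y\right)

% HMRF semi-supervised clustering: pairwise constraints enter as a
% modification of the kernel, so the same trace form is maximized
\max_{Y^{\top}Y = I} \; \operatorname{tr}\!\left(Y^{\top} (K + W) Y\right)
```

Because both objectives share the form $\operatorname{tr}(Y^{\top}\tilde{K}Y)$, any accelerated kernel k-means solver, including the sampling-based approximation used here, applies directly to the semi-supervised case.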

Key words: Semi-supervised clustering, HMRF model, Approximate kernel k-means, Matrix trace, Pairwise constraints
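The pipeline described in the abstract, namely sample a subset of points to build an approximate (Nyström-style) kernel embedding, then run k-means in that embedding with HMRF-style penalties for violated pairwise constraints, can be sketched as follows. This is an illustrative reconstruction from the abstract only, not the authors' implementation; the RBF kernel, the penalty weight `w`, and the ICM-style label update are assumptions.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    # Gaussian (RBF) kernel between the rows of X and the rows of Y
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def hmrf_akkm(X, k, sample_ratio=0.1, must_link=(), cannot_link=(),
              w=1.0, n_iter=20, seed=0):
    """Sketch of HMRF-AKKM: approximate kernel k-means with
    pairwise-constraint penalties (hypothetical reconstruction)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    m = max(k, int(sample_ratio * n))          # number of sampled points
    idx = rng.choice(n, size=m, replace=False)
    K_nm = rbf_kernel(X, X[idx])               # n x m sampled kernel columns
    K_mm = rbf_kernel(X[idx], X[idx])          # m x m sampled kernel block
    # Nystrom-style embedding: E @ E.T approximates the full kernel matrix
    U, s, _ = np.linalg.svd(K_mm)
    E = K_nm @ U / np.sqrt(np.maximum(s, 1e-12))
    labels = rng.integers(0, k, size=n)
    for _ in range(n_iter):
        # recompute cluster centers in the embedded space
        centers = np.vstack([E[labels == c].mean(0) if (labels == c).any()
                             else E[rng.integers(n)] for c in range(k)])
        dist = ((E[:, None, :] - centers[None]) ** 2).sum(-1)  # n x k
        # ICM-style sweep: each point picks the cluster minimizing
        # distance plus HMRF penalties for violated constraints
        for i in range(n):
            cost = dist[i].copy()
            for a, b in must_link:
                j = b if a == i else (a if b == i else None)
                if j is not None:
                    cost += w * (np.arange(k) != labels[j])
            for a, b in cannot_link:
                j = b if a == i else (a if b == i else None)
                if j is not None:
                    cost += w * (np.arange(k) == labels[j])
            labels[i] = int(cost.argmin())
    return labels
```

Raising `sample_ratio` tightens the kernel approximation at higher cost, while adding pairwise constraints improves quality at a small per-point overhead, mirroring the quality/efficiency trade-off described in the abstract.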

CLC number: TP391
[1] ZHANG X T,ZHANG X C,LIU H.Weighted Multi-Task Clustering by Feature and Instance Transfer [J].Chinese Journal of Computers,2019,42(36):1-17.(in Chinese) 张晓彤,张宪超,刘晗.基于特征和实例迁移的加权多任务聚类[J].计算机学报,2019,42(36):1-17.
[2] MARIN D,TANG M,AYED I B,et al.Kernel clustering:density biases and solutions [J].IEEE Transactions on Pattern Ana-lysis and Machine Intelligence,2019,41(1):136-147.
[3] LIU X,ZHU X,LI M,et al.Multiple kernel k-means with incomplete kernels [OL].https://doi.org/10.1109/TPAMI.2019.2892416.
[4] CHITTA R,JIN R,HAVENS T C,et al.Scalable kernel clustering:Approximate kernel k-means [J].arXiv:1402.3849.
[5] MEHRKANOON S,ALZATE C,MALL R,et al.Multiclass Semisupervised Learning Based Upon Kernel Spectral Clustering [J].IEEE Transactions on Neural Networks & Learning Systems,2015,26(4):720-733.
[6] GAN H,FAN Y,LUO Z,et al.Local homogeneous consistent safe semi-supervised clustering [J].Expert Systems with Applications,2018,97:384-393.
[7] WANG W,YANG C,CHEN H,et al.Unified Discriminative and Coherent Semi-Supervised Subspace Clustering [J].IEEE Transactions on Image Processing,2018,27(5):2461-2470.
[8] YU Z,LUO P,LIU J,et al.Semi-supervised ensemble clustering based on selected constraint projection [J].IEEE Transactions on Knowledge and Data Engineering,2018,30(12):2394-2407.
[9] REN Y,HU K,DAI X,et al.Semi-supervised deep embedded clustering [J].Neurocomputing,2019,325:121-130.
[10] MEI J P.Semi-supervised fuzzy clustering with partition information of subsets[OL].https://doi.org/10.1109/TFUZZ.2018.2889010.
[11] NGUYEN B,DE BAETS B.Kernel-Based Distance Metric Learning for Supervised k-Means Clustering [OL].https://doi.org/10.1109/TNNLS.2018.2890021.
[12] WU B,ZHANG Y,HU B G,et al.Constrained clustering and its application to face clustering in videos[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.IEEE,2013:3507-3514.
[13] WANG S,GITTENS A,MAHONEY M W.Scalable kernel K-means clustering with Nyström approximation:relative-error bounds [J].The Journal of Machine Learning Research,2019,20(1):431-479.
[14] KULIS B,BASU S,DHILLON I,et al.Semi-supervised graph clustering:a kernel approach [J].Machine Learning,2009,74(1):1-22.
[15] BÜHLER T,HEIN M.Spectral clustering based on the graph p-Laplacian[C]//Proceedings of the 26th International Confe-rence on Machine Learning.ACM,2009:81-88.
[16] CAI D,HE X,HAN J,et al.Graph regularized nonnegative matrix factorization for data representation [J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2011,33(8):1548-1560.
[17] BAGHSHAH M S,AFSARI F,SHOURAKI S B,et al.Scalable semi-supervised clustering by spectral kernel learning [J].Pattern Recognition Letters,2014,45:161-171.
[18] BHATT R B,DHALL A.Skin Segmentation Dataset.UCI Machine Learning Repository[OL].https://archive.ics.uci.edu/ml/index.html.
[19] KLEIN D,KAMVAR S D,MANNING C D.From Instance-level Constraints to Space-Level Constraints:Making the Most of Prior Knowledge in Data Clustering[C]//Proceedings of the 19th International Conference on Machine Learning.ACM,2002:307-314.
[20] JIA H J,DING S F,SHI Z Z.Approximate weighted kernel k-means for large-scale spectral clustering [J].Journal of Software,2015,26(11):2836-2846.(in Chinese) 贾洪杰,丁世飞,史忠植.求解大规模谱聚类的近似加权核k-means算法[J].软件学报,2015,26(11):2836-2846.
[21] YANG Y,TAN W,LI T,et al.Consensus clustering based on constrained self-organizing map and improved Cop-Kmeans ensemble in intelligent decision support systems [J].Knowledge-Based Systems,2012,32:101-115.
[22] DING S,JIA H,DU M,et al.A semi-supervised approximate spectral clustering algorithm based on HMRF model [J].Information Sciences,2018,429:215-228.
[23] SHI Y,OTTO C,JAIN A K.Face clustering:representation and pairwise constraints [J].IEEE Transactions on Information Forensics and Security,2018,13(7):1626-1640.
[24] HE L,RAY N,GUAN Y,et al.Fast large-scale spectral clustering via explicit feature mapping [J].IEEE Transactions on Cybernetics,2018,49(3):1058-1071.
[25] ZHAO X,LIANG J,DANG C.A stratified sampling based clustering algorithm for large-scale data [J].Knowledge-Based Systems,2019,163:416-428.