计算机科学 ›› 2018, Vol. 45 ›› Issue (1): 55-61.doi: 10.11896/j.issn.1002-137X.2018.01.008

• CRSSC-CWI-CGrC-3WD 2017 • 上一篇    下一篇

基于Nystrm采样和凸NMF的偏好聚类

杨美姣,刘惊雷   

  1. 烟台大学计算机与控制工程学院 山东 烟台264005,烟台大学计算机与控制工程学院 山东 烟台264005
  • 出版日期:2018-01-15 发布日期:2018-11-13
  • 基金资助:
    本文受国家自然科学基金(61572419,8,61403328,9),山东省自然科学基金(ZR2014FQ016,ZR2014FQ026,5GSF115009,ZR2013FM011)资助

Preference Clustering Based on Nystrm Sampling and Convex-NMF

YANG Mei-jiao and LIU Jing-lei   

  • Online:2018-01-15 Published:2018-11-13

摘要: 大规模的稀疏图数据在现实中大量出现,例如协同图、拉普拉斯矩阵等。非负矩阵分解(NMF)已经成为数据挖掘、信息检索和信号处理的一个非常重要的工具。随着数据量的不断增大,如何实现大规模数据的偏好聚类是一个重要的问题。采用两阶段的方法来实现大规模的偏好聚类,即首先利用Nystrm的近似采样方法,从大数据上获得数据的初始轮廓,获得部分用户-用户相似矩阵或电影-电影相似矩阵,从而可以将原始的高维空间降低到一个低维子空间;然后通过对低维相似矩阵进行凸的非负矩阵分解,从而得到聚类的中心和指示器,聚类的中心表示电影或用户的特征,指示器表示用户或电影特征的权重。该两阶段偏好聚类方法的优点是,初始数据轮廓的近似获取以及凸的非负矩阵分解,使得该方法具有较好的鲁棒性和抗噪性;另外,子空间的数据来源于真实的矩阵行列数据,使得偏好聚类结果具有良好的可解释性。采用Nystrm方法解决了大规模的数据无法在内存中存储的问题,从而大大节省了内存,提高了运行效率。最后在含有100000条电影的数据集上进行偏好聚类,结果表明了该聚类算法的有效性。

关键词: Nystrm方法,凸的非负矩阵分解,偏好聚类,聚类中心,聚类指示器

Abstract: Large-scale sparse graph data appear in reality heavily,for example,collaborative graph,Laplacian matrix,and so on.Non-negative matrix factorization (NMF) has become a useful tool in data mining,information retrieval and signal processing.How to achieve the data clustering in large-scale data is an important issue.This paper used the two-stage method to realize data clustering.First of all,the Nystrm approximate sampling method is used.The initial profile of data is obtained from large data,and the similar matrix of user-user or movie-movie is obtained.The purpose of doing that is to reduce the original high dimensional space to a low dimensional subspace.Then convex non-negative matrix decomposition of low dimensional similarity matrix is used to get the center of the cluster and indicator.The center of the cluster represents the features of movies or users,and the indicator represents the weight of the features of mo-vies or users.The advantage of two-stage preference clustering method is that the approximation of initial data contour and convex non-negative matrix factorization have better robustness and anti-noise.On the other hand,the data of the subspace are derived from the real matrix,which makes the results of clustering preference have good interpretability.This paper utilized Nystrm method to solve the problem that large-scale data cannot be stored in the memory,saving memory and improving operation efficiency.Lastly,the test on movie data sets which containing 100000 ratings shows the effectiveness of the clustering algorithm.

Key words: Nystrm method,Convex non-negative matrix factorization,Preference clustering,Clustering center,Clustering indicator

[1] WANG L,BEZDEK J C,LECKIE C.Selective sampling for approximate clustering of very large data sets[J].International Journal of Intelligent Systems,2008,3(3):313-331.
[2] XU R,II D C W.IEEE,Survey of clustering algorithms[J].IEEE Transactions on Neural Networks,2005,16(3):645-678.
[3] SUN J G,LIU J,ZHAO L Y.Clustering algorithm research[J].Journal of Software,2008,19(1):48-61.(in Chinese) 孙吉贵,刘杰,赵连宇.聚类算法研究[J].软件学报,2008,19(1):48-61.
[4] DING S F,JIA H J,SHI Z Z.Spectral clustering algorithm for large data spectrum based on adaptive Nystrm sampling[J].Journal of Software,2014,25(9):2037-2049.(in Chinese) 丁世飞,贾洪杰,史忠植.基于自适应Nystrm采样的大数据谱聚类算法[J].软件学报,2014,25(9):2037-2049.
[5] SUN S,ZHAO J,ZHU J.A review of Nystrm methods for large-scale machine learning[J].Information Fusion,2015,26(1):36-48.
[6] WANG S,ZHANG Z.Improving CUR matrix decompositionand the Nystrm approximation via adaptive sampling [J].Journal of Machine Learning Research,2013,14(1):2729-2769.
[7] CHEN X,CAI D.Large scale spectral clustering with landmark-based representation [C]∥AAAI Conferenceon Artificial Intelligence(AAAI 2011).San Francisco,California,USA,2011:55-115.
[8] WANG W W,LI X P,FENG X C.Summarization of sparse subspaces [J].Journal of Acta Automatica Sinica,2015,1(8):1373-1384.(in Chinese) 王卫卫,李小平,冯象初.稀疏子空间聚类综述[J].自动化学报,2015,41(8):1373-1384.
[9] WANG S,LUO L,ZHANG Z.SPSD matrix approximation vis column selection:theories,algorithms,and extensions [J].Journal of Machine Learning Research,2016,17(5):1-49.
[10] WANG S,ZHANG Z,ZHANG T.Towards more efficient SPSD matrix approximation and CUR matrix decomposition [J].Journal of Machine Learning Research,2016,17(210):1-49.
[11] ZHU L,LEI J S,BI Z Q.A soft subspace clustering algorithm based on data stream[J].Journal of Software,2013,24(11):2610-2627.(in Chinese) 朱林,雷景生,毕忠勤.一种基于数据流的软子空间聚类算法[J].软件学报,2013,24(11):2610-2627.
[12] BACH F R,JORDAN M I.Learning spectral clustering [J].Advances in Neural Information Processing Systems,2004,16(2):2006.
[13] DEMPSTER A.Maximum likelihood from incomplete data via theEM algorithm [J].Journal of the Royal Statistical Society,1977,39(1):1-38.
[14] SPIELMAN D A.Spectral graph theory and its applications [C]∥IEEE Symposium on Foundations of Computer Science.2007:29-38.
[15] CAI X Y,DAI G Z,YANG L B.A review of spectral clustering algorithm [J].Journal of Computer Science,2008,35(7):14-18.(in Chinese) 蔡晓妍,戴冠中,杨黎斌.谱聚类算法综述[J].计算机科学,2008,35(7):14-18.
[16] ALZATE C,SUYKENS J A K.Multiway Spectral Clusteringwith Out-of-Sample Extensions through Weighted Kernel PCA [J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2010,32(2):335-347.
[17] LAUER F,SCHNOOR C.Spectral clustering of linear sub-spaces for motion segmentation [J].IEEE International Confe-rence on Computer Vision,2009,30(2):678-685.
[18] JIA H,DING S,DU M.Approximate normalized cuts without eigen-decomposition [J].Journal of Information Sciences,2016,374(5):135-150.
[19] SHI J,MALIK J.Normalised cuts and image segmenerarion[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2000,22(8):888-905.
[20] DING C,LI T,JORDAN M I.Convex and semi-nonnegative matrix factorizations[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2010,32(1):45-55.

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 编辑部. 新网站开通,欢迎大家订阅![J]. 计算机科学, 2018, 1(1): 1 .
[2] 雷丽晖,王静. 可能性测度下的LTL模型检测并行化研究[J]. 计算机科学, 2018, 45(4): 71 -75 .
[3] 孙启,金燕,何琨,徐凌轩. 用于求解混合车辆路径问题的混合进化算法[J]. 计算机科学, 2018, 45(4): 76 -82 .
[4] 张佳男,肖鸣宇. 带权混合支配问题的近似算法研究[J]. 计算机科学, 2018, 45(4): 83 -88 .
[5] 伍建辉,黄中祥,李武,吴健辉,彭鑫,张生. 城市道路建设时序决策的鲁棒优化[J]. 计算机科学, 2018, 45(4): 89 -93 .
[6] 史雯隽,武继刚,罗裕春. 针对移动云计算任务迁移的快速高效调度算法[J]. 计算机科学, 2018, 45(4): 94 -99 .
[7] 周燕萍,业巧林. 基于L1-范数距离的最小二乘对支持向量机[J]. 计算机科学, 2018, 45(4): 100 -105 .
[8] 刘博艺,唐湘滟,程杰仁. 基于多生长时期模板匹配的玉米螟识别方法[J]. 计算机科学, 2018, 45(4): 106 -111 .
[9] 耿海军,施新刚,王之梁,尹霞,尹少平. 基于有向无环图的互联网域内节能路由算法[J]. 计算机科学, 2018, 45(4): 112 -116 .
[10] 崔琼,李建华,王宏,南明莉. 基于节点修复的网络化指挥信息系统弹性分析模型[J]. 计算机科学, 2018, 45(4): 117 -121 .