摘要: 大规模的稀疏图数据在现实中大量出现,例如协同图、拉普拉斯矩阵等。非负矩阵分解(NMF)已经成为数据挖掘、信息检索和信号处理的一个非常重要的工具。随着数据量的不断增大,如何实现大规模数据的偏好聚类是一个重要的问题。采用两阶段的方法来实现大规模的偏好聚类,即首先利用Nystrm的近似采样方法,从大数据上获得数据的初始轮廓,获得部分用户-用户相似矩阵或电影-电影相似矩阵,从而可以将原始的高维空间降低到一个低维子空间;然后通过对低维相似矩阵进行凸的非负矩阵分解,从而得到聚类的中心和指示器,聚类的中心表示电影或用户的特征,指示器表示用户或电影特征的权重。该两阶段偏好聚类方法的优点是,初始数据轮廓的近似获取以及凸的非负矩阵分解,使得该方法具有较好的鲁棒性和抗噪性;另外,子空间的数据来源于真实的矩阵行列数据,使得偏好聚类结果具有良好的可解释性。采用Nystrm方法解决了大规模的数据无法在内存中存储的问题,从而大大节省了内存,提高了运行效率。最后在含有100000条电影的数据集上进行偏好聚类,结果表明了该聚类算法的有效性。
[1] WANG L,BEZDEK J C,LECKIE C.Selective sampling for approximate clustering of very large data sets[J].International Journal of Intelligent Systems,2008,3(3):313-331. [2] XU R,II D C W.IEEE,Survey of clustering algorithms[J].IEEE Transactions on Neural Networks,2005,16(3):645-678. [3] SUN J G,LIU J,ZHAO L Y.Clustering algorithm research[J].Journal of Software,2008,19(1):48-61.(in Chinese) 孙吉贵,刘杰,赵连宇.聚类算法研究[J].软件学报,2008,19(1):48-61. [4] DING S F,JIA H J,SHI Z Z.Spectral clustering algorithm for large data spectrum based on adaptive Nystrm sampling[J].Journal of Software,2014,25(9):2037-2049.(in Chinese) 丁世飞,贾洪杰,史忠植.基于自适应Nystrm采样的大数据谱聚类算法[J].软件学报,2014,25(9):2037-2049. [5] SUN S,ZHAO J,ZHU J.A review of Nystrm methods for large-scale machine learning[J].Information Fusion,2015,26(1):36-48. [6] WANG S,ZHANG Z.Improving CUR matrix decompositionand the Nystrm approximation via adaptive sampling [J].Journal of Machine Learning Research,2013,14(1):2729-2769. [7] CHEN X,CAI D.Large scale spectral clustering with landmark-based representation [C]∥AAAI Conferenceon Artificial Intelligence(AAAI 2011).San Francisco,California,USA,2011:55-115. [8] WANG W W,LI X P,FENG X C.Summarization of sparse subspaces [J].Journal of Acta Automatica Sinica,2015,1(8):1373-1384.(in Chinese) 王卫卫,李小平,冯象初.稀疏子空间聚类综述[J].自动化学报,2015,41(8):1373-1384. [9] WANG S,LUO L,ZHANG Z.SPSD matrix approximation vis column selection:theories,algorithms,and extensions [J].Journal of Machine Learning Research,2016,17(5):1-49. [10] WANG S,ZHANG Z,ZHANG T.Towards more efficient SPSD matrix approximation and CUR matrix decomposition [J].Journal of Machine Learning Research,2016,17(210):1-49. [11] ZHU L,LEI J S,BI Z Q.A soft subspace clustering algorithm based on data stream[J].Journal of Software,2013,24(11):2610-2627.(in Chinese) 朱林,雷景生,毕忠勤.一种基于数据流的软子空间聚类算法[J].软件学报,2013,24(11):2610-2627. [12] BACH F R,JORDAN M I.Learning spectral clustering [J].Advances in Neural Information Processing Systems,2004,16(2):2006. [13] DEMPSTER A.Maximum likelihood from incomplete data via theEM algorithm [J].Journal of the Royal Statistical Society,1977,39(1):1-38. [14] SPIELMAN D A.Spectral graph theory and its applications [C]∥IEEE Symposium on Foundations of Computer Science.2007:29-38. [15] CAI X Y,DAI G Z,YANG L B.A review of spectral clustering algorithm [J].Journal of Computer Science,2008,35(7):14-18.(in Chinese) 蔡晓妍,戴冠中,杨黎斌.谱聚类算法综述[J].计算机科学,2008,35(7):14-18. [16] ALZATE C,SUYKENS J A K.Multiway Spectral Clusteringwith Out-of-Sample Extensions through Weighted Kernel PCA [J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2010,32(2):335-347. [17] LAUER F,SCHNOOR C.Spectral clustering of linear sub-spaces for motion segmentation [J].IEEE International Confe-rence on Computer Vision,2009,30(2):678-685. [18] JIA H,DING S,DU M.Approximate normalized cuts without eigen-decomposition [J].Journal of Information Sciences,2016,374(5):135-150. [19] SHI J,MALIK J.Normalised cuts and image segmenerarion[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2000,22(8):888-905. [20] DING C,LI T,JORDAN M I.Convex and semi-nonnegative matrix factorizations[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2010,32(1):45-55. |
No related articles found! |
|