Computer Science (计算机科学), 2019, Vol. 46, Issue (11A): 98-102.
ZHANG Shi-xiang, LI Wang-geng, LI Tong, ZHU Nan-nan
Abstract: With the rapid development of the Internet, new ways of publishing information keep emerging, and the data they generate is growing explosively at an unprecedented rate. How to process and analyze massive raw data and turn it into usable knowledge has become an important topic of common concern for scientists and engineers worldwide. Bayesian methods offer rich hierarchical models, uncertainty quantification, and prior specification, which makes them attractive in the large-scale data setting. The limited-iteration bisecting K-means algorithm approximately preserves the clustering quality of the standard bisecting K-means algorithm while achieving higher computational efficiency, making it better suited to large datasets that demand faster processing. To address the low execution efficiency of the original coreset construction algorithm, this paper improves the limited-iteration bisecting K-means algorithm so that, while maintaining clustering quality, it obtains clustering results more quickly and computes the weights of the relevant data points, thereby constructing the coreset. Experiments show that, compared with the original algorithm, the improved algorithm is more computationally efficient, achieves comparable approximation performance, and in some cases approximates better.
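The limited-iteration bisecting K-means procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration assuming NumPy; the iteration cap `max_iter`, the SSE-based choice of which cluster to bisect, and all function names are our own assumptions for exposition, not the paper's exact algorithm.

```python
import numpy as np

def kmeans2(points, max_iter, rng):
    """2-means split with a capped number of Lloyd iterations."""
    # Initialize the two centers from two distinct data points.
    centers = points[rng.choice(len(points), size=2, replace=False)]
    for _ in range(max_iter):  # the iteration limit is the key idea
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            points[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(2)
        ])
        if np.allclose(new_centers, centers):  # converged early
            break
        centers = new_centers
    return labels

def limited_iter_bisecting_kmeans(points, n_clusters, max_iter=3, seed=0):
    """Repeatedly bisect the cluster with the largest SSE until
    n_clusters clusters remain; each bisection runs at most max_iter
    Lloyd iterations instead of running 2-means to convergence."""
    rng = np.random.default_rng(seed)
    clusters = [points]
    while len(clusters) < n_clusters:
        # Pick the cluster with the largest within-cluster sum of squares.
        sse = [((c - c.mean(axis=0)) ** 2).sum() for c in clusters]
        target = clusters.pop(int(np.argmax(sse)))
        labels = kmeans2(target, max_iter, rng)
        left, right = target[labels == 0], target[labels == 1]
        if len(left) == 0 or len(right) == 0:
            # Degenerate split: fall back to an arbitrary halving.
            mid = len(target) // 2
            left, right = target[:mid], target[mid:]
        clusters.append(left)
        clusters.append(right)
    return clusters
```

In a coreset construction along the lines the abstract sketches, each resulting cluster would then contribute representative points whose weights reflect the cluster's contribution (for instance, proportional to cluster size); the exact weighting scheme is the paper's own and is not reproduced here.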