Computer Science ›› 2019, Vol. 46 ›› Issue (11A): 98-102.

• Intelligent Computing •

  • Corresponding author: LI Wang-geng, born in 1973, Ph.D, professor; his main research interests include DNA computing, intelligent computing and pattern recognition. E-mail: xchen@mail.ahnu.edu.
  • About author: ZHANG Shi-xiang, born in 1991, master; his main research interest is machine learning.
  • Supported by:
    National Natural Science Foundation of China (111023) and the University Leading Talents Introduction and Cultivation Program (051619).

Improved CoreSets Construction Algorithm for Bayesian Logistic Regression

ZHANG Shi-xiang, LI Wang-geng, LI Tong, ZHU Nan-nan   

  1. (School of Computer and Information, Anhui Normal University, Wuhu, Anhui 241000, China)
  • Online: 2019-11-10 Published: 2019-11-20


Abstract: With the rapid development of the Internet, new ways of publishing information keep emerging, and the data they generate are growing explosively at an unprecedented rate. How to process and analyze this huge volume of raw data and turn it into usable knowledge has become an important topic for scientists and engineers worldwide. The Bayesian approach provides rich hierarchical models, uncertainty quantification and prior specification, which makes it very attractive for large-scale data. The limited-iteration bisecting K-means algorithm approximately preserves the clustering quality of the standard bisecting K-means algorithm while offering higher computational efficiency, making it better suited to large data sets that require faster processing. To address the low execution efficiency of the original coreset construction algorithm, this paper improves the limited-iteration bisecting K-means algorithm so that, without compromising the clustering quality, the clustering result is obtained more quickly and the weights of the relevant data points are computed, from which the coreset is constructed. Experiments show that, compared with the original algorithm, the improved algorithm is computationally more efficient, achieves similar approximation performance, and in some cases approximates even better.
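To make the procedure described in the abstract concrete, the following Python (NumPy) sketch illustrates its two stages: a bisecting K-means whose Lloyd iterations per split are capped at a small constant, and an importance-sampling step that turns the resulting clustering into a weighted coreset. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the default cap of three iterations, and the simple "1 + squared distance to the cluster centre" sensitivity surrogate are placeholders, whereas the original construction of Huggins et al. derives a tighter sensitivity bound specific to logistic regression.

    import numpy as np

    def lloyd_2means(X, max_iter, rng):
        """One bisecting step: split X into two clusters, running at most
        max_iter Lloyd iterations (this cap is what makes the variant fast)."""
        centers = X[rng.choice(len(X), size=2, replace=False)]
        labels = np.zeros(len(X), dtype=int)
        for _ in range(max_iter):
            # assign every point to the nearer of the two centres
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            if labels.min() == labels.max():   # degenerate split, stop early
                break
            centers = np.stack([X[labels == j].mean(axis=0) for j in (0, 1)])
        return labels

    def limited_iter_bisecting_kmeans(X, k, max_iter=3, seed=0):
        """Repeatedly bisect the cluster with the largest SSE until k clusters."""
        rng = np.random.default_rng(seed)
        clusters = [np.arange(len(X))]
        while len(clusters) < k:
            sse = [((X[idx] - X[idx].mean(axis=0)) ** 2).sum() for idx in clusters]
            worst = int(np.argmax(sse))
            if sse[worst] == 0.0:              # nothing left worth splitting
                break
            idx = clusters.pop(worst)
            labels = lloyd_2means(X[idx], max_iter, rng)
            left, right = idx[labels == 0], idx[labels == 1]
            if len(left) == 0 or len(right) == 0:
                clusters.append(idx)           # could not split; put it back
                break
            clusters += [left, right]
        return clusters

    def coreset_from_clusters(X, clusters, m, seed=0):
        """Importance-sample m weighted points. The per-point sensitivity
        surrogate (1 + squared distance to the cluster centre) is only
        illustrative; the paper uses a bound tailored to logistic regression."""
        rng = np.random.default_rng(seed)
        s = np.empty(len(X))
        for idx in clusters:
            centre = X[idx].mean(axis=0)
            s[idx] = 1.0 + ((X[idx] - centre) ** 2).sum(axis=1)
        p = s / s.sum()
        picked = rng.choice(len(X), size=m, replace=True, p=p)
        weights = s.sum() / (m * s[picked])    # unbiased: E[sum w_i f_i] = sum_i f_i
        return X[picked], weights

    if __name__ == "__main__":
        X = np.random.default_rng(1).normal(size=(10_000, 5))
        clusters = limited_iter_bisecting_kmeans(X, k=8)
        C, w = coreset_from_clusters(X, clusters, m=500)
        print(C.shape, w.sum())                # weights sum to roughly len(X)

Run as a script, this clusters 10 000 synthetic points into 8 clusters and draws a 500-point coreset whose weights sum to roughly the data-set size; a weighted Bayesian logistic-regression posterior can then be fitted on the 500 points instead of the full data.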

Key words: Bayesian logistic regression, Coresets, Large-scale dataset, Limited-iteration bisecting K-means

CLC Number: TP391