基于Spark的压缩近邻算法

计算机科学 ›› 2018, Vol. 45 ›› Issue (6A): 406-410.

基于Spark的压缩近邻算法

张素芳¹,翟俊海²,王婷婷²,郝璞²,王聪²,赵春玲²

中国气象局气象干部培训学院河北分院河北保定071000¹
河北大学数学与信息科学学院河北省机器学习与计算智能重点实验室河北保定071002²

出版日期:2018-06-20 发布日期:2018-08-03
作者简介:张素芳(1966－),女,硕士,副教授,主要研究方向为机器学习;翟俊海(1964－),男,博士,教授,主要研究方向为机器学习,E-mail:mczjh@hbu.com(通信作者);王婷婷(1991-),女,硕士生,主要研究方向为云计算与大数据处理。
基金资助:
国家自然科学基金项目(71371063),河北省自然科学基金项目(F2017201026),河北大学自然科学研究计划项目(799207217071),河北大学大学生创新训练项目(2017071)资助

Spark Based Condensed Nearest Neighbor Algorithm

ZHANG Su-fang¹,ZHAI Jun-hai²,WANG Ting-ting²,HAO Pu²,WANG Cong²,ZHAO Chun-ling²

Hebei Branch of China Meteorological Administration Training Centre,China Meteorological Administration,Baoding,Hebei 071000,China¹
Key Lab.of Machine Learning and Computational Intelligence,College of Mathematics and Information Science, Hebei University,Baoding,Hebei 071002,China²

Online:2018-06-20 Published:2018-08-03

摘要/Abstract

摘要： K-近邻(K-Nearest Neighbors,K-NN)是一种懒惰学习算法,用K-NN对数据分类时,不需要训练分类模型。K-NN算法的优点是思想简单、易于实现;缺点是计算量大,原因是在对测试样例进行分类时,其需要计算测试样例与训练集中每一个训练样例之间的距离。压缩近邻算法(Condensed Nearest Neighbors,CNN)可以克服K-NN算法的不足。但是,在面对大数据集时,由于自身的迭代计算特性,CNN的运算效率会变得非常低。针对这一问题,提出一种名为Spark CNN的压缩近邻算法。在大数据环境下,与基于MapReduce的CNN算法相比,Spark CNN的效率大幅提高,在5个大数据集上的实验证明了这一结论。

关键词: 大数据, 迭代计算, 懒惰学习, 压缩近邻, 样例选择

Abstract: K-nearest neighbors (K-NN) is a lazy learning algorithm.It is unnecessary to train classification models,when one uses K-NN for data classification.K-NN algorithm is simple and easy to implement.The disadvantages of K-NN is that it requires large number of computations,which is introduced by calculating distances between testing instance and every training instance.Condensed nearest neighbors (CNN) can overcome the drawback of K-NN mentioned above.However,CNN is an iterative algorithm,when it is applied in big data scenario,its efficiency becomes very low.In order to deal with this problem,this paper proposed an algorithm named Spark CNN.In big data circumstances,Spark CNN can significantly improve the efficiency of CNN.This paper experimentally compared the Spark CNN with MapReduce CNN on 5 big data sets,the experimental results show that the Spark CNN is very effective.

Key words: Big data, Condensed nearest neighbors, Instance selection, Iterative calculation, Lazy learning

中图分类号:

TP181

张素芳,翟俊海,王婷婷,郝璞,王聪,赵春玲. 基于Spark的压缩近邻算法[J]. 计算机科学, 2018, 45(6A): 406-410. https://doi.org/

ZHANG Su-fang,ZHAI Jun-hai,WANG Ting-ting,HAO Pu,WANG Cong,ZHAO Chun-ling. Spark Based Condensed Nearest Neighbor Algorithm[J]. Computer Science, 2018, 45(6A): 406-410. https://doi.org/

参考文献

[1]COVER T,HART P.Nearest neighbor pattern classification [J]. IEEE Transactions on Information Theory,1967,13(1):21-27. [2]HART P.The condensed nearest neighbor rule[J].IEEE Transaction on Information Theory,1968,14(5):15-516. [3]ZHAI J H,LI T,WANG X Z.A cross-selection instance algorithm [J].Journal of Intelligent & Fuzzy Systems,2016,30 (2):717-728. [4]SONG Y S,LIANG J Y,LU J,et al.An effcient instance selection algorithm for k nearest neighbor regression[J].Neurocomputing,2017,251:26-34. [5]ONAN A.A fuzzy-rough nearest neighbor classifier combined with consistency-based subset evaluation and instance selection for automated diagnosis of breast cancer[J].Expert Systems with Applications,2015,42(20):6844-6852. [6]ALVAR A G,JOSE-FRANCISCO D P,RODRíGUEZ J J,et al.Instance selection of linear complexity for big data[J].Know-ledge-Based Systems,2016,107(C):83-95. [7]HOU G,CUI R,PAN Z,et al.Tree-based compact hashing for approximate nearest neighbor search[J].Neurocomputing,2015,166(C):271-281. [8]WAN J,TANG S,ZHANG D D,et al.HDIdx:High-dimensional indexing for efficient approximate nearest neighbor search [J].Neurocomputing,2017,237:401-404. [9]文庆福,王建民,朱晗,等.面向近似近邻查询的分布式哈希学习方法[J].计算机学报,2017,40(1):192-206. [10]刘义,景宁,陈荦,等.MapReduce框架下基于R-树的k-近邻连接算法[J].软件学报,2013,24(8):1836-1851. [11]MUJA M,LOWE D G.Scalable nearest neighbor algorithms for high dimensional data[J].IEEE Transactions on Pattern Analysis & Machine Intelligence,2014,36(11):2227-2240. [12]MAILLO J,RAM REZ S,TRIGUERO I,et al.kNN-IS:An Itera- tive Spark-based design of the k-nearest neighbors classifier for big data [J].Knowledge-Based Systems,2017,117:3-15. [13]ZHAI J H,WANG X Z,PANG X H.Voting-based instance selection from large data sets with mapreduce and random weight networks[J].Information Sciences,2016,367:1066-1077. [14]SONG G,ROCHAS J,BEZE L E,et al.K nearest neighbour joins for big data on mapreduce:a theoretical and experimentalanalysis[J].IEEE Transactions on Knowledge & Data Engineering,2016,28(9):2376-2392. [15]刘军,林文辉,方澄.Spark大数据处理-原理、算法与实例[M].北京:清华大学出版社,2016. [16]翟俊海,郝璞,王婷婷,张明阳.MapReduce并行化压缩近邻算法[J].小型微型计算机系统,2017(12):2678-2682.

相关文章 15

[1]	何强, 尹震宇, 黄敏, 王兴伟, 王源田, 崔硕, 赵勇. 基于大数据的进化网络影响力分析研究综述 Survey of Influence Analysis of Evolutionary Network Based on Big Data 计算机科学, 2022, 49(8): 1-11. https://doi.org/10.11896/jsjkx.210700240
[2]	陈晶, 吴玲玲. 多源异构环境下的车联网大数据混合属性特征检测方法 Mixed Attribute Feature Detection Method of Internet of Vehicles Big Datain Multi-source Heterogeneous Environment 计算机科学, 2022, 49(8): 108-112. https://doi.org/10.11896/jsjkx.220300273
[3]	王美珊, 姚兰, 高福祥, 徐军灿. 面向医疗集值数据的差分隐私保护技术研究 Study on Differential Privacy Protection for Medical Set-Valued Data 计算机科学, 2022, 49(4): 362-368. https://doi.org/10.11896/jsjkx.210300032
[4]	孙轩, 王焕骁. 政务大数据安全防护能力建设:基于技术和管理视角的探讨 Capability Building for Government Big Data Safety Protection:Discussions from Technologicaland Management Perspectives 计算机科学, 2022, 49(4): 67-73. https://doi.org/10.11896/jsjkx.211000010
[5]	王俊, 王修来, 庞威, 赵鸿飞. 面向科技前瞻预测的大数据治理研究 Research on Big Data Governance for Science and Technology Forecast 计算机科学, 2021, 48(9): 36-42. https://doi.org/10.11896/jsjkx.210500207
[6]	余乐章, 夏天宇, 荆一楠, 何震瀛, 王晓阳. 面向大数据分析的智能交互向导系统 Smart Interactive Guide System for Big Data Analytics 计算机科学, 2021, 48(9): 110-117. https://doi.org/10.11896/jsjkx.200900083
[7]	王立梅, 朱旭光, 汪德嘉, 张勇, 邢春晓. 基于深度学习的民事案件判决结果分类方法研究 Study on Judicial Data Classification Method Based on Natural Language Processing Technologies 计算机科学, 2021, 48(8): 80-85. https://doi.org/10.11896/jsjkx.210300130
[8]	王雪岑, 张昱, 刘迎婕, 于戈. 基于表示学习的在线学习交互质量评价方法 Evaluation of Quality of Interaction in Online Learning Based on Representation Learning 计算机科学, 2021, 48(2): 207-211. https://doi.org/10.11896/jsjkx.201000042
[9]	滕建, 滕飞, 李天瑞. 基于3D卷积和LSTM编码解码的出行需求预测 Travel Demand Forecasting Based on 3D Convolution and LSTM Encoder-Decoder 计算机科学, 2021, 48(12): 195-203. https://doi.org/10.11896/jsjkx.210400022
[10]	张育龙, 王强, 陈明康, 孙静涛. 图像去雨算法在云物联网应用中的研究综述 Survey of Intelligent Rain Removal Algorithms for Cloud-IoT Systems 计算机科学, 2021, 48(12): 231-242. https://doi.org/10.11896/jsjkx.201000055
[11]	曹萌, 于洋, 梁英, 史红周. 基于区块链的大数据交易关键技术与发展趋势 Key Technologies and Development Trends of Big Data Trade Based on Blockchain 计算机科学, 2021, 48(11A): 184-190. https://doi.org/10.11896/jsjkx.210100163
[12]	刘亚臣, 黄雪莹. 卫星监测时空大数据蠕变特征提取及预警算法 Research on Creep Feature Extraction and Early Warning Algorithm Based on Satellite MonitoringSpatial-Temporal Big Data 计算机科学, 2021, 48(11A): 258-264. https://doi.org/10.11896/jsjkx.201000071
[13]	张光君, 张翔. 应用“大数据+区块链”优化立法评估制度的机理与路径 Mechanism and Path of Optimizing Institution of Legislative Evaluation by Applying “Big Data+Blockchain” 计算机科学, 2021, 48(10): 324-333. https://doi.org/10.11896/jsjkx.201200105
[14]	叶雅珍, 刘国华, 朱扬勇. 数据产品流通的两阶段授权模式 Two-step Authorization Pattern of Data Product Circulation 计算机科学, 2021, 48(1): 119-124. https://doi.org/10.11896/jsjkx.191100217
[15]	赵会群, 吴凯锋. 一种大数据估价算法 Big Data Valuation Algorithm 计算机科学, 2020, 47(9): 110-116. https://doi.org/10.11896/jsjkx.191000156

Metrics

Viewed

Full text

Abstract

Cited

Shared

Discussed