Spark Based Condensed Nearest Neighbor Algorithm

Abstract

K-nearest neighbors (K-NN) is a lazy learning algorithm: classifying data with K-NN does not require training a classification model. The algorithm is simple and easy to implement. Its disadvantage is the large amount of computation it requires, because the distance between the test instance and every training instance must be calculated. Condensed nearest neighbors (CNN) can overcome this drawback. However, CNN is an iterative algorithm, and its efficiency becomes very low in big data scenarios. To address this problem, this paper proposes an algorithm named Spark CNN, which can significantly improve the efficiency of CNN in big data settings. Spark CNN is compared experimentally with MapReduce CNN on five big data sets, and the results show that Spark CNN is very effective.
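To make the iterative nature of CNN concrete, the following is a minimal single-machine sketch of a Hart-style condensation loop in plain Python. It is not the Spark CNN algorithm of this paper, whose details are not given in the abstract. The function names `nearest_label` and `condense`, the Euclidean distance, the fixed random seed, and the toy two-cluster data are all assumptions made for illustration.

```python
import math
import random


def nearest_label(store, x):
    """Return the label of the stored instance closest to x (1-NN, Euclidean)."""
    best = min(store, key=lambda item: math.dist(item[0], x))
    return best[1]


def condense(train, seed=0):
    """Hart-style condensation: keep only instances the current store misclassifies.

    Assumes no two identical points carry different labels; otherwise the
    loop below would never reach a pass without additions.
    """
    rng = random.Random(seed)
    order = list(train)
    rng.shuffle(order)
    store = [order[0]]          # start with one arbitrary instance
    changed = True
    while changed:              # repeat passes until a full pass adds nothing
        changed = False
        for x, y in order:
            if nearest_label(store, x) != y:
                store.append((x, y))
                changed = True
    return store


# Toy data: two well-separated clusters (made up for illustration).
gen = random.Random(1)
train = [((gen.gauss(0, 0.5), gen.gauss(0, 0.5)), 0) for _ in range(100)]
train += [((gen.gauss(5, 0.5), gen.gauss(5, 0.5)), 1) for _ in range(100)]

subset = condense(train)
# The condensed subset classifies every training instance correctly with 1-NN.
consistent = all(nearest_label(subset, x) == y for x, y in train)
print(len(train), len(subset), consistent)
```

Each pass scans the whole training set and depends on the store built so far, which is why the procedure needs repeated passes over the data and becomes slow when the training set is very large.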

Key words: Big data, Condensed nearest neighbors, Instance selection, Iterative calculation, Lazy learning

CLC Number: TP181

ZHANG Su-fang, ZHAI Jun-hai, WANG Ting-ting, HAO Pu, WANG Cong, ZHAO Chun-ling. Spark Based Condensed Nearest Neighbor Algorithm[J]. Computer Science, 2018, 45(6A): 406-410.

