Computer Science ›› 2018, Vol. 45 ›› Issue (11A): 349-352.

• Information Security • Previous Articles     Next Articles

Spark-based Parallel Outlier Detection Algorithm of K-nearest Neighbor

FENG Gui-lan1, ZHOU Wen-gang2   

  1. Modern Education Technology Center,Civil Aviation Flight University of China,Guanghan,Sichuan 618307,China1
    Institute of Flight Technology,Civil Aviation Flight University of China,Guanghan,Sichuan 618307,China2
  • Online:2019-02-26 Published:2019-02-26

Abstract: With the advent of big data era,outlier detection has attracted extensive attention.Computational resources of the traditional K-nearest neighbor outlier detection dealing with massive high dimensional data with single machine are insufficient,and the MapReduce in Hadoop cannot effectively deal with frequent iteration calculation problem.According to the above problems,this paper put forward a Spark-based parallel outlier detection algorithm of K-nearest neighbor,named SPKNN.Firstly,in the stage of map,the algorithm tries to find the local K nearest neighbors for each partition of the data in all data set.Then in the reduce stage,it determines the global K nearest neighbors according to the local K nearest neighbors of each partition.Finally,it calculates the degrees of outliers by using global K nearest neighbors and select outliers.Compared with the traditional K-nearest neighbor outlier detection,the performance of the SPKNN has an approximate linear relationship with computing resources in the premise of ensuring the detection accuracy.And compared with other outlier detection methods,it doesn’t need additional extension data,support iteration calculation and can reduce I/O costs by using memory cache.Experiment results of SPKNN show that it has high efficiency and scalability for massive data sets.

Key words: K-nearest neighbors, Outlier detection, Parallel, Spark

CLC Number: 

  • TP311
[1]陈运文,吴飞,吴庐山,等.基于异常检测的时间序列研究[J].计算机技术与发展,2015(4):166-170.
[2]HODGE V J,AUSTIN J.A survey of outlier detection metho-dologies[J].Artificial Intelligence Review,2004,22(2):85-126.
[3]邹云峰,张昕,宋世渊,等.基于局部密度的快速离群点检测算法[J].计算机应用,2017,37(10):2932-2937.
[4]KNORR,EDWIN M,NG,et al.Distance-based outliers:algorithms and applications[J].Vldb Journal,2000,8(3-4):237-253.
[5]BREUNIG M M.LOF:identifying density-based local outliers[J].ACM Sigmod Record,2000,29(2):93-104.
[6]辛丽玲.基于密度差异的离群点检测研究[D].北京:北京交通大学,2015.
[7]PAN Y,ZHANG J.Parallel Programming on Cloud Computing Platforms[J].Journal of Convergence Volume,2012,3:23-28.
[8]SUBRAMANYAM R B V,SONAM G.Map-Reduce Algorithm for Mining Outliers in the Large Data Sets using Twister Programing Model[J].International Journal of Computer Science and Electronics Engineering,2015,3(1):81-86.
[9]郭一鹏,梁吉业,赵兴旺.基于MapReduce的混合数据孤立点检测算法[J].小型微型计算机系统,2014,35(9):1961-1966.
[10]苟杰,马自堂,张喆程.PODKNN:面向大数据集的并行离群点检测算法[J].计算机科学,2016,43(7):251-254.
[11]KNORR E M,NG R T.A Unified Notion of Outliers:Properties and Computation[C]∥International Conference on Knowledge Discovery & Data Mining.1997:219-222.
[12]RAMASWAMY S,RASTOGI R,SHIM K.Efficient algorithms for mining outliers from large data sets[C]∥ACM SIGMOD International Conference on Management of Data.ACM,2000:427-438.
[13]高彦杰.Spark 大数据处理技术、应用与性能优化[M].北京:机械工业出版社,2014.
[14]AGGARWAL C C,YU P S.Outlier Detection for High Dimensional Data[J].ACM Sigmod Record,2001,30(2):37-46.
[15]ZHANG X,GONG K,ZHAO G.Parallel K-Medoids algorithm based on MapReduce[J].Journal of Computer Applications,2013,33(4):1023-1005.
[16]ALCALÁ-FDEZ J,FERNÁNDEZ A,LUENGO J,et al.KEEL Data-Mining Software Tool:Data Set Repository,Integration of Algorithms and Experimental Analysis Framework[J].Journal of Multiple-Valued Logic & Soft Computing,2011,17:255-287.
[1] WANG Zi-kai, ZHU Jian, ZHANG Bo-jun, HU Kai. Research and Implementation of Parallel Method in Blockchain and Smart Contract [J]. Computer Science, 2022, 49(9): 312-317.
[2] WEI Kai-xuan, FU Ying. Re-parameterized Multi-scale Fusion Network for Efficient Extreme Low-light Raw Denoising [J]. Computer Science, 2022, 49(8): 120-126.
[3] ZONG Di-di, XIE Yi-wu. Model Medial Axis Generation Method Based on Normal Iteration [J]. Computer Science, 2022, 49(6A): 764-770.
[4] CHEN Xin, LI Fang, DING Hai-xin, SUN Wei-ze, LIU Xin, CHEN De-xun, YE Yue-jin, HE Xiang. Parallel Optimization Method of Unstructured-grid Computing in CFD for DomesticHeterogeneous Many-core Architecture [J]. Computer Science, 2022, 49(6): 99-107.
[5] ZHAO Geng, SONG Xin-yu, MA Ying-jie. Secure Data Link of Unmanned Aerial Vehicle Based on Chaotic Sub-carrier Modulation [J]. Computer Science, 2022, 49(3): 322-328.
[6] LIU Jiang, LIU Wen-bo, ZHANG Ju. Hybrid MPI+OpenMP Parallel Method on Polyhedral Grid Generation in OpenFoam [J]. Computer Science, 2022, 49(3): 3-10.
[7] LIU Yan, XIONG De-yi. Construction Method of Parallel Corpus for Minority Language Machine Translation [J]. Computer Science, 2022, 49(1): 41-46.
[8] LIU Yi, MAO Ying-chi, CHENG Yang-kun, GAO Jian, WANG Long-bao. Locality and Consistency Based Sequential Ensemble Method for Outlier Detection [J]. Computer Science, 2022, 49(1): 146-152.
[9] LI Jia-zhen, JI Qing-ge. Dynamic Low-sampling Ambient Occlusion Real-time Ray Tracing for Molecular Rendering [J]. Computer Science, 2022, 49(1): 175-180.
[10] DAI Hong-liang, ZHONG Guo-jin, YOU Zhi-ming , DAI Hong-ming. Public Opinion Sentiment Big Data Analysis Ensemble Method Based on Spark [J]. Computer Science, 2021, 48(9): 118-124.
[11] YU Jian-ye, QI Yong, WANG Bao-zhuo. Distributed Combination Deep Learning Intrusion Detection Method for Internet of Vehicles Based on Spark [J]. Computer Science, 2021, 48(6A): 518-523.
[12] FU Tian-hao, TIAN Hong-yun, JIN Yu-yang, YANG Zhang, ZHAI Ji-dong, WU Lin-ping, XU Xiao-wen. Performance Skeleton Analysis Method Towards Component-based Parallel Applications [J]. Computer Science, 2021, 48(6): 1-9.
[13] HE Ya-ru, PANG Jian-min, XU Jin-long, ZHU Yu, TAO Xiao-han. Implementation and Optimization of Floyd Parallel Algorithm Based on Sunway Platform [J]. Computer Science, 2021, 48(6): 34-40.
[14] JI Yan, DAI Hua, JIANG Ying-ying, YANG Geng, Yi Xun. Parallel Multi-keyword Top-k Search Scheme over Encrypted Data in Hybrid Clouds [J]. Computer Science, 2021, 48(5): 320-327.
[15] FENG Kai, MA Xin-yu. Subnetwork Reliability of (n,k)-bubble-sort Networks [J]. Computer Science, 2021, 48(4): 43-48.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!