计算机科学 ›› 2018, Vol. 45 ›› Issue (11A): 349-352.

• 信息安全 • 上一篇    下一篇

基于Spark平台的并行KNN异常检测算法

冯贵兰1, 周文刚2   

  1. 中国民航飞行学院现代教育技术中心 四川 广汉6183071
    中国民航飞行学院飞行技术学院 四川 广汉6183072
  • 出版日期:2019-02-26 发布日期:2019-02-26
  • 作者简介:冯贵兰(1988-),女,硕士生,工程师,主要研究领域为云计算、信息安全,E-mail:fengguilan1016@sina.com;周文刚(1981-),男,博士生,讲师,主要研究领域为网络管理、机器学习、人工智能等。
  • 基金资助:
    本文受民航飞行数据分析研究项目(XM2852)资助。

Spark-based Parallel Outlier Detection Algorithm of K-nearest Neighbor

FENG Gui-lan1, ZHOU Wen-gang2   

  1. Modern Education Technology Center,Civil Aviation Flight University of China,Guanghan,Sichuan 618307,China1
    Institute of Flight Technology,Civil Aviation Flight University of China,Guanghan,Sichuan 618307,China2
  • Online:2019-02-26 Published:2019-02-26

摘要: 随着大数据时代的到来,异常检测受到了广泛关注。针对传统KNN异常检测算法处理速度和计算资源的瓶颈,以及Hadoop平台上的MapReduce不能友好支持迭代计算和基于内存计算等问题,提出了一种基于Spark平台的并行KNN异常检测算法。该算法首先对数据集进行分区和广播,然后用map函数计算数据集在每个分区的K近邻,使用reduce函数归并map函数的输出计算全局K近邻得到异常度,将异常度前n个对象视为异常。与传统KNN异常检测算法相比,在保证检测精度的前提下该算法的性能与计算资源呈近似线性关系;与其他并行异常检测算法相比,该算法无需额外扩展数据,支持迭代,而且通过在内存中缓存中间结果来减少I/O花销。实验结果证明,该算法可以提高KNN算法在大规模数据上的异常检测效率。

关键词: K近邻, Spark平台, 并行, 异常检测

Abstract: With the advent of big data era,outlier detection has attracted extensive attention.Computational resources of the traditional K-nearest neighbor outlier detection dealing with massive high dimensional data with single machine are insufficient,and the MapReduce in Hadoop cannot effectively deal with frequent iteration calculation problem.According to the above problems,this paper put forward a Spark-based parallel outlier detection algorithm of K-nearest neighbor,named SPKNN.Firstly,in the stage of map,the algorithm tries to find the local K nearest neighbors for each partition of the data in all data set.Then in the reduce stage,it determines the global K nearest neighbors according to the local K nearest neighbors of each partition.Finally,it calculates the degrees of outliers by using global K nearest neighbors and select outliers.Compared with the traditional K-nearest neighbor outlier detection,the performance of the SPKNN has an approximate linear relationship with computing resources in the premise of ensuring the detection accuracy.And compared with other outlier detection methods,it doesn’t need additional extension data,support iteration calculation and can reduce I/O costs by using memory cache.Experiment results of SPKNN show that it has high efficiency and scalability for massive data sets.

Key words: K-nearest neighbors, Outlier detection, Parallel, Spark

中图分类号: 

  • TP311
[1]陈运文,吴飞,吴庐山,等.基于异常检测的时间序列研究[J].计算机技术与发展,2015(4):166-170.
[2]HODGE V J,AUSTIN J.A survey of outlier detection metho-dologies[J].Artificial Intelligence Review,2004,22(2):85-126.
[3]邹云峰,张昕,宋世渊,等.基于局部密度的快速离群点检测算法[J].计算机应用,2017,37(10):2932-2937.
[4]KNORR,EDWIN M,NG,et al.Distance-based outliers:algorithms and applications[J].Vldb Journal,2000,8(3-4):237-253.
[5]BREUNIG M M.LOF:identifying density-based local outliers[J].ACM Sigmod Record,2000,29(2):93-104.
[6]辛丽玲.基于密度差异的离群点检测研究[D].北京:北京交通大学,2015.
[7]PAN Y,ZHANG J.Parallel Programming on Cloud Computing Platforms[J].Journal of Convergence Volume,2012,3:23-28.
[8]SUBRAMANYAM R B V,SONAM G.Map-Reduce Algorithm for Mining Outliers in the Large Data Sets using Twister Programing Model[J].International Journal of Computer Science and Electronics Engineering,2015,3(1):81-86.
[9]郭一鹏,梁吉业,赵兴旺.基于MapReduce的混合数据孤立点检测算法[J].小型微型计算机系统,2014,35(9):1961-1966.
[10]苟杰,马自堂,张喆程.PODKNN:面向大数据集的并行离群点检测算法[J].计算机科学,2016,43(7):251-254.
[11]KNORR E M,NG R T.A Unified Notion of Outliers:Properties and Computation[C]∥International Conference on Knowledge Discovery & Data Mining.1997:219-222.
[12]RAMASWAMY S,RASTOGI R,SHIM K.Efficient algorithms for mining outliers from large data sets[C]∥ACM SIGMOD International Conference on Management of Data.ACM,2000:427-438.
[13]高彦杰.Spark 大数据处理技术、应用与性能优化[M].北京:机械工业出版社,2014.
[14]AGGARWAL C C,YU P S.Outlier Detection for High Dimensional Data[J].ACM Sigmod Record,2001,30(2):37-46.
[15]ZHANG X,GONG K,ZHAO G.Parallel K-Medoids algorithm based on MapReduce[J].Journal of Computer Applications,2013,33(4):1023-1005.
[16]ALCALÁ-FDEZ J,FERNÁNDEZ A,LUENGO J,et al.KEEL Data-Mining Software Tool:Data Set Repository,Integration of Algorithms and Experimental Analysis Framework[J].Journal of Multiple-Valued Logic & Soft Computing,2011,17:255-287.
[1] 王子凯, 朱健, 张伯钧, 胡凯.
区块链与智能合约并行方法研究与实现
Research and Implementation of Parallel Method in Blockchain and Smart Contract
计算机科学, 2022, 49(9): 312-317. https://doi.org/10.11896/jsjkx.210800102
[2] 徐天慧, 郭强, 张彩明.
基于全变分比分隔距离的时序数据异常检测
Time Series Data Anomaly Detection Based on Total Variation Ratio Separation Distance
计算机科学, 2022, 49(9): 101-110. https://doi.org/10.11896/jsjkx.210600174
[3] 李其烨, 邢红杰.
基于最大相关熵的KPCA异常检测方法
KPCA Based Novelty Detection Method Using Maximum Correntropy Criterion
计算机科学, 2022, 49(8): 267-272. https://doi.org/10.11896/jsjkx.210700175
[4] 王馨彤, 王璇, 孙知信.
基于多尺度记忆残差网络的网络流量异常检测模型
Network Traffic Anomaly Detection Method Based on Multi-scale Memory Residual Network
计算机科学, 2022, 49(8): 314-322. https://doi.org/10.11896/jsjkx.220200011
[5] 魏恺轩, 付莹.
基于重参数化多尺度融合网络的高效极暗光原始图像降噪
Re-parameterized Multi-scale Fusion Network for Efficient Extreme Low-light Raw Denoising
计算机科学, 2022, 49(8): 120-126. https://doi.org/10.11896/jsjkx.220200179
[6] 杜航原, 李铎, 王文剑.
一种面向电商网络的异常用户检测方法
Method for Abnormal Users Detection Oriented to E-commerce Network
计算机科学, 2022, 49(7): 170-178. https://doi.org/10.11896/jsjkx.210600092
[7] 宗迪迪, 谢益武.
基于法线迭代的模型中轴生成方法
Model Medial Axis Generation Method Based on Normal Iteration
计算机科学, 2022, 49(6A): 764-770. https://doi.org/10.11896/jsjkx.210400050
[8] 陈鑫, 李芳, 丁海昕, 孙唯哲, 刘鑫, 陈德训, 叶跃进, 何香.
面向国产异构众核架构的CFD非结构网格计算并行优化方法
Parallel Optimization Method of Unstructured-grid Computing in CFD for DomesticHeterogeneous Many-core Architecture
计算机科学, 2022, 49(6): 99-107. https://doi.org/10.11896/jsjkx.210400157
[9] 武玉坤, 李伟, 倪敏雅, 许志骋.
单类支持向量机融合深度自编码器的异常检测模型
Anomaly Detection Model Based on One-class Support Vector Machine Fused Deep Auto-encoder
计算机科学, 2022, 49(3): 144-151. https://doi.org/10.11896/jsjkx.210100142
[10] 刘江, 刘文博, 张矩.
OpenFoam中多面体网格生成的MPI+OpenMP混合并行方法
Hybrid MPI+OpenMP Parallel Method on Polyhedral Grid Generation in OpenFoam
计算机科学, 2022, 49(3): 3-10. https://doi.org/10.11896/jsjkx.210700060
[11] 冷佳旭, 谭明圮, 胡波, 高新波.
基于隐式视角转换的视频异常检测
Video Anomaly Detection Based on Implicit View Transformation
计算机科学, 2022, 49(2): 142-148. https://doi.org/10.11896/jsjkx.210900266
[12] 刘意, 毛莺池, 程杨堃, 高建, 王龙宝.
基于邻域一致性的异常检测序列集成方法
Locality and Consistency Based Sequential Ensemble Method for Outlier Detection
计算机科学, 2022, 49(1): 146-152. https://doi.org/10.11896/jsjkx.201000156
[13] 李家振, 纪庆革.
动态低采样环境光遮蔽的实时光线追踪分子渲染
Dynamic Low-sampling Ambient Occlusion Real-time Ray Tracing for Molecular Rendering
计算机科学, 2022, 49(1): 175-180. https://doi.org/10.11896/jsjkx.210200042
[14] 张叶, 李志华, 王长杰.
基于核密度估计的轻量级物联网异常流量检测方法
Kernel Density Estimation-based Lightweight IoT Anomaly Traffic Detection Method
计算机科学, 2021, 48(9): 337-344. https://doi.org/10.11896/jsjkx.200600108
[15] 郭奕杉, 刘漫丹.
基于时空轨迹数据的异常检测
Anomaly Detection Based on Spatial-temporal Trajectory Data
计算机科学, 2021, 48(6A): 213-219. https://doi.org/10.11896/jsjkx.201100193
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!