PODKNN:面向大数据集的并行离群点检测算法

doi:10.11896/j.issn.1002-137X.2016.07.045

摘要/Abstract

摘要： 针对现有离群点检测算法在运用于大规模数据集时时间效率较低的问题,提出一种基于K近邻的并行离群点检测算法PODKNN (Parallel Outlier Detection Based on K-nearest Neighborhood)。该算法利用划分策略对数据集进行预处理,在规模较小的子集中寻找K近邻并计算离群度,最后合并结果并遴选出离群点,设计算法过程使其符合MapReduce的编程模型,实现并行化,从而提高了离群点检测算法处理大规模数据的计算效率。实验结果表明,PODKNN具有较高的加速比及较好的扩展性。

关键词: 数据挖掘,离群点检测,K近邻,MapReduce

Abstract: In order to improve the outlier detection algorithm’s efficiency of dealing with large-scale data set,a parallel outlier detection based on K-nearest neighborhood was put forward.This algorithm can find the K-nearest neighborhood and calculate the degrees of outliers by using partitioning strategy for pretreatment of data sets,and then it merges the results and selects outliers.The algorithm is designed to suit for the MapReduce programming model to implement parallelization and improve the computational efficiency of dealing with large-scale data sets.The experimental results show that the PODKNN has the advantages of high speedup and good scalability.

Key words: Data mining,Outlier detection,K-nearest neighborhood,MapReduce

苟杰,马自堂,张喆程. PODKNN:面向大数据集的并行离群点检测算法[J]. 计算机科学, 2016, 43(7): 251-254. https://doi.org/10.11896/j.issn.1002-137X.2016.07.045

GOU Jie, MA Zi-tang and ZHANG Zhe-cheng. PODKNN:A Parallel Outlier Detection Algorithm for Large Dataset[J]. Computer Science, 2016, 43(7): 251-254. https://doi.org/10.11896/j.issn.1002-137X.2016.07.045

参考文献

[1] Wang Zhen-jia,Li Guo-jun, Robinson Robert W,et al.UniBic:Sequential Row-based Biclustering Algorithm for Analysis of Gene Expression Data[J].Scientific Reports,2016,6
[2] Hartigan J A.Direct Clustering of a Data Matrix[J].Journal of the American Statistical Association,1972,67(337):123-129
[3] Cheng Yi-zong,Church George M.Biclustering of ExpressionData [C]∥Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (ISMB).AAAI,2000:93-103
[4] Amir B D,Benny C,Richard K,et al.Discovering Local Structure in Gene Expression Data:the Order-Preserving Submatrix Problem [J].Journal of Computational Biology,2003,10(3/4):373-384
[5] Li C H,Sun Z H.GridOF:An Efficient Outlier Detection Algorithm for very Large Datasets[J].Journal of Computer Research and Development,2003,0(11):1586-1592(in Chinese) 李存华,孙志挥.GridOF:面向大规模数据集的高效离群点检测算法[J].计算机研究与发展,2003,40(11):1586-1592
[6] Yu Dan-tong,Gholambosein S,Zhang Ai-dong.FindOut:Finding Outliers in Very Large Datasets[J].Knowledge and Information Systems,2002,4(4):387-412
[7] Xue A R,Ju S G.Outlier Mining Based on Spatial Constraint[J].Computer Science,2007,4(6):207-210(in Chinese) 薛安荣,鞠时光.基于空间约束的离群点挖掘算法[J].计算机科学,2007,4(6):207-210
[8] Wang J C,Zhang J C,Jiang X Y.An Effective and Efficient Approach to Detect and Predict Outliers visually[J].Computer Scien-ce,2007,4(6):200-203(in Chinese) 汪加才,张金城,江效尧.一种有效的可视化孤立点发现与预测新途径[J].计算机科学,2007,4(6):200-203
[9] Liu H,Wu J J,Su J Q.Differentiation Distance-based Outliers Detection Algorithm[J].Application Research of Computers,2010,7(9):3316-3318(in Chinese) 刘欢,吴介军,苏锦旗.基于分化距离的离群点检测算法[J].计算机应用研究,2010,7(9):3316-3318
[10] Jiang F,Du J W,et al.Outlier Detection Based on Boundary and Distance[J].Acta Electronica Sinica,2010,8(3):700-705(in Chinese) 江峰,杜军威,等.基于边界和距离的离群点检测[J].电子学报,2010,8(3):700-705
[11] Hu Y,Shi J,Wang C J,et al.Outlier Detection Algorithm based on Global Nearest Neighborhood[J].Journal of Computer Applications,2011,1(10):2778-2781(in Chinese) 胡云,施珺,王崇骏,等.基于全局最近邻的离群点检测算法[J].计算机应用,2011,1(10):2778-2781
[12] Pan Y,Zhang J B.Parallel Programming on Cloud ComputingPlatforms:Challenges and Solution [J].KITCS/FTRA Journal of Convergence,2012,3(4):23-28
[13] Dean J,Ghemawat S.MapReduce:simplified data processing on large clusters [J].Communications of the ACM,2008,1(1):107-113
[14] Yi X W,Li T R,et al.Performance Testing and Analysis among Different MapReduce Runtime Systems[J].Computer Science,2015,2(5):24-27(in Chinese) 易修文,李天瑞,等.不同MapReduce运行系统的性能测试与分析[J].计算机科学,2015,2(5):24-27
[15] Subramanyam R B V,Sonam G.Map-Reduce Algorithm forMining Outliers in the Large Data Sets using Twister Programming Model[J].International Journal of Computer Science and Electronics Engineering,2015,3(1):81-86
[16] Liu X Y,Li J W,Yu H,et al.Adaptive Spectral Clustering Based on Shared Nearest Neighbors[J].Journal of Chinese Computer Systems,2011,2(9):1876-1880(in Chinese) 刘馨月,李静伟,于红,等.基于共享近邻的自适应谱聚类[J].小型微型计算机系统,2011,2(9):1876-1880
[17] Lu W,Shen Y,Chen S,et al.Efficient processing of k nearest neighbor joins using mapreduce[J].PVLDB,2012,5(10):1016-1027
[18] Zhang C,Li F,Jestes J.Efficient parallel kNN joins for large data in MapReduce[C]∥Proc.of the 15th Int’l Conf.on Extending Database Technology.ACM Press,2012:38-49

Metrics

Viewed

Full text

Abstract

Cited

Shared

Discussed