PODKNN:面向大数据集的并行离群点检测算法

doi:10.11896/j.issn.1002-137X.2016.07.045

Abstract

Abstract: In order to improve the outlier detection algorithm’s efficiency of dealing with large-scale data set,a parallel outlier detection based on K-nearest neighborhood was put forward.This algorithm can find the K-nearest neighborhood and calculate the degrees of outliers by using partitioning strategy for pretreatment of data sets,and then it merges the results and selects outliers.The algorithm is designed to suit for the MapReduce programming model to implement parallelization and improve the computational efficiency of dealing with large-scale data sets.The experimental results show that the PODKNN has the advantages of high speedup and good scalability.

Key words: Data mining,Outlier detection,K-nearest neighborhood,MapReduce

GOU Jie, MA Zi-tang and ZHANG Zhe-cheng. PODKNN:A Parallel Outlier Detection Algorithm for Large Dataset[J].Computer Science, 2016, 43(7): 251-254.

References

[1] Wang Zhen-jia,Li Guo-jun, Robinson Robert W,et al.UniBic:Sequential Row-based Biclustering Algorithm for Analysis of Gene Expression Data[J].Scientific Reports,2016,6
[2] Hartigan J A.Direct Clustering of a Data Matrix[J].Journal of the American Statistical Association,1972,67(337):123-129
[3] Cheng Yi-zong,Church George M.Biclustering of ExpressionData [C]∥Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (ISMB).AAAI,2000:93-103
[4] Amir B D,Benny C,Richard K,et al.Discovering Local Structure in Gene Expression Data:the Order-Preserving Submatrix Problem [J].Journal of Computational Biology,2003,10(3/4):373-384
[5] Li C H,Sun Z H.GridOF:An Efficient Outlier Detection Algorithm for very Large Datasets[J].Journal of Computer Research and Development,2003,0(11):1586-1592(in Chinese) 李存华,孙志挥.GridOF:面向大规模数据集的高效离群点检测算法[J].计算机研究与发展,2003,40(11):1586-1592
[6] Yu Dan-tong,Gholambosein S,Zhang Ai-dong.FindOut:Finding Outliers in Very Large Datasets[J].Knowledge and Information Systems,2002,4(4):387-412
[7] Xue A R,Ju S G.Outlier Mining Based on Spatial Constraint[J].Computer Science,2007,4(6):207-210(in Chinese) 薛安荣,鞠时光.基于空间约束的离群点挖掘算法[J].计算机科学,2007,4(6):207-210
[8] Wang J C,Zhang J C,Jiang X Y.An Effective and Efficient Approach to Detect and Predict Outliers visually[J].Computer Scien-ce,2007,4(6):200-203(in Chinese) 汪加才,张金城,江效尧.一种有效的可视化孤立点发现与预测新途径[J].计算机科学,2007,4(6):200-203
[9] Liu H,Wu J J,Su J Q.Differentiation Distance-based Outliers Detection Algorithm[J].Application Research of Computers,2010,7(9):3316-3318(in Chinese) 刘欢,吴介军,苏锦旗.基于分化距离的离群点检测算法[J].计算机应用研究,2010,7(9):3316-3318
[10] Jiang F,Du J W,et al.Outlier Detection Based on Boundary and Distance[J].Acta Electronica Sinica,2010,8(3):700-705(in Chinese) 江峰,杜军威,等.基于边界和距离的离群点检测[J].电子学报,2010,8(3):700-705
[11] Hu Y,Shi J,Wang C J,et al.Outlier Detection Algorithm based on Global Nearest Neighborhood[J].Journal of Computer Applications,2011,1(10):2778-2781(in Chinese) 胡云,施珺,王崇骏,等.基于全局最近邻的离群点检测算法[J].计算机应用,2011,1(10):2778-2781
[12] Pan Y,Zhang J B.Parallel Programming on Cloud ComputingPlatforms:Challenges and Solution [J].KITCS/FTRA Journal of Convergence,2012,3(4):23-28
[13] Dean J,Ghemawat S.MapReduce:simplified data processing on large clusters [J].Communications of the ACM,2008,1(1):107-113
[14] Yi X W,Li T R,et al.Performance Testing and Analysis among Different MapReduce Runtime Systems[J].Computer Science,2015,2(5):24-27(in Chinese) 易修文,李天瑞,等.不同MapReduce运行系统的性能测试与分析[J].计算机科学,2015,2(5):24-27
[15] Subramanyam R B V,Sonam G.Map-Reduce Algorithm forMining Outliers in the Large Data Sets using Twister Programming Model[J].International Journal of Computer Science and Electronics Engineering,2015,3(1):81-86
[16] Liu X Y,Li J W,Yu H,et al.Adaptive Spectral Clustering Based on Shared Nearest Neighbors[J].Journal of Chinese Computer Systems,2011,2(9):1876-1880(in Chinese) 刘馨月,李静伟,于红,等.基于共享近邻的自适应谱聚类[J].小型微型计算机系统,2011,2(9):1876-1880
[17] Lu W,Shen Y,Chen S,et al.Efficient processing of k nearest neighbor joins using mapreduce[J].PVLDB,2012,5(10):1016-1027
[18] Zhang C,Li F,Jestes J.Efficient parallel kNN joins for large data in MapReduce[C]∥Proc.of the 15th Int’l Conf.on Extending Database Technology.ACM Press,2012:38-49

Metrics

Viewed

Full text

Abstract

Cited

Shared

Discussed

Comments

Recommended 0

No Suggested Reading articles found!

PODKNN:A Parallel Outlier Detection Algorithm for Large Dataset

PDF (PC)

Abstract

Cite this article

share this article

References

Related Articles 0

Metrics

Comments

Recommended 0