Computer Science, 2021, Vol. 48, Issue (2): 121-127. doi: 10.11896/jsjkx.191100141

• Database & Big Data & Data Science •

Unsupervised Anomaly Detection Method for High-dimensional Big Data Analysis

ZOU Cheng-ming1,2,3, CHEN De2

  1 Hubei Key Laboratory of Transportation Internet of Things Technology, Wuhan 430070, China
  2 School of Computer Science and Technology, Wuhan University of Technology, Wuhan 430070, China
  3 Peng Cheng Laboratory, Shenzhen, Guangdong 518000, China
  • Received: 2019-11-19 Revised: 2020-04-02 Online: 2021-02-15 Published: 2021-02-04
  • Corresponding author: CHEN De (chandler@whut.edu.cn)
  • Author e-mail: zoucm@whut.edu.cn
  • About author: ZOU Cheng-ming, born in 1975, Ph.D., professor, is a member of the China Computer Federation. His main research interests include computer vision, embedded systems, and software theory and methods.
    CHEN De, born in 1995, postgraduate. His main research interests include deep learning and data mining.
  • Supported by:
    The National Key R&D Program of China (2018YFC0704300).

Abstract: Unsupervised anomaly detection on high-dimensional data is one of the most significant challenges in machine learning. Although previous approaches based on a single deep auto-encoder and density estimation have made significant progress, they generate the low-dimensional representation with only one deep auto-encoder, which leaves insufficient information for the subsequent density estimation task. To address this problem, this paper proposes a mixed auto-encoding Gaussian mixture model (MAGMM). MAGMM replaces the single deep auto-encoder with a mixture of auto-encoders that generates concatenated low-dimensional representations, so that it can preserve key information from the specific cluster of each input sample. In addition, it uses an allocation network to constrain the mixture of auto-encoders, so that each sample is assigned to one dominant auto-encoder. With these mechanisms, MAGMM avoids becoming trapped in local optima and reduces reconstruction errors, which facilitates the density estimation task and improves the accuracy of anomaly detection on high-dimensional data. Experimental results show that the proposed method outperforms DAGMM and achieves up to a 29% improvement in the standard F1 score.

Key words: Data mining, Density estimation, Dimensionality reduction, Gaussian mixture model, Unsupervised anomaly detection
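
The abstract describes the mechanism only at a high level: several auto-encoders, an allocation network that picks a dominant one per sample, and a concatenated low-dimensional representation passed on to density estimation. The following is a minimal forward-pass sketch of that idea, not the authors' implementation. The single-layer linear auto-encoders, the linear-plus-softmax allocation network, the allocation-weighted reconstruction-error feature, and every name used (`LinearAutoencoder`, `mixed_representation`, `alloc_weights`) are assumptions made for illustration; the weights are random and untrained.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(weights, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


class LinearAutoencoder:
    """Hypothetical stand-in for one deep auto-encoder: one linear layer each way."""

    def __init__(self, in_dim, code_dim, rng):
        self.enc = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(code_dim)]
        self.dec = [[rng.uniform(-0.5, 0.5) for _ in range(code_dim)] for _ in range(in_dim)]

    def encode(self, x):
        return matvec(self.enc, x)

    def decode(self, z):
        return matvec(self.dec, z)


def mixed_representation(x, autoencoders, alloc_weights):
    """Return (concatenated representation, allocation probabilities) for one sample."""
    # Assumed allocation network: a single linear layer followed by softmax,
    # giving one probability per auto-encoder.
    probs = softmax(matvec(alloc_weights, x))
    codes, errors = [], []
    for ae in autoencoders:
        z = ae.encode(x)
        x_hat = ae.decode(z)
        codes.extend(z)  # concatenate every auto-encoder's code
        errors.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_hat))))
    # Assumed feature: reconstruction error weighted by the allocation probabilities.
    weighted_error = sum(p * e for p, e in zip(probs, errors))
    return codes + [weighted_error], probs


rng = random.Random(0)
autoencoders = [LinearAutoencoder(in_dim=4, code_dim=2, rng=rng) for _ in range(3)]
alloc_weights = [[rng.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(3)]
representation, probs = mixed_representation([1.0, 0.5, -0.5, 2.0], autoencoders, alloc_weights)
dominant = max(range(len(probs)), key=probs.__getitem__)
print(len(representation), dominant)
```

In a full pipeline, a vector like `representation` would presumably be passed to a Gaussian mixture density estimator, with low-density samples flagged as anomalies. Training the auto-encoders, the allocation network, and the mixture model is outside the scope of this sketch.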

CLC Number:

  • TP391