Computer Science ›› 2026, Vol. 53 ›› Issue (1): 115-127. doi: 10.11896/jsjkx.241000163

• Database & Big Data & Data Science •

Attribute Grouping-based Categorical Outlier Detection Using Isolation Forest Ensemble Strategy

SONG Yijing, ZHANG Jifu

  1. School of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan 030024, China
  • Received: 2024-10-29 Revised: 2025-02-14 Online: 2026-01-08
  • Corresponding author: ZHANG Jifu (jifuzh@sina.com)
  • About author: (b202115310016@stu.tyust.edu.cn)
    SONG Yijing, born in 1992, Ph.D. candidate, is a member of CCF (No. I9481G). Her main research interest is data mining.
    ZHANG Jifu, born in 1963, Ph.D., professor, Ph.D. supervisor, is a member of CCF (No. 05740D). His main research interests include big data analysis and parallel computing.
  • Supported by:
    National Natural Science Foundation of China (62172293).

Abstract: Attribute grouping is an effective approach to high-dimensional outlier detection. However, existing ensemble strategies for attribute-group outlier detection use only the local outlier information within each attribute group and ignore the global outlier information across all attribute groups, which biases the ensemble of attribute-group outlier information. To address this, this paper proposes a categorical attribute grouping outlier detection approach based on an Isolation Forest ensemble strategy that uses both the local and the global outlier information of attribute groups. First, attributes are automatically divided into several attribute groups according to the local and global correlation among attributes, and the outlier information of each data object is obtained in every attribute group. Second, the ensemble bias of existing outlier information ensemble strategies is analyzed theoretically from the perspective of attribute grouping, and an attribute-group ensemble deviation coefficient is defined as an evaluation index for such strategies. Third, an Isolation Forest-based ensemble strategy is designed that captures both the local and the global outlier information of attribute groups and reduces the ensemble bias of attribute-group outlier detection, and a categorical attribute grouping outlier detection algorithm is built on it. Finally, experimental results on UCI datasets show that the ensemble strategy alleviates the ensemble bias and improves outlier detection performance: compared with the competing methods, the algorithm improves the AUC and the detection efficiency by 7.83% and 48.43% on average, respectively.

Key words: Outlier detection, Attribute grouping, Ensemble deviation coefficient, Isolation Forest ensemble strategy, Global outlier information path
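
The pipeline outlined in the abstract (score each object inside every attribute group, then combine the group scores with an isolation forest) can be sketched roughly as below. This is only an illustration under stated assumptions, not the authors' algorithm: the attribute groups are supplied by hand instead of being derived from attribute correlation, the per-group score is a simple rarity of the value combination, the ensemble step is a plain isolation forest in the style of Liu et al. built over the vectors of group scores, and the ensemble deviation coefficient is not modelled. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def local_group_scores(rows, groups):
    """Assumed local score: 1 - relative frequency of the object's value
    combination inside each attribute group (groups = tuples of column indices)."""
    n = len(rows)
    scores = [[0.0] * len(groups) for _ in rows]
    for g, attrs in enumerate(groups):
        keys = [tuple(row[a] for a in attrs) for row in rows]
        freq = Counter(keys)
        for i, key in enumerate(keys):
            scores[i][g] = 1.0 - freq[key] / n
    return scores


def avg_path(n):
    """Normalising term c(n) used by isolation forests for a leaf holding n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def build_tree(points, depth, max_depth, rng):
    """Recursively split on a random non-constant dimension at a random value."""
    if depth >= max_depth or len(points) <= 1:
        return ("leaf", len(points))
    dims = [d for d in range(len(points[0]))
            if min(p[d] for p in points) < max(p[d] for p in points)]
    if not dims:  # all remaining points are identical
        return ("leaf", len(points))
    d = rng.choice(dims)
    lo = min(p[d] for p in points)
    hi = max(p[d] for p in points)
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[d] < split]
    right = [p for p in points if p[d] >= split]
    return ("node", d, split,
            build_tree(left, depth + 1, max_depth, rng),
            build_tree(right, depth + 1, max_depth, rng))


def path_length(tree, x, depth=0):
    """Depth at which x lands in a leaf, plus the c(n) correction for that leaf."""
    if tree[0] == "leaf":
        return depth + avg_path(tree[1])
    _, d, split, left, right = tree
    return path_length(left if x[d] < split else right, x, depth + 1)


def isolation_forest_ensemble(score_vectors, n_trees=100, sample_size=64, seed=0):
    """Combine per-group scores: short average paths give scores close to 1."""
    rng = random.Random(seed)
    psi = min(sample_size, len(score_vectors))
    if psi < 2:
        return [0.5] * len(score_vectors)
    max_depth = math.ceil(math.log2(psi))
    trees = [build_tree(rng.sample(score_vectors, psi), 0, max_depth, rng)
             for _ in range(n_trees)]
    norm = avg_path(psi)
    return [2.0 ** (-(sum(path_length(t, x) for t in trees) / n_trees) / norm)
            for x in score_vectors]


if __name__ == "__main__":
    # Toy categorical data: two frequent patterns and one mixed, rare object.
    rows = ([("red", "round", "small", "cheap")] * 10
            + [("blue", "square", "large", "costly")] * 10
            + [("red", "square", "small", "costly")])
    local = local_group_scores(rows, groups=[(0, 1), (2, 3)])
    final = isolation_forest_ensemble(local)
    print(max(range(len(rows)), key=lambda i: final[i]))  # index of the top-ranked object
```

Feeding the vector of group scores to the forest, rather than summing or averaging the scores, is one simple way to let the combination depend on how an object's scores sit relative to all other objects across groups; whether this matches the paper's global outlier information path is not stated in the abstract.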

CLC Number: 

  • TP311