Computer Science, 2023, Vol. 50, Issue (11A): 230200172-6. doi: 10.11896/jsjkx.230200172
HAN Yimei (韩怡梅)1, LI Dongxi (李东喜)2
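Abstract: Methods for handling high-dimensional data have become one of the central topics in current big-data research. This paper proposes a two-stage random forest model based on the projection correlation coefficient (Projection Correlation-Random Forest, PC-RF). The model combines the projection correlation coefficient, a measure of dependence between random variables, with the random forest algorithm, and achieves better predictive performance. An empirical analysis is carried out on three gene microarray datasets. On the Leukemia and Colon datasets, the proposed model improves accuracy by 2.4%~6.5% over existing algorithms; on the Breast dataset, it improves accuracy by 3.55%~9.26% over the traditional random forest model. It also performs stably and well on multiple evaluation metrics across high-dimensional datasets of different sizes. Applied to disease diagnosis based on microarray data, the proposed model is expected to provide more scientific and effective decision support for disease prevention, diagnosis, and treatment.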
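The first (screening) stage of a two-stage "screen, then classify" design can be illustrated with a minimal sketch. The abstract does not give the projection correlation formula, so the score below substitutes the absolute Pearson correlation purely as a stand-in; the function names `abs_pearson` and `screen_features`, the toy data, and the cutoff `k=10` are all assumptions made for illustration and are not taken from the paper.

```python
import math
import random


def abs_pearson(x, y):
    """Absolute Pearson correlation. A stand-in score, NOT the projection correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no signal
    return abs(sxy) / math.sqrt(sxx * syy)


def screen_features(X, y, k, score=abs_pearson):
    """Stage 1: score every column of X against y and keep the k best column indices."""
    n_features = len(X[0])
    scores = [score([row[j] for row in X], y) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Hypothetical toy data: 40 samples, 200 "genes"; only columns 3 and 7 depend on the label.
rng = random.Random(0)
y = [i % 2 for i in range(40)]
X = [[rng.gauss(0, 1) for _ in range(200)] for _ in y]
for row, label in zip(X, y):
    row[3] += 3.0 * label
    row[7] -= 3.0 * label

selected = screen_features(X, y, k=10)
X_reduced = [[row[j] for j in selected] for row in X]  # input to the second stage
```

In a full pipeline, `X_reduced` would then be passed to a random forest classifier (for example, one from scikit-learn), and the `score` argument would be replaced by an implementation of the projection correlation.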