Computer Science ›› 2023, Vol. 50 ›› Issue (11A): 230200172-6. doi: 10.11896/jsjkx.230200172

• Big Data & Data Science •

Disease Diagnosis Based on Projection Correlation and Random Forest Fusion Model

HAN Yimei1, LI Dongxi2

  1 College of Mathematics, Taiyuan University of Technology, Taiyuan 030024, China
  2 College of Big Data, Taiyuan University of Technology, Taiyuan 030024, China
  • Published: 2023-11-09
  • Corresponding author: LI Dongxi (dxli0426@126.com)
  • About author: HAN Yimei (hanyimei1998@163.com), born in 1998, postgraduate. Her main research interests include data mining and machine learning.
    LI Dongxi, born in 1982, Ph.D., associate professor. His main research interests include data analysis, data mining, machine learning, biostatistics and biomathematics.
  • Supported by:
    National Natural Science Foundation of China (11571009) and Research Project Supported by Shanxi Scholarship Council of China (2022-074).

Abstract: Methods for processing high-dimensional data have become one of the active topics in big data research. This paper proposes a two-stage random forest model based on the projection correlation coefficient (Projection Correlation-Random Forest, PC-RF), which combines projection correlation, a measure of dependence between random variables, with the random forest algorithm and achieves better predictive performance. Three gene microarray datasets are used for empirical analysis. On the Leukemia and Colon datasets, the accuracy of the proposed model is 2.4%~6.5% higher than that of existing algorithms. On the Breast dataset, its accuracy is 3.55%~9.26% higher than that of the traditional random forest model. The model also performs stably and well on multiple evaluation metrics across high-dimensional datasets of different sizes. Applied to disease diagnosis based on microarray data, the proposed model can provide more scientific and effective decision support for disease prevention, diagnosis, and treatment.

Key words: Projection correlation, Random forest, High-dimensional data, Feature selection, Machine learning
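
The two-stage idea in the abstract (screen features first, then classify with a random forest) can be illustrated with a minimal sketch of the screening stage. This is not the authors' implementation: the score used here is the absolute Pearson correlation, a simple stand-in in the spirit of marginal correlation screening, not the projection correlation used by PC-RF. The function names (`abs_pearson`, `screen_features`, `select_columns`), the `keep` parameter, and the toy data are all hypothetical.

```python
import math


def abs_pearson(x, y):
    """Absolute Pearson correlation; returns 0.0 if either input is constant.

    Assumption: this is only a stand-in dependence score, NOT projection correlation.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)


def screen_features(rows, labels, keep, score=abs_pearson):
    """Stage 1: score every column against the labels and keep the top `keep`.

    `score` is pluggable, so another dependence measure could be swapped in.
    Returns the kept column indices (ascending) and all scores.
    """
    n_features = len(rows[0])
    scores = [score([row[j] for row in rows], labels) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:keep]), scores


def select_columns(rows, kept):
    """Build the reduced data matrix that stage 2 (the classifier) would see."""
    return [[row[j] for j in kept] for row in rows]


# Hypothetical toy data: 6 samples x 4 "genes"; only genes 1 and 3 track the label.
rows = [
    [5.0, 0.1, 2.0, 9.0],
    [5.0, 0.2, 1.0, 8.5],
    [5.0, 0.1, 2.0, 9.2],
    [5.0, 0.9, 1.0, 1.0],
    [5.0, 1.0, 2.0, 1.3],
    [5.0, 0.8, 1.0, 0.7],
]
labels = [0, 0, 0, 1, 1, 1]

kept, scores = screen_features(rows, labels, keep=2)
reduced = select_columns(rows, kept)
```

In a full pipeline, `reduced` would then be passed to a random forest classifier (for example, scikit-learn's `RandomForestClassifier`), and the number of retained features would be tuned rather than fixed by hand as it is here.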

CLC Number:

  • TP391