Computer Science ›› 2023, Vol. 50 ›› Issue (11A): 230200172-6.doi: 10.11896/jsjkx.230200172

• Big Data & Data Science

Disease Diagnosis Based on Projection Correlation and Random Forest Fusion Model

HAN Yimei1, LI Dongxi2   

  1. College of Mathematics, Taiyuan University of Technology, Taiyuan 030024, China
  2. College of Big Data, Taiyuan University of Technology, Taiyuan 030024, China
  • Published: 2023-11-09
  • About author: HAN Yimei, born in 1998, postgraduate. Her main research interests include data mining and machine learning.
    LI Dongxi, born in 1982, Ph.D., associate professor. His main research interests include data analysis, data mining, machine learning, biostatistics and biomathematics.
  • Supported by:
    National Natural Science Foundation of China (11571009) and Research Project Supported by Shanxi Scholarship Council of China (2022-074).

Abstract: Processing high-dimensional data has become one of the central problems in big data research. This paper proposes a two-stage random forest algorithm based on projection correlation. The algorithm combines projection correlation, a measure of dependence between random variables, with the random forest algorithm, and achieves better prediction performance. Three gene expression datasets are used for experimental analysis. On the Leukemia and Colon datasets, the accuracy of the proposed model is 2.4% to 6.5% higher than that of existing algorithms. On the Breast dataset, the accuracy of the proposed algorithm is 3.55% to 9.26% higher than that of the traditional random forest model, and the algorithm performs stably and well across the evaluation metrics on high-dimensional data of different scales. Applying the model to disease diagnosis based on microarray data can provide more scientific and effective decision support for disease prevention, diagnosis and treatment.
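The two-stage idea in the abstract (screen features by a dependence measure, then classify on the retained features) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the absolute Pearson correlation is used as a simple stand-in for projection correlation, the toy data is synthetic, and the names `abs_pearson`, `screen_features` and `make_toy_data` are hypothetical.

```python
import random
from statistics import mean


def abs_pearson(x, y):
    """Stand-in dependence score: |Pearson correlation|.

    Assumption: this replaces projection correlation purely for illustration.
    """
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no information
    return abs(sxy) / (sxx * syy) ** 0.5


def screen_features(X, y, k, score=abs_pearson):
    """Stage 1: score every column of X against y and keep the top k indices."""
    n_features = len(X[0])
    scores = [score([row[j] for row in X], y) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def make_toy_data(n=60, p=200, informative=(3, 17), seed=0):
    """Synthetic 'many features, few samples' data (hypothetical, not gene data)."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = []
    for label in y:
        row = [rng.gauss(0.0, 1.0) for _ in range(p)]
        for j in informative:
            row[j] += 3.0 * label  # shift informative features for class 1
        X.append(row)
    return X, y


X, y = make_toy_data()
kept = screen_features(X, y, k=10)
X_reduced = [[row[j] for j in kept] for row in X]
# Stage 2 (not shown): fit a random forest classifier on X_reduced and y.
```

Screening first shrinks 200 candidate features to 10 before any tree is grown, so the second-stage forest would split mostly on features that already show marginal dependence on the label. A model-free measure such as projection correlation would replace `abs_pearson` through the `score` argument.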

Key words: Projection correlation, Random forest, High dimensional data, Feature selection, Machine learning

CLC Number: 

  • TP391