Two Novel Tree Structure-based Methods for Gene Selection

doi:10.11896/j.issn.1002-137X.2015.07.053

Abstract

Cancer diagnosis is an important topic in bioinformatics, and selecting a cancer-related subset of genes from the thousands measured in gene expression (microarray) data, known as gene selection, is key to it. Random forests, a widely used algorithm in recent years, can assess the importance of features in classification (this measure is referred to as PBM). Motivated by this, we propose two tree structure-based gene selection methods, FBM and ABM, which use, respectively, the frequency with which a feature appears in the tree structures and the average of its importance scores as the indicator of feature importance. In the computational experiments, gene subsets are selected with each of the three methods, random forest classifiers are built on these subsets, and the AUC is used to evaluate the quality of gene selection. The results show that for PBM to reach an AUC of at least 0.900, it needs at least 26 genes on the Leukemia dataset and at least 48 genes on the Colon Cancer dataset. With only the top 10 genes selected, FBM and ABM both reach an AUC of 0.989 on the Leukemia dataset and an AUC of 0.900 on the Colon Cancer dataset. In addition, the proposed methods achieve higher accuracy than other typical gene selection methods such as mRMR and ECRP, which is of practical significance for the accurate diagnosis and early treatment of cancer.

Key words: Classification, Gene selection, Random forests
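The abstract states only that FBM ranks genes by how often they appear in the tree structures and that ABM ranks them by their average importance score. It does not give the exact definitions. The sketch below is one possible reading of that aggregation step, not the authors' implementation. The input format (one list of `(gene, score)` pairs per tree, one pair per split node), the function names, and the toy values are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Assumed input format (not from the paper): a forest is a list of trees,
# and each tree is a list of (gene_name, split_score) pairs, one pair for
# every internal (split) node of that tree.


def frequency_ranking(forest, top_k):
    """FBM-style ranking: order genes by how often they are used as a split."""
    counts = Counter(gene for tree in forest for gene, _ in tree)
    # Descending count, then gene name so that ties are broken deterministically.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:top_k]


def average_score_ranking(forest, top_k):
    """ABM-style ranking: order genes by their mean score over all their splits."""
    totals = defaultdict(float)
    uses = Counter()
    for tree in forest:
        for gene, score in tree:
            totals[gene] += score
            uses[gene] += 1
    means = {g: totals[g] / uses[g] for g in totals}
    ranked = sorted(means, key=lambda g: (-means[g], g))
    return ranked[:top_k]


# Toy forest with three small trees; the scores are made-up numbers.
toy_forest = [
    [("g1", 0.40), ("g2", 0.10)],
    [("g1", 0.20), ("g3", 0.90)],
    [("g2", 0.30), ("g1", 0.30)],
]
print(frequency_ranking(toy_forest, 2))
print(average_score_ranking(toy_forest, 2))
```

On the toy data the two rankings disagree: `g1` is split on most often, while `g3` appears only once but with a high score. This shows why frequency and average score can select different gene subsets. Both criteria read statistics from the trees themselves, so neither needs a pass beyond building the forest.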

XIE Qian-qian, LI Ding-fang, ZHANG Wen. Two Novel Tree Structure-based Methods for Gene Selection[J]. Computer Science (计算机科学), 2015, 42(7): 250-253. https://doi.org/10.11896/j.issn.1002-137X.2015.07.053

