Computer Science (计算机科学), 2013, Vol. 40, Issue (7): 216-221.

• Artificial Intelligence •

A Fast Mutual-Information-Based Method for Selecting Classification Feature Genes with Fuzzy Rough Sets

徐菲菲,魏莱,杜海洲,王文欢   

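  1. School of Computer and Information Engineering, Shanghai University of Electric Power, Shanghai 200090; College of Information Engineering, Shanghai Maritime University, Shanghai 201303; School of Computer and Information Engineering, Shanghai University of Electric Power, Shanghai 200090; School of Energy and Environmental Engineering, Shanghai University of Electric Power, Shanghai 200090
  • Published: 2018-11-16 Online: 2018-11-16
  • Funding:
    This work was supported by a sub-project of the National Basic Research Program of China (973 Program) (2009CB219801), the Scientific Research Innovation Project of the Shanghai Municipal Education Commission (12YZ140), and the Shanghai funding scheme for training young university teachers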

Fast Approach to Mutual Information Based Gene Selection with Fuzzy Rough Sets

XU Fei-fei, WEI Lai, DU Hai-zhou and WANG Wen-huan

  • Online: 2018-11-16 Published: 2018-11-16

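Abstract (translated from the Chinese): The key to building an effective tumor classification model from gene expression profiles is to accurately identify a set of feature genes that determines the sample class. Rough set theory has been applied successfully to feature gene selection for tumor classification. However, rough set methods require continuous-valued gene expression data to be discretized, and this step loses some information, which affects the classification accuracy of the selected feature genes. For this reason, a fuzzy rough set algorithm based on mutual information was proposed earlier for selecting feature genes from gene expression datasets. That algorithm is computationally expensive, however, and becomes impractical when many genes are to be selected. This paper therefore improves the algorithm by approximating the mutual information computation in terms of two criteria, maximum relevance and maximum significance (minimum redundancy), which greatly reduces the algorithm's complexity and improves its efficiency. Experiments on feature gene selection for classifying acute leukemia subtypes (leukemia), colon cancer (colon), and breast cancer (Breast) were carried out, and the classification accuracy of the selected feature genes was then tested with 1NN and SVM classifiers. The results confirm the feasibility and effectiveness of the new method.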

Keywords: feature selection, fuzzy rough sets, mutual information, gene expression datasets
CLC number: TP18 Document code: A

Abstract: Feature selection is an essential step in cancer classification with DNA microarrays. Rough set theory has already been applied successfully to gene selection. To avoid the information loss caused by discretizing continuous gene expression data in rough set theory, fuzzy rough set theory is applied to gene selection. A fuzzy rough attribute reduction algorithm based on mutual information was previously proposed and applied to gene selection, but its computational cost is too high for it to be carried out when the number of selected genes is large. This paper proposes an approximate replacement for the mutual information computation, based on both maximum relevance and maximum significance. The new method reduces the complexity and improves the efficiency of the algorithm. Extensive experiments were conducted on three public gene expression datasets, and the results confirm the efficiency and effectiveness of the algorithm.

Key words: Feature selection, Fuzzy rough sets, Mutual information, Gene expression data
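
The abstract does not give the formulas, so the following is only a minimal sketch of the general idea, under stated assumptions: each gene induces a fuzzy similarity relation over the samples (so no discretization is needed), entropy is computed from the fuzzy cardinality of each sample's similarity class, and genes are picked greedily by relevance to the class label minus mean pairwise mutual information with the genes already chosen. The similarity function, the entropy definition, the pairwise scoring rule, and all names (`similarity_matrix`, `select_genes`, and so on) are illustrative assumptions and are not taken from the paper; the scoring rule is an mRMR-style stand-in, not the paper's relevance/significance criterion.

```python
import math

def similarity_matrix(values):
    """Assumed fuzzy relation for one gene: r_ij = max(0, 1 - |x_i - x_j| / range)."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant genes
    return [[max(0.0, 1.0 - abs(a - b) / span) for b in values] for a in values]

def crisp_matrix(labels):
    """Crisp equivalence relation induced by the class labels."""
    return [[1.0 if a == b else 0.0 for b in labels] for a in labels]

def intersect(r, s):
    """Joint relation of two attributes, using min as the t-norm (an assumption)."""
    return [[min(a, b) for a, b in zip(row_r, row_s)] for row_r, row_s in zip(r, s)]

def entropy(r):
    """Fuzzy information entropy: -(1/n) * sum_i log2(|[x_i]_R| / n)."""
    n = len(r)
    return -sum(math.log2(sum(row) / n) for row in r) / n

def mutual_information(r, s):
    """I(R; S) = H(R) + H(S) - H(R and S)."""
    return entropy(r) + entropy(s) - entropy(intersect(r, s))

def select_genes(expression, labels, k):
    """Greedy selection of up to k genes.

    expression: dict mapping a gene name to its per-sample expression values.
    Only pairwise mutual information is used, so the cost per step does not
    grow with the dimension of the already-selected subset.
    """
    relations = {g: similarity_matrix(v) for g, v in expression.items()}
    target = crisp_matrix(labels)
    relevance = {g: mutual_information(r, target) for g, r in relations.items()}
    selected, candidates = [], set(expression)

    def score(gene):
        if not selected:
            return relevance[gene]
        redundancy = sum(
            mutual_information(relations[gene], relations[s]) for s in selected
        ) / len(selected)
        return relevance[gene] - redundancy

    while candidates and len(selected) < k:
        best = max(sorted(candidates), key=score)  # sorted() makes ties deterministic
        selected.append(best)
        candidates.remove(best)
    return selected

if __name__ == "__main__":
    # Hypothetical toy data: 6 samples, 2 classes.
    toy = {
        "g1": [0.10, 0.20, 0.15, 0.90, 0.95, 0.85],    # tracks the class label
        "noise": [0.50, 0.10, 0.90, 0.50, 0.10, 0.90],  # same pattern in both classes
    }
    print(select_genes(toy, [0, 0, 0, 1, 1, 1], k=1))
```

Scoring each candidate against the selected genes one pair at a time is what keeps a scheme like this cheap. Evaluating the mutual information between the whole selected subset and the class label at every step is the kind of cost the abstract describes as prohibitive when many genes are selected.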

