Computer Science ›› 2015, Vol. 42 ›› Issue (Z6): 453-458.

• Data Mining •

Feature Extraction Method Based on Sparse Principal Components for Gene Expression Data

SHEN Ning-min, LI Jing, ZHOU Pei-yun and ZHUANG Yi

  1. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016
  • Online: 2018-11-14 Published: 2018-11-14
  • Funding:
    This work was supported by the Fundamental Research Funds for the Central Universities (NZ2013306)

Abstract: Clustering has become a leading method for analyzing gene expression data: partitioning genes into classes helps identify diseased cells quickly and thereby supports diagnosis. However, gene expression data are high-dimensional with few samples, so the raw data contain a large amount of redundant and noisy information, and clustering them directly leads to long running times and low accuracy. Principal component analysis (PCA) is a classical dimension reduction method that maps high-dimensional data into a low-dimensional space while preserving maximal variance. Its loadings, however, are generally all nonzero, which makes the principal components hard to interpret. This paper applies a sparse PCA method based on the truncated power method to extract features from gene expression data, and then clusters the sparsely extracted feature genes with K-means. Finally, experiments on three public gene datasets (colon cancer, leukemia and lung cancer) verify that the proposed feature extraction method improves both the accuracy and the efficiency of clustering gene expression data.

Key words: Gene expression data, Loadings, Truncated power, Sparse principal component analysis, Feature extraction
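
The pipeline summarized in the abstract (sparse loadings from a truncated power iteration, then K-means on the reduced data) can be illustrated with a minimal sketch. It is not the authors' implementation. The function names `truncated_power`, `sparse_pca` and `kmeans` are hypothetical, the use of projection deflation to obtain more than one component is an assumption the abstract does not confirm, and the data are synthetic rather than any of the three datasets used in the paper.

```python
import numpy as np


def truncated_power(A, k, n_iter=200, tol=1e-8):
    """Approximate the leading k-sparse eigenvector of a symmetric PSD matrix A."""
    p = A.shape[0]
    # Heuristic start (an assumption): the coordinate with the largest variance.
    x = np.zeros(p)
    x[np.argmax(np.diag(A))] = 1.0
    for _ in range(n_iter):
        y = A @ x
        # Truncation: keep the k entries of largest magnitude, zero the rest.
        keep = np.argsort(np.abs(y))[-k:]
        x_new = np.zeros(p)
        x_new[keep] = y[keep]
        norm = np.linalg.norm(x_new)
        if norm == 0.0:
            break
        x_new /= norm
        converged = np.linalg.norm(x_new - x) < tol
        x = x_new
        if converged:
            break
    return x


def sparse_pca(X, n_components, k):
    """Return a (features x n_components) matrix of k-sparse loading vectors."""
    Xc = X - X.mean(axis=0)
    A = Xc.T @ Xc / (X.shape[0] - 1)  # sample covariance matrix
    p = A.shape[0]
    V = np.zeros((p, n_components))
    for j in range(n_components):
        v = truncated_power(A, k)
        V[:, j] = v
        # Projection deflation (assumed here): remove the direction v from A.
        P = np.eye(p) - np.outer(v, v)
        A = P @ A @ P
    return V


def kmeans(Z, n_clusters, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per row of Z."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), n_clusters, replace=False)]
    for _ in range(n_iter):
        dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            Z[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


# Synthetic stand-in for expression data: 40 samples, 200 "genes",
# where only the first 10 genes differ between the two sample groups.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 200))
X[20:, :10] += 3.0

V = sparse_pca(X, n_components=2, k=10)  # sparse loadings
Z = (X - X.mean(axis=0)) @ V             # samples projected onto them
labels = kmeans(Z, n_clusters=2)
```

Each column of `V` has at most `k` nonzero loadings, so every extracted feature depends on a small, nameable set of genes. This is the interpretability gain over ordinary PCA that the abstract refers to.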

