Computer Science, 2019, Vol. 46, Issue (7): 300-307. doi: 10.11896/j.issn.1002-137X.2019.07.046

• Interdisciplinary and Frontier •

Cancer Classification Prediction Model Based on Correlation and Similarity

ZHANG Xue-fu, ZENG Pan, JIN Min

  1. (College of Computer Science and Electronic Engineering, Hunan University, Changsha 410006, China)
  • Received: 2018-06-15 Online: 2019-07-15 Published: 2019-07-15
  • About the authors: ZHANG Xue-fu (born 1992), male, master's student, main research interest: data analysis, E-mail: xuefu_zhang@163.com; ZENG Pan (born 1990), male, master's student, main research interest: data analysis; JIN Min (born 1973), female, Ph.D., professor, main research interest: data analysis, E-mail: jinmin@hnu.edu.cn (corresponding author).
  • Supported by:
    National Natural Science Foundation of China (61773157)

Abstract: Cancer diagnosis based on empirical histopathology often has a high misdiagnosis rate. Analyzing cancer at the gene level is currently one of the important ways to improve the accuracy of cancer classification prediction. Biological studies have shown that genes associated with the same cancer share common functional characteristics. Based on this, this paper proposes an ensemble method for cancer classification prediction that combines correlation and similarity. First, on the one hand, the differential expression of genes is analyzed from a statistical perspective, and mutual information is used to compute correlation on gene expression profile data; on the other hand, similarity between genes is analyzed from the perspective of biological mechanism, and topological similarity and semantic similarity are used to compute the functional similarity between genes on the protein-protein interaction network and GO data, respectively. The two are combined: the feature gene set is selected by simultaneously maximizing the correlation and similarity of the target set. Then, the Bootstrap method is used to draw diverse samples from the dataset, and, on the basis of the selected feature gene set, multiple machine learning algorithms are trained to obtain multiple, strongly differentiated classification prediction models. Finally, the resulting models classify the test samples, and a decision model produces the final classification result. Classification prediction was studied on four different cancer datasets from GEO, and the proposed method was compared comprehensively with recent methods. The classification accuracy of the proposed method improved by about 5% on each dataset, and by up to 10% compared with the IG/SGA method. The experimental results show that the method combining correlation and similarity effectively improves the accuracy of cancer classification prediction, that the selected feature genes help reveal biological significance, and that combining the complementary strengths of multiple algorithms addresses the limited applicability of a single classification algorithm.
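The feature selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the objective function, so several details here are assumptions rather than the authors' method: the greedy search, the equal weighting of the two terms, the use of a Jaccard index over PPI neighbour sets as the topological similarity, and the names `select_genes`, `expr`, and `neighbours`. Semantic (GO-based) similarity is omitted for brevity, and expression values are assumed to be already discretized.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )

def jaccard(a, b):
    """Jaccard index of two neighbour sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def select_genes(expr, labels, neighbours, k):
    """Greedily pick k genes, scoring each candidate by its relevance to the
    class labels plus its mean similarity to the genes already chosen.
    (Assumed objective; the paper's exact formulation is not in the abstract.)"""
    relevance = {g: mutual_information(v, labels) for g, v in expr.items()}
    selected = []
    candidates = set(expr)
    while candidates and len(selected) < k:
        def score(g):
            if not selected:
                return relevance[g]
            sim = sum(jaccard(neighbours[g], neighbours[s]) for s in selected)
            return relevance[g] + sim / len(selected)
        # sorted() makes tie-breaking deterministic
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical toy data: 4 samples, 3 genes with discretized expression levels.
labels = [0, 0, 1, 1]
expr = {"g1": [0, 0, 1, 1], "g2": [0, 1, 0, 1], "g3": [0, 0, 1, 0]}
neighbours = {"g1": {"a", "b"}, "g2": {"a", "b"}, "g3": {"c"}}
chosen = select_genes(expr, labels, neighbours, 2)
```

In this toy case `g1` is chosen first on relevance alone, and `g2` is then preferred over the more relevant `g3` because it shares all of its network neighbours with `g1`, which shows how the similarity term can change the ranking.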

Key words: Cancer classification, Correlation, Diversity sampling, Multiple algorithms and multiple models, Semantic similarity, Topological similarity
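The multi-algorithm, multi-model stage described in the abstract can be sketched as follows. This is a hedged illustration only: the abstract does not name the learning algorithms or specify the decision model, so the two simple classifiers (nearest centroid and 1-nearest-neighbour) and the majority vote are stand-in assumptions, as are all function names.

```python
import random
from collections import Counter

def bootstrap(X, y, rng):
    """Draw len(X) samples with replacement (one Bootstrap replicate)."""
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    return [X[i] for i in idx], [y[i] for i in idx]

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def fit_nearest_centroid(X, y):
    """Return a predictor that assigns the class of the closest class mean."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    centroids = {
        c: [sum(col) / len(rows) for col in zip(*rows)]
        for c, rows in groups.items()
    }
    return lambda x: min(sorted(centroids), key=lambda c: sq_dist(x, centroids[c]))

def fit_one_nn(X, y):
    """Return a predictor that copies the label of the closest training sample."""
    pairs = list(zip(X, y))
    return lambda x: min(pairs, key=lambda p: sq_dist(x, p[0]))[1]

def train_ensemble(X, y, learners, rounds, seed=0):
    """Train every learner on its own Bootstrap replicate in each round."""
    rng = random.Random(seed)
    models = []
    for _ in range(rounds):
        for fit in learners:
            Xb, yb = bootstrap(X, y, rng)
            models.append(fit(Xb, yb))
    return models

def predict(models, x):
    """Majority vote over all models (assumed stand-in for the decision model)."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two selected feature genes, six samples.
X = [[0.1, 0.2], [0.0, 0.3], [0.2, 0.1], [0.9, 1.0], [1.0, 0.8], [0.8, 0.9]]
y = ["normal"] * 3 + ["tumor"] * 3
models = train_ensemble(X, y, [fit_nearest_centroid, fit_one_nn], rounds=5)
prediction = predict(models, [0.05, 0.2])
```

Resampling gives each model a different view of the training data, and mixing learner types adds a second source of diversity; a weighted or learned combiner could replace the plain vote if per-model reliability is known.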

CLC Number:

  • TP391.9