Computer Science, 2021, Vol. 48, Issue (3): 206-213. doi: 10.11896/jsjkx.200200081

• Artificial Intelligence •

Prediction of Protein Subcellular Localization Based on Clustering and Feature Fusion

WANG Yi-hao, DING Hong-wei, LI Bo, BAO Li-yong, ZHANG Ying-jie

  1. School of Information Science and Engineering, Yunnan University, Kunming 650500, China
  • Received: 2020-02-16 Revised: 2020-05-21 Online: 2021-03-15 Published: 2021-03-05
  • Corresponding author: DING Hong-wei (dhw1964@163.com)
  • Author e-mail: 893885847@qq.com
  • About author: WANG Yi-hao, born in 1995, postgraduate, is a member of China Computer Federation. His main research interests include machine learning and computer vision.
    DING Hong-wei, born in 1964, Ph.D., professor, Ph.D. supervisor. His main research interests include multiple access communication and machine learning.
  • Supported by:
    National Natural Science Foundation of China (61461053, 61461054).

Abstract: Predicting the subcellular localization of proteins is an important basis for studying protein structure and function, and it is also significant for understanding the pathogenesis of some diseases and for drug design and discovery. However, accurately predicting protein subcellular locations with machine learning remains a challenging scientific problem. To address it, this paper proposes a protein subcellular localization method based on clustering and feature fusion. First, the autocorrelation coefficient method and the entropy density method are introduced into the construction of the protein feature representation model, and an improved PseAAC (pseudo-amino acid composition) method is proposed on the basis of the traditional PseAAC. To represent protein sequence information better, the autocorrelation coefficient method, the entropy density method and the improved PseAAC are fused to construct a new protein sequence representation model. Second, principal component analysis (PCA) is used to reduce the dimension of the fused feature vector. Third, the reduced vector is fed to the LibD3C ensemble classifier to predict protein subcellular localization, and prediction accuracy is evaluated by leave-one-out cross-validation on the Gram-positive and Gram-negative datasets. Finally, the experimental results are compared with those of other existing algorithms. The proposed method achieves prediction accuracies of 99.24% and 95.33% on the Gram-positive and Gram-negative datasets, respectively, which supports its soundness and effectiveness.

Key words: Autocorrelation coefficient, Clustering, Feature fusion, Principal component analysis, Pseudo-amino acid composition
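
The pipeline summarized in the abstract (sequence descriptors, feature fusion, leave-one-out evaluation) can be illustrated with a small sketch. It is not the paper's implementation, and several parts are assumptions made only for illustration: the autocorrelation and entropy density formulas are common textbook forms, the Kyte-Doolittle hydropathy scale is an arbitrary choice of residue property, plain amino acid composition stands in for the improved PseAAC, a 1-nearest-neighbour rule stands in for the LibD3C ensemble classifier, and the PCA step is omitted. All function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Kyte-Doolittle hydropathy index; an assumed example property scale.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def composition(seq):
    """Relative frequency of each of the 20 standard amino acids
    (a simple stand-in for the improved PseAAC descriptor)."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def autocorrelation(seq, max_lag=3):
    """Assumed form: r_k = 1/(L-k) * sum_i h_i * h_(i+k), for k = 1..max_lag.
    Requires len(seq) > max_lag and only standard residues."""
    h = [HYDROPATHY[a] for a in seq]
    n = len(h)
    return [sum(h[i] * h[i + k] for i in range(n - k)) / (n - k)
            for k in range(1, max_lag + 1)]


def entropy_density(seq):
    """Assumed form: s_i = -f_i * log2(f_i) / H, where H is the Shannon
    entropy of the amino acid frequencies f_i."""
    terms = [-f * math.log2(f) if f > 0 else 0.0 for f in composition(seq)]
    total = sum(terms)
    return [t / total if total > 0 else 0.0 for t in terms]


def fused_features(seq, max_lag=3):
    """Feature fusion by concatenating the three descriptors."""
    return composition(seq) + autocorrelation(seq, max_lag) + entropy_density(seq)


def leave_one_out_accuracy(vectors, labels):
    """Leave-one-out evaluation: each sample is predicted from all the others.
    The 1-nearest-neighbour rule here is a stand-in, not LibD3C."""
    correct = 0
    for i, x in enumerate(vectors):
        others = [j for j in range(len(vectors)) if j != i]
        nearest = min(others, key=lambda j: math.dist(x, vectors[j]))
        correct += labels[nearest] == labels[i]
    return correct / len(vectors)
```

With the default `max_lag=3`, `fused_features` returns a vector of 20 + 3 + 20 = 43 values per sequence. In a fuller version, a dimensionality reduction step such as PCA would sit between `fused_features` and the classifier, as the abstract describes.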

CLC Number: 

  • TP391