Computer Science ›› 2024, Vol. 51 ›› Issue (8): 117-123. doi: 10.11896/jsjkx.231100014

• Database & Big Data & Data Science •

河海图结构蛋白质数据集及预测模型

魏想想, 孟朝晖   

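  1. School of Computer and Software, Hohai University, Nanjing 211106
  • Received: 2023-11-01 Revised: 2024-03-05 Publication date: 2024-08-15 Release date: 2024-08-13
  • Corresponding author: MENG Zhaohui (mengzhaohui@hhu.edu.cn)
  • About author: (221307030006@hhu.edu.cn)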

Hohai Graphic Protein Data Bank and Prediction Model

WEI Xiangxiang, MENG Zhaohui   

  1. School of Computer and Software, Hohai University, Nanjing 211106, China
  • Received: 2023-11-01 Revised: 2024-03-05 Online: 2024-08-15 Published: 2024-08-13
  • About author: WEI Xiangxiang, born in 1999, master. His main research interests include artificial intelligence and neural networks.
    MENG Zhaohui, born in 1968, associate professor. His main research interests include neural networks and artificial intelligence.

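Abstract (translated from the Chinese): A protein is a substance with spatial structure. The main goal of protein structure prediction is to extract effective information from existing large-scale protein datasets and use it to predict the structures of proteins found in nature. One problem with current protein structure prediction experiments is the lack of datasets that further reflect the spatial structural features of proteins. The mainstream PDB protein dataset is experimentally measured, but it does not exploit the spatial features of proteins, and it is mixed with nucleic acid data and contains some incomplete entries. To address these problems, protein prediction is studied from the perspective of spatial structure. Building on the original PDB dataset, the Hohai Graphic Protein Data Bank (HohaiGPDB) is proposed. The dataset is based on a graph structure and expresses the spatial structural features of proteins. Protein structure prediction experiments on the new dataset with a traditional Transformer network model show that prediction accuracy on HohaiGPDB can reach 59.38%, which demonstrates the research value of the dataset. HohaiGPDB can serve as a general-purpose dataset for protein-related research.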

Keywords: Hohai graphic protein data bank, protein spatial structure, protein structure prediction, Transformer model

Abstract: Proteins are substances with spatial structure. The main goal of protein structure prediction is to extract effective information from existing large-scale protein datasets in order to predict the structures of proteins in nature. At present, one problem in protein structure prediction experiments is the lack of datasets that further reflect the spatial structure of proteins. Although the current mainstream PDB (Protein Data Bank) is experimentally measured, it does not utilize the spatial characteristics of proteins, and it suffers from mixed-in nucleic acid data and partially incomplete entries. In view of these problems, this paper studies protein prediction from the perspective of spatial structure. Based on the original PDB, the Hohai Graphic Protein Data Bank (HohaiGPDB) is proposed. The dataset expresses the spatial structure characteristics of proteins based on a graph structure. Protein structure prediction experiments are carried out on the new dataset with a traditional Transformer network model, and the prediction accuracy on HohaiGPDB can reach 59.38%, which demonstrates the research value of HohaiGPDB. HohaiGPDB can be used as a general dataset for protein-related studies.

Key words: Hohai graphic protein data bank, Protein spatial structure, Protein structure prediction, Transformer model

CLC Number:

  • TP391
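
The abstract describes a dataset that excludes nucleic acid data and represents each protein as a graph, but it does not say how the graph is constructed. The following Python sketch is therefore only one plausible illustration, not the authors' method. It assumes a residue-level contact graph: each node is an amino acid residue with one representative 3D coordinate, and an edge joins two residues whose coordinates lie within a distance cutoff. The function name `build_residue_graph`, the input layout, and the 8.0 Å cutoff are all hypothetical choices made for this example.

```python
import math

# Three-letter codes of the 20 standard amino acids. Any other residue name
# (for example the nucleotide names DA, DC, DG, DT, A, C, G, U) is treated
# as non-protein and dropped. This filter is an assumption of the sketch.
AMINO_ACIDS = {
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
}


def build_residue_graph(residues, cutoff=8.0):
    """Build a residue contact graph (hypothetical construction).

    residues: list of (residue_name, (x, y, z)) pairs, one coordinate per
              residue (for instance the alpha-carbon position), in angstroms.
    cutoff:   distance threshold for an edge; 8.0 is an assumed value.

    Returns (nodes, edges): nodes is the filtered residue list, and edges is
    a list of index pairs (i, j) with i < j into nodes.
    """
    # Keep only amino acid residues, which removes nucleic acid entries.
    nodes = [(name, xyz) for name, xyz in residues if name in AMINO_ACIDS]

    # Connect every pair of residues closer than the cutoff.
    edges = []
    for i in range(len(nodes)):
        for j in range(i + 1, len(nodes)):
            if math.dist(nodes[i][1], nodes[j][1]) <= cutoff:
                edges.append((i, j))
    return nodes, edges


if __name__ == "__main__":
    toy = [
        ("ALA", (0.0, 0.0, 0.0)),
        ("GLY", (3.8, 0.0, 0.0)),
        ("DA", (4.0, 0.0, 0.0)),    # nucleotide, filtered out
        ("SER", (20.0, 0.0, 0.0)),
    ]
    nodes, edges = build_residue_graph(toy)
    print([name for name, _ in nodes], edges)
```

The pairwise loop is quadratic in the number of residues, which is fine for a toy example; a spatial index would be a natural replacement for large structures.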