计算机科学 ›› 2023, Vol. 50 ›› Issue (12): 322-329.doi: 10.11896/jsjkx.221100057

• 人工智能 • 上一篇    下一篇

基于混合路径HMC的分子树空间采样方法

李晓鹏1, 凌诚2, 高敬阳1   

  1. 1 北京化工大学信息科学与技术学院 北京 100000
    2 中海国际中心超威半导体 北京 100000
  • 收稿日期:2022-11-07 修回日期:2023-03-14 出版日期:2023-12-15 发布日期:2023-12-07
  • 通讯作者: 高敬阳(gaojy@mail.buct.edu.cn)
  • 作者简介:(752668528@qq.com)
  • 基金资助:
    北京市自然科学基金(5182018)

Mixed Path HMC Sampling Methods for Molecular Tree Spaces

LI Xiaopeng1, LING Cheng2, GAO Jingyang1   

  1. 1 School of Information Science and Technology,Beijing University of Chemical Technology,Beijing 100000,China
    2 China Overseas International Center,Advanced Micro Devices,Inc(AMD),Beijing 100000,China
  • Received:2022-11-07 Revised:2023-03-14 Online:2023-12-15 Published:2023-12-07
  • About author:LI Xiaopeng,born in 1998,postgra-duate.His main research interest is computational phylogenetics.
    GAO Jingyang,born in 1966,Ph.D,professor,Ph.D supervisor.Her main research interests include artificial intelligence and bioinformatics.
  • Supported by:
    Natural Science Foundation of Beijing(5182018).

摘要: 随着现代分子序列数据越来越丰富,描述物种间历史关系的树状拓扑空间也急剧扩大,系统发育树的可靠推断仍面临着巨大挑战。近年来,马尔可夫链蒙特卡洛算法(MCMC)家族中最先进的哈密顿马尔可夫蒙特卡洛(HMC)算法被证明可以应用于系统发育分析,可以避免传统MCMC算法中存在的大量随机游走行为,加快马氏链的混合。但在更为复杂的多模态发育树空间中,HMC算法无法通过从其他模式中获得提议来逃离局部的高概率区域,为了提升算法的健壮性,文中提出了一种混合路径哈密顿马尔可夫蒙特卡洛(MPHMC)的优化方法。在不增加额外的计算成本的情况下,所提算法采样路径中添加针对离散参数的非HMC更新组件,与HMC确定性更新交替进行,进而在树空间中引入了拓扑变化更大的分支重排策略,能更自由地遍历整个后验分布的树空间。在5组经验数据集上进行实验,结果证明,MPHMC方法能更好地从正确的后验分布中采样;在比较难采样的大数据集上运行时,HMC单一路径的采样算法可能会失效,而MPHMC方法能获得比使用广泛的系统发育分析工具Mrbayes(MCMC)高14%以上的采样效率。

关键词: MrBayes, 树空间, 哈密顿马尔可夫蒙特卡洛(HMC), 多模态后验分布, 混合路径

Abstract: With the increasing abundance of modern molecular sequence data and the dramatic expansion of the tree-like topological space describing historical relationships between species,reliable inference of phylogenetic trees continues to face enormous challenges.In recent years,the most advanced Hamiltonian Markov Monte Carlo(HMC) algorithm in the Markov Chain Monte Carlo(MCMC) family has been shown to be applicable to phylogenetic analysis,which can avoid the large amount of random walk behaviors present in traditional MCMC algorithms and speed up the mixing of Markov chains.However,in the more complex multimodal development tree space,the HMC algorithm cannot escape from the local high probability region by obtaining propo-sals from other modes.In order to improve the robustness of the algorithm,a hybrid path Hamiltonian Markov Monte Carlo(MPHMC) optimization strategy is proposed in this paper.Without adding additional computational cost,the algorithm samples paths with a non-HMC update component for discrete parameters,alternating with HMC deterministic updates,and introduces a branch rearrangement strategy with greater topological variation in the tree space,enabling freer traversal of the entire posterior distribution's tree space.Experiments on five empirical datasets demonstrate that the MPHMC method better samples from the correct posterior distribution,and the HMC single-path sampling algorithm may fail when run on larger datasets that are more difficult to sample,while the MPHMC method achieves a sampling efficiency gain over 14% than the widely used phylogenetic analysis tool,Mrbayes(MCMC).

Key words: MrBayes, Tree space, HMC, Multimodal posterior distribution, Mixed path

中图分类号: 

  • TP399
[1]KAPLI P,YANG Z,TELFORD M J.Phylogenetic tree building in the genomic age[J].Nature Reviews Genetics,2020,21(7):428-444.
[2]YANG Z.Computational molecular evolution[M].Oxford:Oxford University Press,2006.
[3]HUELSENBECK J P,RONQUIST F.MRBAYES:Bayesian inference of phylogenetic trees[J].Bioinformatics,2001,17(8):754-755.
[4]RONQUIST F,HUELSENBECK J P.MrBayes 3:Bayesianphylogenetic inference under mixed models[J].Bioinformatics,2003,19(12):1572-1574.
[5]RONQUIST F,TESLENKO M,VAN DER MARK P,et al.MrBayes 3.2:efficient Bayesian phylogenetic inference and model choice across a large model space[J].Systematic Biology,2012,61(3):539-542.
[6]DRUMMOND A J,RAMBAUT A.BEAST:Bayesian evolu-tionary analysis by sampling trees[J].BMC Evolutionary Biology,2007,7(1):1-8.
[7]METROPOLIS N,ROSENBLUTH A W,ROSENBLUTH MN,et al.Equation of state calculations by fast computing machines[J].The Journal of Chemical Physics,1953,21(6):1087-1092.
[8]HASTINGS W K.Monte Carlo sampling methods using Markov chains and their applications[J].Biometrika,1970,57(1):97-109.
[9]NEAL R M.MCMC using Hamiltonian dynamics[J].Handbook of Markov Chain Monte Carlo,2011,2(11):2.
[10]DINH V,BILGE A,ZHANG C,et al.Probabilistic path hamiltonian monte carlo[C]//International Conference on Machine Learning.PMLR,2017:1009-1018.
[11]BILLERA L J,HOLMES S P,VOGTMANN K.Geometry of the space of phylogenetic trees[J].Advances in Applied Mathematics,2001,27(4):733-767.
[12]JUKES T H,CANTOR C R.Evolution of protein molecules[J].Mammalian Protein Metabolism,1969,3:21-132.
[13]YANG Z.Estimating the pattern of nucleotide substitution[J].Journal of Molecular Evolution,1994,39(1):105-111.
[14]MAU B,NEWTON M A.Phylogenetic inference for binary data on dendograms using Markov chain Monte Carlo[J].Journal of Computational and Graphical Statistics,1997,6(1):122-131.
[15]YANG Z,RANNALA B.Bayesian phylogenetic inference using DNA sequences:a Markov Chain Monte Carlo method[J].Molecular Biology and Evolution,1997,14(7):717-724.
[16]BROMHAM L,DUCHÊNE S,HUA X,et al.Bayesian molecular dating:opening up the black box[J].Biological Reviews,2018,93(2):1165-1191.
[17]YANG Z,RANNALA B.Molecular phylogenetics:principlesand practice[J].Nature Reviews Genetics,2012,13(5):303-314.
[18]BETANCOURT M.A conceptual introduction to HamiltonianMonte Carlo[J].arXiv:1701.02434,2017.
[19]AFSHAR H M,DOMKE J.Reflection,Refraction,and Hamiltonian Monte Carlo[C]//NIPS.2015:3007-3015.
[20]FELSENSTEIN J.Evolutionary trees from DNA sequences:amaximum likelihood approach[J].Journal of molecular evolution,1981,17(6):368-376.
[21]JI X,ZHANG Z,HOLBROOK A,et al.Gradients do grow on trees:a linear-time O(N)-dimensional gradient for statistical phylogenetics[J].Molecular Biology and Evolution,2020,37(10):3047-3060.
[22]BOU-RABEE N,SANZ-SERNA J M.Geometric integrators and the Hamiltonian Monte Carlo method[J].Acta Numerica,2018,27:113-206.
[23]WHIDDEN C,MATSEN IV F A.Quantifying MCMC exploration of phylogenetic tree space[J].Systematic Biology,2015,64(3):472-491.
[24]MANGOUBI O,PILLAI N S,SMITH A.Does HamiltonianMonte Carlo mix faster than a random walk on multimodal densities?[J].arXiv:1808.03230,2018.
[25]PARK J.Sampling from multimodal distributions using tem-pered Hamiltonian transitions[J].arXiv:2111.06871,2021.
[26]LAKNER C,VAN DER MARK P,HUELSENBECK J P,et al.Efficiency of Markov chain Monte Carlo tree proposals in Baye-sian phylogenetics[J].Systematic Biology,2008,57(1):86-103.
[27]ZHOU G.Metropolis Augmented Hamiltonian Monte Carlo[J].arXiv:2201.08044,2022.
[28]SUN S,SHEN Y.“Parallel-Tempering”-Assisted Hybrid Monte Carlo Algorithm for Bayesian Inference in Dynamical Systems[C]//UK Workshop on Computational Intelligence.Cham:Springer,2019:357-368.
[29]HUELSENBECK J P.Performance of phylogenetic methods in simulation[J].Systematic Biology,1995,44(1):17-48.
[30]NEI M,TAKEZAKI N,SITNIKOVA T.Assessing molecular phylogenies[J].Science,1995,267(5195):253-255.
[31]XIE Q,BU W,ZHENG L.The Bayesian phylogenetic analysis of the 18S rRNA sequences from the main lineages of Trichophora(Insecta:Heteroptera:Pentatomomorpha)[J].Molecular Phylogenetics and Evolution,2005,34(2):448-451.
[32]XIE Q,TIAN Y,ZHENG L,et al.18S rRNA hyper-elongation and the phylogeny of Euhemiptera(Insecta:Hemiptera)[J].Molecular Phylogenetics and Evolution,2008,47(2):463-471.
[33]XIE Q,WANG Y,LIN J,et al.Potential key bases of ribosomal RNA to kingdom-specific spectra of antibiotic susceptibility and the possible archaeal origin of eukaryotes[J].PloS one,2012,7(1):e29468.
[34]GEYER C J.Markov Chain Monte Carlo Maximum Likelihood[M].American Cancer Society,2005.
[35]NYLANDER J A A,FREDRIK R,HUELSENBECK J P,et al.Bayesian Phylogenetic Analysis of Combined Data[J].Systema-tic Biology,2004,53(1):47-67.
[36]RAMBAUT A,DRUMMOND A J,XIE D,et al.Posterior summarization in Bayesian phylogenetics using Tracer 1.7[J].Systematic Biology,2018,67(5):901-904.
[37]SUKUMARAN J,HOLDER M T.DendroPy:a Python library for phylogenetic computing[J].Bioinformatics,2010,26(12):1569-1571.
[38]HUERTA-CEPAS J,SERRA F,BORK P.ETE 3:reconstruction,analysis,and visualization of phylogenomic data[J].Mole-cular Biology and Evolution,2016,33(6):1635-1638.
[39]SMITH M R.Robust analysis of phylogenetic tree space[J].Systematic Biology,2022,71(5):1255-1270.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!