Computer Science, 2021, Vol. 48, Issue (2): 70-75. doi: 10.11896/jsjkx.200500156
邓丽, 武金达, 李科学, 卢亚康
DENG Li, WU Jin-da, LI Ke-xue, LU Ya-kang
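Abstract: Metagenomic sequence assembly poses major challenges in both computation and memory. SpaRC (Spark Reads Clustering) is an Apache Spark-based algorithm for clustering metagenomic reads, and it offers a scalable way to cluster the billions of reads produced by next-generation sequencing. Setting SpaRC's parameters, however, is very challenging. The algorithm has many hyperparameters that strongly affect its performance, so choosing an appropriate set of hyperparameters is essential to getting the most out of it. To improve SpaRC's performance, this paper explores a hyperparameter optimization method based on the Tree-structured Parzen Estimator (Tree Parzen Estimator, TPE). The method uses prior knowledge to tune parameters efficiently and speeds up the search for optimal parameters by reducing the number of computational tasks, reaching the best clustering result while avoiding expensive parameter exploration. Experiments on long reads (PacBio) and short reads (CAMI2) show that the method is effective at improving SpaRC's performance.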
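The TPE idea summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation and does not call SpaRC. It is a simplified one-dimensional TPE loop written with the Python standard library only. The objective `toy_loss`, the fixed kernel bandwidth, and every parameter name (`gamma`, `n_startup`, `n_candidates`) are assumptions made for illustration. Past trials are split into a "good" group and a "bad" group by loss, a Parzen (kernel) density is built for each group, and the next trial is the candidate with the highest ratio of good density to bad density.

```python
import math
import random


def parzen_density(x, centers, bandwidth):
    """Average of Gaussian kernels centred on past observations."""
    norm = bandwidth * math.sqrt(2.0 * math.pi)
    total = sum(math.exp(-0.5 * ((x - c) / bandwidth) ** 2) for c in centers)
    return total / (len(centers) * norm)


def tpe_minimize(objective, low, high, n_trials=60, n_startup=10,
                 gamma=0.25, n_candidates=24, seed=0):
    """Simplified 1-D TPE loop; returns (best_x, best_loss).

    Assumes n_startup >= 2 so that both groups are non-empty.
    """
    rng = random.Random(seed)
    bandwidth = (high - low) / 10.0  # fixed kernel width (an assumption)
    history = []  # (loss, x) pairs from all finished trials

    for trial in range(n_trials):
        if trial < n_startup:
            # Warm-up: plain random search to seed the densities.
            x = rng.uniform(low, high)
        else:
            ranked = sorted(history)
            n_good = max(1, int(gamma * len(ranked)))
            good = [v for _, v in ranked[:n_good]]
            bad = [v for _, v in ranked[n_good:]]
            # Draw candidates from the "good" density l(x) ...
            candidates = []
            for _ in range(n_candidates):
                c = rng.gauss(rng.choice(good), bandwidth)
                candidates.append(min(high, max(low, c)))
            # ... and keep the one that maximises l(x) / g(x).
            x = max(
                candidates,
                key=lambda c: parzen_density(c, good, bandwidth)
                / (parzen_density(c, bad, bandwidth) + 1e-12),
            )
        history.append((objective(x), x))

    best_loss, best_x = min(history)
    return best_x, best_loss


def toy_loss(x):
    """Hypothetical stand-in for a clustering error; not SpaRC."""
    return (x - 0.3) ** 2


best_x, best_loss = tpe_minimize(toy_loss, 0.0, 1.0)
```

In a real setting the objective would launch a clustering run and return an error measure, so each evaluation is expensive. This is why a search that concentrates new trials near earlier good results may need fewer evaluations than an exhaustive search.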