Computer Science (计算机科学), 2021, Vol. 48, Issue (2): 70-75. doi: 10.11896/jsjkx.200500156

• New Distributed Computing Technology and System

SpaRC Algorithm Hyperparameter Optimization Methodology Based on TPE

DENG Li, WU Jin-da, LI Ke-xue, LU Ya-kang

  1. School of Mechatronic Engineering and Automation, Shanghai University, Shanghai 200072, China; Shanghai Key Laboratory of Power Station Automation Technology, Shanghai 200072, China
  • Received: 2020-05-29 Revised: 2020-12-05 Online: 2021-02-15 Published: 2021-02-04
  • Corresponding author: DENG Li (dengli@shu.edu.cn)
  • About author: DENG Li, born in 1978, associate professor. Her main research interests include metagenomic data analysis and machine learning.
  • Supported by:
    The National Natural Science Foundation of China (61802246).

Abstract: Metagenomic sequence assembly faces huge challenges in computation and memory. SpaRC (Spark Reads Clustering) is a metagenomic sequence fragment clustering algorithm based on Apache Spark that provides a scalable solution for clustering billions of sequencing fragments from next-generation sequencing technologies. However, setting the parameters of SpaRC is very challenging: the algorithm has many hyperparameters that strongly affect its performance, so choosing an appropriate hyperparameter set is crucial to getting the most out of it. To improve the performance of SpaRC, a hyperparameter optimization method based on the Tree-structured Parzen Estimator (TPE) is explored. The method uses prior knowledge to tune the parameters efficiently and speeds up the search for the optimal parameters by reducing the number of computing tasks, achieving the best clustering effect while avoiding expensive parameter exploration. Experiments on long reads (PacBio) and short reads (CAMI2) show that the proposed method is effective in improving the performance of SpaRC.

Key words: SpaRC, Metagenomics, Sequence fragment clustering, TPE, Hyperparameter optimization

CLC Number: 

  • TP399
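
The TPE search loop summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation and it does not call SpaRC: `toy_loss` is a hypothetical stand-in for a clustering-quality objective, the single parameter `x` and its range are made up, and the fixed kernel bandwidth and the defaults `gamma`, `n_startup`, and `n_candidates` are assumptions chosen for readability. The sketch splits past trials into a "good" and a "bad" group, fits a Parzen (Gaussian kernel) density to each, and evaluates next the candidate with the largest ratio of good to bad density.

```python
import math
import random


def parzen_density(x, centers, bandwidth):
    """Average of Gaussian kernels centred on previously observed values."""
    norm = bandwidth * math.sqrt(2.0 * math.pi)
    total = sum(math.exp(-0.5 * ((x - c) / bandwidth) ** 2) for c in centers)
    return total / (len(centers) * norm)


def tpe_minimize(objective, low, high, n_trials=60, n_startup=10,
                 gamma=0.25, n_candidates=24, seed=0):
    """Simplified one-dimensional TPE loop; returns (best_x, best_loss)."""
    rng = random.Random(seed)
    bandwidth = (high - low) / 10.0  # fixed kernel width (an assumption)
    trials = []  # (loss, x) pairs for every evaluated point

    for i in range(n_trials):
        if i < n_startup:
            # Warm-up phase: plain random search.
            x = rng.uniform(low, high)
        else:
            # Split the history: best gamma fraction is "good", rest is "bad".
            ranked = sorted(trials)
            n_good = max(1, int(gamma * len(ranked)))
            good = [v for _, v in ranked[:n_good]]
            bad = [v for _, v in ranked[n_good:]]
            # Draw candidates from the "good" density l(x), clipped to range.
            candidates = []
            for _ in range(n_candidates):
                c = rng.gauss(rng.choice(good), bandwidth)
                candidates.append(min(max(c, low), high))
            # Evaluate next the candidate that maximises l(x) / g(x).
            x = max(
                candidates,
                key=lambda c: parzen_density(c, good, bandwidth)
                / (parzen_density(c, bad, bandwidth) + 1e-12),
            )
        trials.append((objective(x), x))

    best_loss, best_x = min(trials)
    return best_x, best_loss


def toy_loss(x):
    """Hypothetical objective; a real run would score a clustering result."""
    return (x - 3.0) ** 2


if __name__ == "__main__":
    print(tpe_minimize(toy_loss, 0.0, 10.0))
```

In practice each objective evaluation would be a full clustering run, which is why spending trials only on promising candidates matters. A real study would normally use an established TPE library and a multi-dimensional search space rather than this one-parameter sketch.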