计算机科学 ›› 2019, Vol. 46 ›› Issue (5): 36-43.doi: 10.11896/j.issn.1002-137X.2019.05.005

• 综述 • 上一篇    下一篇

高通量测序中序列拼接算法的研究进展

周卫星, 石海鹤   

  1. (江西师范大学计算机信息工程学院 南昌330022)
  • 收稿日期:2018-08-09 修回日期:2018-12-13 发布日期:2019-05-15
  • 作者简介:周卫星(1994-),男,硕士生,CCF学生会员,主要研究方向为生物序列分析;石海鹤(1979-),女,博士,教授,CCF会员,主要研究方向为生物信息学、软件工程、形式化方法,E-mail:haiheshi@jxnu.edu.cn(通信作者)。
  • 基金资助:
    国家自然科学基金项目(61662035,61762049,61862033),江西省自然科学基金项目(20171BAB202013)资助。

Survey on Sequence Assembly Algorithms in High-throughput Sequencing

ZHOU Wei-xing, SHI Hai-he   

  1. (College of Computer Information and Engineering,Jiangxi Normal University,Nanchang 330022,China)
  • Received:2018-08-09 Revised:2018-12-13 Published:2019-05-15

摘要: 高通量测序(High-throughput Sequencing,HTS)技术是继第一代测序技术之后发展起来的一种新型测序方式,又被称为下一代测序技术。与第一代测序技术中采用基于Sanger方法的自动、半自动毛细管测序方法不同,高通量测序技术采用了基于焦磷酸测序的并行测序技术,是对传统测序技术的一项重要技术突破,它不仅克服了第一代测序技术高成本、低通量、低速度的缺点,而且能满足现代分子生物学和基因组学快速发展的需求,达到低成本、高通量以及快速的目的。相较于第一代测序数据,高通量测序数据具有典型的长度短、覆盖度不均匀以及准确率低的特点,同时第三代测序技术虽保持了高通量测序技术边测序边合成的思想,但采用了更为高效的单分子实时测序技术和纳米孔测序技术,具有高通量、低成本和测序数据长的优势。因此,要获得完整的全基因组基因序列,生物学家就需要使用一种技术将短测序reads拼装成一条完整的基因单链序列。在这种情况下,序列拼接算法应运而生。首先,介绍了序列拼接算法的发展背景以及高通量测序技术的相关概念,分析了高通量测序技术在序列拼接算法中所具有的优势;其次,通过总结序列拼接算法的发展成果,按基于greedy策略、基于Overlap-Layout-Consensus (OLC)策略和基于De Bruijn Graph (DBG)策略的分类对序列拼接算法进行阐述;最后,探讨了序列拼接算法的相关研究方向和发展趋势。

关键词: DeBruijnGraph, greedy, Overlap-Layout-Consensus, 高通量测序技术, 序列拼接算法

Abstract: High-throughput sequencing technology is a new sequencing method developed after the first generation sequencing technology,also known as next-generation sequencing technology.Different from the automatic and semi-automatic capillary sequencing method based on Sanger,the high-throughput sequencing technology adopts the parallel sequencing technology based on pyrosequencing.It not only conquers the shortcomings of high cost,low throughput and low speed of the first generation sequencing technology,but also meets the demands of the rapid development of modern molecular biology and genomics with low cost,high throughput and fast speed.Compared with the first generation sequencing data,high-throughput sequencing data are characterized by short lengths,uneven coverage and low accuracy,and the third-generation sequencing technology adopts more efficient single molecular real-time sequencing and Nanopore sequencing technology as well as the principle of sequencing and synthesis,which has the advantages of high throughput,low cost and long sequencing data.Therefore,in order to obtain complete genome sequence,a technique is needed to assemble short sequencing reads into a complte single-stranded sequence of genes.In this case, the sequence assembly algorithm was proposed.Firstly,the development background of sequence assembly algorithms and the related concepts of high-throughput sequencing technology were introduced,and the advantages of high-throughput sequencing technology on sequence assembly were analyzed.Secondly,by summarizing the development of sequence assembly algorithms.The sequence assembly algorithms were illustrated,according to the algorithm classifications,respectively,by greedy strategy,Overlap-Layout-Consensus (OLC) strategy and De Bruijn Graph (DBG) strategy.Finally,the research direction and development trend of sequence assembly algorithms were discussed.

Key words: De bruijn graph, Greedy, High-throughput sequencing, Overlap-layout-consensus, Sequence assembly algorithms

中图分类号: 

  • TP301.6
[1]WARD R M,SCHMIEDER R,HIGHNAM G,et al.Big data challenges and opportunities in high-throughput sequencing[J].Systems Biomedicine,2013,1(1):29-34.
[2]PEARSON W R Y,LIPMAN D J.Improved tools for biological sequence analysis[J].Proceedings of the National Academy of Sciences of the United Sates of America,1988,85(46):16138-16143.
[3]ALTSCHUL S F,MADDEN T L,SCHÄFFER A A,et al.
Gapped BLAST and PSI-BLAST:a new generation of protein database search programs[J].Nucleic Acids Research,1997,25(17):3389-3402.
[4]LARKIN M A,BLACKSHIELDS G,BROWN N P,et al.Clus-tal W and Clustal X version 2.0[J].Bioinformatics,2007,23(21):2947-2948.
[5]DARLING A E,MAU B,PERNA N T.progressive Mauve:Mul-tiple Genome Alignment with Gene Gain,Loss and Rearrangement[J].PLOS One,2010,5(6):e11147.
[6]BLANCHETTE M,KENT W J,RIEMER C,et al.Aligningmultiple genomic sequences with the threaded blockset aligner[J].Genome Research,2004,14(4):708-715.
[7]CANTOR C R.Orchestrating the Human Genome Project[J].Science,1990,248(4951):49-51.
[8]CONSORTIUM T G P.A global reference for human genetic variation,the 1000 Genomes Project Consortium[J].Nature,2015,526:68-74.
[9]CONSORTIUM T U.The UK10K project identifies rare va-riants in health and disease[J].Nature,2015,526(7571):82-90.
[10]WATSON M.Illuminating the future of DNA sequencing[J].Genome Biology,2014,15(2):108.
[11]Sanger F,Nicklen S,Coulson A R.DNA sequencing with chain-terminating inhibitors[J].Proceedings of the National Academy of Sciences of the United States of America,1977,74(12):5463-5467.
[12]MOROZOVA O.Applications of next-generation sequencingtechnologies in functional genomics[J].Genomics,2008,92(5):255-264.
[13]WOOLEY J C,GODZIK A,FRIEDBERG I.A Primer on Meta-genomics[J].PLOS Computational Biology,2010,6(2):e1000667.
[14]WU X,ZHU X,WU G Q,et al.Data Mining with Big Data[J].IEEE Transactions on Knowledge & Data Engineering,2014,26(1):97-107.
[15]FONSECA N A,RUNG J,BRAZMA A,et al.Tools for mapping high-throughput sequencing data[J].Bioinformatics,2012,28(24):3169-3177.
[16]NIEDRINGHAUS T P,MILANOVA D,KERBY M B,et al.Landscape of Next-Generation Sequencing Technologies[J].Analytical Chemistry,2011,83(12):4327-4341.
[17]HARRIS T D,BUZBY P R,BABCOCK H,et al.Single-molecule DNA sequencing of a viral genome[J].Science,2008,320(5872):106-109.
[18]FLUSBERG B A,WEBSTER D,LEE J,et al.Direct detection of DNA methylation during single-molecule,real-time sequencing[J].Nature Methods,2010,7(6):461-465.
[19]RUSK N.Cheap third-generation sequencing[J].Nature Me-thods,2009,6(4):244-244.
[20]CHIN C S,PELUSO P,SEDLAZECK F J,et al.Phased Diploid Genome Assembly with Single Molecule Real-Time Sequencing[J].Nature Methods,2016,13(12):1050-1054.
[21]HEEREMA S J,DEKKER C.Graphene nanodevices for DNA sequencing[J].Nature Nanotechnology,2016,11(2):127-136.
[22]HEATHER J M,CHAIN B.The sequence of sequencers:The history of sequencing DNA[J].Genomics,2016,107(1):1-8.
[23]SHENDURE J,JI H.Next-generation DNA sequencing[J].Nature Biotechnology,2008,26(10):1135-1145.
[24]JACKMAN S D,VANDERVALK B P,MOHAMADI H,et al.ABySS 2.0:resource-efficient assembly of large genomes using a Bloom filter[J].Genome Research,2017,27(5):768-777.
[25]ZIMIN A V,STEVENS K A,CREPEAU M W,et al.An improved assembly of the loblolly pine mega-genome using long-read single-molecule sequencing[J].GIGA Science,2017,6(1):1-4.
[26]MOSTOVOY Y,LEVYSAKIN M,LAM J,et al.A hybrid approach for de novo human genome sequence assembly and phasing[J].Nature Methods,2016,13(7):587-590.
[27]LIU Y,BERTIL S,MASKELL D L.Parallelized short read assembly of large genomes using de Bruijn graphs[J].BMC Bioinformatics,2011,12(1):354.
[28]WARREN R L,SUTTON G G,JONES S J M,et al.Assembling millions of short DNA sequences using SSAKE[J].Bioinforma-tics,2007,23(4):500-501.
[29]JECK W R,REINHARDT J A,BALTRUS D A,et al.Extending assembly of short DNA sequences to handle error[J].Bioinformatics,2007,23(21):2942-2944.
[30]DOHM J C,LOTTAZ C,BORODINA T,et al.SHARCGS,a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing[J].Genome Research,2007,17(11):1697-1706.
[31]FARRANT G K,HOEBEKE M,PARTENSKY F,et al.WiseScaffolder:an algorithm for the semi-automatic scaffolding of Next Generation Sequencing data[J].BMC Bioinformatics,2015,16(1):281.
[32]CAO M D,NGUYEN S H,GANESAMOORTHY D,et al.Scaffolding and completing genome assemblies in real-time withna-nopore sequencing[J].Nature Communications,2017,8:14515.
[33]HIEU T N,ZIAUR R M,HE L,et al.Complete De NovoAssembly of Monoclonal Antibody Sequences[J].Scientific Reports,2016,6:31730.
[34]MIN L,LIAO Z,HE Y,et al.ISEA:Iterative Seed-Extension Algorithm for De Novo Assembly Using Paired-End Information and Insert Size Distribution[J].IEEE/ACM Transactions on Computational Biology and Bioinformatics,2017,14(4):916-925.
[35]MYERS E W,SUTTON G G,DELCHER A L,et al.A Whole-Genome Assembly of Drosophila[J].Science,2000,287(5461):2196-2204.
[36]DE L B M,MCCOMBIE W R.Assembling genomic DNA se-quences with PHRAP[J].Current Protocols in Bioinformatics,2007,17(1):11.4.1-11.4.15.
[37]MARGULIES M,EGHOLM M,ALTMAN W E,et al.Genome sequencing in microfabricated high-density picolitre reactors[J].Nature,2005,437(7057):376-380.
[38]KAMATH G M,SHOMORONY I,XIA F,et al.HINGE:long-read assembly achieves optimal repeat resolution[J].Genome Research,2017,27(5):747-756.
[39]MYERS G.Efficient Local Alignment Discovery amongst Noisy Long Reads[M]∥Algorithms in Bioinformatics.Berlin:Sprin-ger,2014:52-67.
[40]KOREN S,WALENZ B P,BERLIN K,et al.Canu:scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation[J].Genome Research,2017,27(5):722-736.
[41]YE C,HILL C M,WU S,et al.DBG2OLC:Efficient Assembly of Large Genomes Using Long Erroneous Reads of the Third Generation Sequencing Technologies[J].Scientific Reports,2016,6(X):31900.
[42]YE C,MA Z S,CANNON C H,et al.Exploiting sparseness in de novo genome assembly[J].BMC Bioinformatics,2012,13(S6):S1.
[43]LI H.Minimap and miniasm:fast mapping and de novo assembly for noisy long sequences[J].Bioinformatics,2015,32(14):2103-2110.
[44]VASER R,SOVIC' I,NAGARAJAN N,et al.Fast and accurate de novo genome assembly from long uncorrected reads[J].Genome Research,2017,27(5):737-746.
[45]JANSEN H J,LIEM M,JONGRAADSEN S A,et al.Rapid de novo assembly of the European eel genome from nanopore sequencing reads[J].Scientific Reports,2017,7(1):7213.
[46]BAAIJENS J A, AZE A, RIVALS E,et al.De novo assembly of viral quasispecies using overlap graphs[J].Genome Research,2017,27(5):835-848.
[47]PENG G,JI P,ZHAO F.A novel codon-based de Bruijn graph algorithm for gene construction from unassembled transcriptomes[J].Genome Biology,2016,17(1):232.
[48]CAMERON D L,SCHROEDER J,PENINGTON J S,et al.
GRIDSS:sensitive and specific genomic rearrangement detection using positional de Bruijn graph assembly[J].Genome Research,2017,27(12):1-11.
[49]WEISENFELD N I,KUMAR V,SHAH P,et al.Direct determination of diploid genome sequences[J].Genome Research,2017,27(5):757-767.
[50]BUTLER J,MACCALLUM I,KLEBER M,et al.ALLPATHS:de novo assembly of whole-genome shotgun microreads[J].Genome Research,2008,18(5):810-820.
[51]PEVZNER P A,TANG H,WATERMAN M S.An Eulerianpath approach to DNA fragment assembly[J].Proceedings of the National Academy of Sciences of the United Sates of America,2001,98(17):9748-9753.
[52]GNERRE S,JAFFE D B.High-quality draft assemblies of mammalian genomes from massively parallel sequence data[J].Proceedings of the National Academy of Sciences of the United Sates of America,2011,108(4):1513-1518.
[53]LI R,ZHU H,RUAN J,et al.De novo assembly of human genomes with massively parallel short read sequencing[J].Genome Research,2010,20(2):265-272.
[54]LUO R,LIU B,XIE Y,et al.SOAPdenovo2:an empirically improved memory-efficient short-read de novo assembler[J].GIGA Science,2012,1(1):18.
[55]XIE Y,WU G,TANG J,et al.SOAPdenovo-Trans:de novotranscriptome assembly with short RNA-Seq reads[J].Bioinformatics,2014,30(12):1660-1666.
[56]PENG Y,LEUNG H C M,YIU S M,et al.IDBA-UD:a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth[J].Bioinformatics,2012,28(11):1420-1428.
[57]CHITSAZ H,YEE-GREENBAUM J L,TESLER G,et al.Efficient de novo assembly of single-cell bacterial genomes from short-read data sets[J].Nature Biotechnology,2011,29(10):915-921.
[58]PENG Y,LEUNG H C M,YIU S M,et al.IDBA-tran:a more robust de novo de Bruijn graph assembler for transcriptomes with uneven expression levels[J].Bioinformatics,2013,29(13):326-334.
[59]NURK S,MELESHKO D,KOROBEYNIKOV A,et al.metaSPAdes:a new versatile metagenomics assembler[J].Genome Research,2017,27(5):824-834[60]BANKEVICH A,NURK S,ANTIPOV D,et al.SPAdes:A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing[J].Journal of Computational Biology,2012,19(5):455-477.
[61]PRJIBELSKI A D,VASILINETC I,BANKEVICH A,et al.ExSPAnder:a universal repeat resolver for DNA fragment assembly[J].Bioinformatics,2014,30(12):293-301.
[62]SAFONOVA Y,BANKEVICH A,PEVZNER P A.dipSPAdes:Assembler for Highly Polymorphic Diploid Genomes[C]∥International Conference on Research in Computational Molecular Biology.New York:Springer-Verlag,2014:265-279.
[63]ANTIPOV D,KOROBEYNIKOV A,MCLEAN J S,et al.hybridSPAdes:an algorithm for hybrid assembly of short and long reads[J].Bioinformatics,2015,32(7):1009-1015.
[64]VASILINETC I,PRJIBELSKI A D,GUREVICH A,et al.Assembling short reads from jumping libraries with large insert sizes[J].Bioinformatics,2015,31(20):3262-3268.
[1] 邵子灏, 杨世宇, 马国杰.
室内信息服务的基础——低成本定位技术研究综述
Foundation of Indoor Information Services:A Survey of Low-cost Localization Techniques
计算机科学, 2022, 49(9): 228-235. https://doi.org/10.11896/jsjkx.210900260
[2] 孙刚, 伍江江, 陈浩, 李军, 徐仕远.
一种基于切比雪夫距离的隐式偏好多目标进化算法
Hidden Preference-based Multi-objective Evolutionary Algorithm Based on Chebyshev Distance
计算机科学, 2022, 49(6): 297-304. https://doi.org/10.11896/jsjkx.210500095
[3] 李丹丹, 吴宇翔, 朱聪聪, 李仲康.
基于多种改进策略的改进麻雀搜索算法
Improved Sparrow Search Algorithm Based on A Variety of Improved Strategies
计算机科学, 2022, 49(6A): 217-222. https://doi.org/10.11896/jsjkx.210700032
[4] 胡聪, 何晓晖, 邵发明, 张艳武, 卢冠林, 王金康.
基于极大极稳定区域及SVM的交通标志检测
Traffic Sign Detection Based on MSERs and SVM
计算机科学, 2022, 49(6A): 325-330. https://doi.org/10.11896/jsjkx.210300117
[5] 杨健楠, 张帆.
一种结合双注意力机制和层次网络结构的细碎农作物分类方法
Classification Method for Small Crops Combining Dual Attention Mechanisms and Hierarchical Network Structure
计算机科学, 2022, 49(6A): 353-357. https://doi.org/10.11896/jsjkx.210200169
[6] 张嘉淏, 刘峰, 齐佳音.
一种基于Bottleneck Transformer的轻量级微表情识别架构
Lightweight Micro-expression Recognition Architecture Based on Bottleneck Transformer
计算机科学, 2022, 49(6A): 370-377. https://doi.org/10.11896/jsjkx.210500023
[7] 王方红, 范兴刚, 杨静静, 周杰, 王德恩.
一种基于有向感知区域调整的强栅栏构建算法
Strong Barrier Construction Algorithm Based on Adjustment of Directional Sensing Area
计算机科学, 2022, 49(6A): 612-618. https://doi.org/10.11896/jsjkx.210300291
[8] 张志龙, 史贤俊, 秦玉峰.
基于改进准深度算法的诊断策略优化方法
Diagnosis Strategy Optimization Method Based on Improved Quasi Depth Algorithm
计算机科学, 2022, 49(6A): 729-732. https://doi.org/10.11896/jsjkx.210700076
[9] 高元浩, 罗晓清, 张战成.
基于特征分离的红外与可见光图像融合算法
Infrared and Visible Image Fusion Based on Feature Separation
计算机科学, 2022, 49(5): 58-63. https://doi.org/10.11896/jsjkx.210200148
[10] 刘洋, 李凡长.
基于变分贝叶斯的纤维丛元学习算法
Fiber Bundle Meta-learning Algorithm Based on Variational Bayes
计算机科学, 2022, 49(3): 225-231. https://doi.org/10.11896/jsjkx.201100111
[11] 乔杰, 蔡瑞初, 郝志峰.
一种基于信息瓶颈的因果关系挖掘方法
Mining Causality via Information Bottleneck
计算机科学, 2022, 49(2): 198-203. https://doi.org/10.11896/jsjkx.210100053
[12] 赵学磊, 季新生, 刘树新, 李英乐, 李海涛.
基于路径连接强度的有向网络链路预测方法
Link Prediction Method for Directed Networks Based on Path Connection Strength
计算机科学, 2022, 49(2): 216-222. https://doi.org/10.11896/jsjkx.210100107
[13] 魏昕, 冯锋.
基于高斯-柯西变异的帝国竞争算法优化
Optimization of Empire Competition Algorithm Based on Gauss-Cauchy Mutation
计算机科学, 2021, 48(11A): 142-146. https://doi.org/10.11896/jsjkx.201200071
[14] 王友卫, 朱晨, 朱建明, 李洋, 凤丽洲, 刘江淳.
基于用户兴趣词典和LSTM的个性化情感分类方法
User Interest Dictionary and LSTM Based Method for Personalized Emotion Classification
计算机科学, 2021, 48(11A): 251-257. https://doi.org/10.11896/jsjkx.201200202
[15] 肖满, 李伟东.
两点混合环上的半在线算法
Semi-online Algorithms for Mixed Ring with Two Nodes
计算机科学, 2021, 48(11A): 441-445. https://doi.org/10.11896/jsjkx.201100153
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!