Computer Science ›› 2015, Vol. 42 ›› Issue (9): 235-239.doi: 10.11896/j.issn.1002-137X.2015.09.045

Previous Articles     Next Articles

CGDNA:An Ensemble De Novo Genome Assembly Algorithm Based on Clustering Graph

XU Kui, CHEN Ke, XU Jun, TIAN Jia-lin, LIU Hao and WANG Yu-fan   

  • Online:2018-11-14 Published:2018-11-14

Abstract: The ultimate goal of genome sequencing is to determine the complete DNA sequence of an organism,which is the basis for genetic research and disease diagnosis.In general,genome sequencing can be divided into two steps:first,generating and determining the DNA fragments experimentally;second,assembling the fragments into full genome through computational method.Although the Sanger technology successfully resolves the human genome,it is replaced by the next generation of sequencing technology due to its high cost.The next generation of sequencing technology has the merits of high throughput,high coverage and low cost and accompanies with short reads and more errors as a byproduct,which brings more challenge to the assembly algorithms.Since it is reported that the assembly results by different algorithms are complementary and none of the assembly algorithms consistently outperforms the remaining algorithms,this study aimed at integrating the assembly results produced by multiple algorithms.In this study,we proposed an algorithm based on clustering graph.Through building index,mapping of reads,clustering of contigs and building of clustering graph,the proposed algorithm outperforms any of the single algorithm.The experimental results demonstrate that by implementing the CGDNA algorithm,two standard metrics(the largest scaffold and scaffold N50) are increased by 50% when compared to the state-of-the-art algorithms,i.e.,Velvet,ABySS,and SOAPdenovo.Moreover,the performance of CGDNA algorithm should be further improved when more base algorithms are added.The proposed algorithm largely improves the quality of assembly result,reduces the difficulty of genetic analysis and accelerates the genome research.

Key words: De novo genome assembly,Ensemble algorithm,Clustering graph,Indexing,Read mapping

[1] Medvedev P,Georgiou K,Myers G,et al.Computability of Mo-dels for Sequence Assmbly[M]∥Algorithms in Bioinformatics.Springer-Verlag Berlin Heidelberg,2007:289-301
[2] Drmanac R,et al.Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays[J].Science,2010,327(5961):78-81
[3] Harris T D,et al.Single-molecule DNA sequencing of a viral genome[J].Science,2008,320(5872):106-109
[4] Margulies M,et al.Genome sequencing inmicrofabricated high-density picolitre reactors[J].Nature,2005,7(7075):376-380
[5] McKernan K J,et al.Sequence and structural variation in a human genome uncovered by short-read,massively parallel ligation sequencing using two-base encoding[J].Genome Res,2009,19(9):1527-1541
[6] Medvedev P,et al.Paired de Bruijn graphs:a novel approach for incorporating mate pair information into genome assemblers[M]∥Research in Computational Molecular Biology.Springer,2011:238-251
[7] Hernandez D,Francois P,Farinelli L,et al.De novo bacterial genome sequencing:millions of very short reads assembled on a desktop computer[J].Genome Res.,2008,8(5):802-809
[8] Miller J R,Koren S,Sutton G.Assembly algorithms for next-generation sequencing data[J].Genomics,2010,5(6):315-327
[9] Warren R L,Sutton G G,Jones S J,et al.Assembling millions of short DNA sequences using SSAKE[J].Bioinformatics,2007,23(4):500-501
[10] Dohm J C,Lottaz C,Borodina T,et al.A fast and highly accurate short-read assembly algorithm for de novo genomic sequencing[J].Genome Res,2007,7(11):1697-1706
[11] Jeck W R,Reinhardt J A,Baltrus D A,et al.Extending assembly of short DNA sequences to handle error[J].Bioinformatics,2007,23(21):2942-2944
[12] http://linux1.softberry.com/berry.phtml?topic=OligoZip
[13] Bresler M,Sheehan S,Chan A H,et al.Telescoper:De novo Assembly of Highly Repetitive Regions[J].Bioinformatics,2012,8(18):311-317
[14] Chaisson M J P,et al.De novo fragment assembly with short mate-paired reads:does the read length matter?[J].Genome Res,2009,19(2):336-346
[15] MacCallum I,et al.ALLPATHS 2:small genomes assembled accurately and with high continuity from short paired reads[J].Genome Biol,2009,10(10):R103
[16] Simpson J T,et al.ABySS:a parallel assembler for short-read sequence data[J].Genome Res,2009,19(6):1117-1123
[17] Zerbino D R,Birney E.Velvet:algorithms for de novo short-read assembly using de Bruijn graphs[J].Genome Res,2008,18(5):821-829
[18] Li R,et al.De novo assembly of human genomes with massively parallel shortread sequencing[J].Genome Res,2010,20(2):265-272
[19] Earl D A,et al.Assemblathon 1:a competitive assessment of de novo short-read assembly methods[J].Genome Res,2011,21(12):2224-2241
[20] Miller J,Koren S,Sutton G.Assembly algorithms for next-gene-ration sequencing data[J].Genomics,2010,95(6):315-27

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!