Computer Science ›› 2017, Vol. 44 ›› Issue (1): 80-83. doi: 10.11896/j.issn.1002-137X.2017.01.015

• 6th China Conference on Data Mining, 2016 •

Integrated Feature Mining Based Approach for Calling Genomic Deletions

ZHANG Xiao-dong, LING Cheng and GAO Jing-yang

  1. College of Information Science and Technology, Beijing University of Chemical Technology, Beijing 100029
  • Online: 2018-11-13 Published: 2018-11-13
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61472026) and the Guangzhou Science and Technology Program (2014J4100081).

Abstract: With the development and application of high-throughput (next-generation) sequencing, many sequencing-based methods for calling genomic deletions have emerged. However, any single method is limited in applicability and falls short in precision and sensitivity. To address this, an integrated approach for calling deletions is proposed that combines feature mining grounded in multiple detection theories with a machine learning algorithm. First, several callers are applied to an individual's sequencing data, and their results are merged into an initial call set. Then, following the different detection strategies, sequence features of each deletion in the initial set are mined and extracted from the sequencing data. Finally, a machine learning model is trained to identify and remove false-positive deletions from the initial call set, yielding the final result set. Experiments on data from the 1000 Genomes Project show that, compared with a single caller such as Pindel or SVseq2, the proposed approach achieves both higher precision and higher sensitivity; compared with directly merging the call sets of multiple callers, it significantly improves precision at the cost of a slight loss of sensitivity.

Key words: Deletion, Feature mining, Integrated detection
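
The first step of the pipeline described in the abstract (merging the outputs of several callers into an initial call set) and the two reported metrics can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how calls are matched, so the 50% reciprocal-overlap criterion, the greedy merge, and the names `Deletion`, `merge_call_sets`, `precision_sensitivity`, `caller_a` and `caller_b` are all hypothetical, and the coordinates are made up for the example.

```python
from __future__ import annotations
from typing import NamedTuple


class Deletion(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def reciprocal_overlap(a: Deletion, b: Deletion) -> float:
    """Smaller of the two overlap fractions; 0.0 if disjoint or on different chromosomes."""
    if a.chrom != b.chrom:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start), shared / (b.end - b.start))


def merge_call_sets(call_sets: dict[str, list[Deletion]], min_ro: float = 0.5):
    """Greedy merge (an assumption): a call joins the first cluster whose
    representative it overlaps reciprocally by at least min_ro."""
    merged: list[tuple[Deletion, set[str]]] = []
    for caller, calls in call_sets.items():
        for call in calls:
            for rep, supporters in merged:
                if reciprocal_overlap(rep, call) >= min_ro:
                    supporters.add(caller)
                    break
            else:
                merged.append((call, {caller}))
    return merged


def precision_sensitivity(predicted, truth, min_ro: float = 0.5):
    """Precision = matched predictions / predictions; sensitivity = recovered truth / truth."""
    hit_pred = sum(any(reciprocal_overlap(p, t) >= min_ro for t in truth) for p in predicted)
    hit_truth = sum(any(reciprocal_overlap(p, t) >= min_ro for p in predicted) for t in truth)
    precision = hit_pred / len(predicted) if predicted else 0.0
    sensitivity = hit_truth / len(truth) if truth else 0.0
    return precision, sensitivity


# Hypothetical toy data: two callers and a two-deletion truth set.
call_sets = {
    "caller_a": [Deletion("chr1", 1000, 2000), Deletion("chr1", 5000, 5600)],
    "caller_b": [Deletion("chr1", 1050, 1980), Deletion("chr2", 300, 900)],
}
truth = [Deletion("chr1", 1000, 2000), Deletion("chr2", 10000, 11000)]

merged = merge_call_sets(call_sets)
initial_set = [rep for rep, _ in merged]
print(precision_sensitivity(initial_set, truth))
```

In the proposed approach, the false positives in such an initial set are removed by a trained machine learning model that uses features extracted from the sequencing data; the set of supporting callers recorded per merged call here is only one simple example of a feature that such a model could use.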

[1] EICHLER E E,NICKERSON D A,ALTSHULER D,et al.Completing the map of human genetic variation[J].Nature,2007,447(7141):161-165.
[2] CONRAD D F,PINTO D,REDON R,et al.Origins and functional impact of copy number variation in the human genome[J].Nature,2010,464(7289):704-712.
[3] PAK C H,DANKO T,ZHANG Y,et al.Human neuropsychiatric disease modeling using conditional deletion reveals synaptic transmission defects caused by heterozygous mutations in NRXN1[J].Cell Stem Cell,2015,17(3):316-328.
[4] LEE M Y,WON H S,BAEK J W,et al.Variety of prenatally diagnosed congenital heart disease in 22q11.2 deletion syndrome[J].Obstetrics & Gynecology Science,2014,57(1):11-16.
[5] ALKAN C,COE B P,EICHLER E E.Genome structural variation discovery and genotyping[J].Nature Reviews Genetics,2011,12(5):363-376.
[6] YE K,SCHULZ M H,LONG Q,et al.Pindel:a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads[J].Bioinformatics,2009,25(21):2865-2871.
[7] ZHANG J,WANG J,WU Y.An improved approach for accurate and efficient calling of structural variations with low-coverage sequence data[J].BMC Bioinformatics,2012,13(Suppl 6):1-11.
[8] RAUSCH T,ZICHNER T,SCHLATTL A,et al.DELLY:structural variant discovery by integrated paired-end and split-read analysis[J].Bioinformatics,2012,28(18):i333-i339.
[9] CHEN K,WALLIS J W,MCLELLAN M D,et al.BreakDancer:an algorithm for high-resolution mapping of genomic structural variation[J].Nature Methods,2009,6(9):677-681.
[10] ABYZOV A,URBAN A E,SNYDER M,et al.CNVnator:an approach to discover,genotype,and characterize typical and atypical CNVs from family and population genome sequencing[J].Genome Research,2011,21(6):974-984.
[11] HORMOZDIARI F,HAJIRASOULIHA I,DAO P,et al.Next-generation Variation Hunter:combinatorial algorithms for transposon insertion discovery[J].Bioinformatics,2010,26(12):i350-i357.
[12] LI H,DURBIN R.Fast and accurate short read alignment with Burrows-Wheeler transform[J].Bioinformatics,2009,25(14):1754-1760.
[13] LI H,HANDSAKER B,WYSOKER A,et al.The sequence alignment/map format and SAMtools[J].Bioinformatics,2009,25(16):2078-2079.
[14] CHANG C C,LIN C J.LIBSVM:A library for support vector machines[J].ACM Transactions on Intelligent Systems and Technology (TIST),2011,2(3):389-396.
[15] 1000 Genomes Project Consortium.An integrated map of genetic variation from 1092 human genomes[J].Nature,2012,491(7422):56-65.
