计算机科学 ›› 2017, Vol. 44 ›› Issue (10): 55-58, 84.doi: 10.11896/j.issn.1002-137X.2017.10.010

• 生物信息学 • 上一篇    下一篇

多序列星比对算法的改进及其在Spark中的并行化研究

董改芳,付学良,李宏慧   

  1. 内蒙古农业大学计算机与信息工程学院 呼和浩特010018,内蒙古农业大学计算机与信息工程学院 呼和浩特010018,内蒙古农业大学计算机与信息工程学院 呼和浩特010018
  • 出版日期:2018-12-01 发布日期:2018-12-01
  • 基金资助:
    本文受国家自然科学基金(61063004,61363006),内蒙古自然科学基金(2015MS0605,2015MS0626,2015MS0627),内蒙古教育厅高校研究项目(NJZC059),教育部留学人员基金([2014]1685),内蒙古自治区科技计划项目:穿透降水量GSM网络在线监测与数据传输系统的研制资助

Improvement of Multiple Sequence Center Star Method and Its Parallelization in Spark

DONG Gai-fang, FU Xue-liang and LI Hong-hui   

  • Online:2018-12-01 Published:2018-12-01

摘要: 多序列星比对算法在确定中心序列时需要计算任意两个输入序列的距离及分数,其较高的时间复杂度 耗费了大量时间,因此提出了通过综合计算每个序列产生的k-mers及各个k-mer在各序列中出现的次数来确定k-mers的拼接选择,由k-mers进行拼接从而 得到中心序列。进而,在双序列比对过程中采用搜索两个序列最大相似子串的思想,改进的星比对算法的精度在一定程度上得到了明显提升。接着, 将改进的星比对算法在Spark中进行并行化设计与实现。采用Spark的Yarn-Client运行模式,对正常人线粒体的多组数据进行实验,分析了算法性能上的不足及改进方向。

关键词: 多序列比对,星比对算法,K-mer,Spark,RDD

Abstract: Because center star alignment algorithm needs to calculate the distance and scores of any two input sequences when determining the central sequence,it caused the high time complexity.A strategy for determining the assembling selection of k-mers was proposed by synthesizing computing the k-mers generated by each sequence and the number of occurrences of each k-mer in each sequence.Furthermore,in the process of pair wise sequence alignment,the idea of searching two sequences of the largest similar sub-sequences was used.The accuracy of the improved center star alignment algorithm is improved with a certain degree.The improved center star alignment algorithm was parallelized designed and implemented in Spark.Spark’s Yarn-Client running mode was used to experiment the multi-group data of normal mitochondria.The performance of the algorithm was analyzed and the direction of improvement was analyzed.

Key words: Multiple sequence alignment,Center star method,K-mer,Spark,RDD

[1] Kevin.生物序列比对研究的现实意义[EB/OL].http://blog.sina.com.cn/s/blog_ 5f6687090 1013mvp.html.2012.05.07.
[2] 生物大数据激增或揭示疾病如何发生[EB/OL].http://www.cbdio.com/BigData /2016-08/17/content_5191896.htm.
[3] BLAZEWICZ J,FROHMBERG W,KIERZYNKA M,et al.G-MSA-A GPU-based,fast and accurate algorithm for multiple sequence alignment[J].Journal of Parallel and Distributed Computing,2013,3(1):32-41.
[4] ZHU X Y,LI K L,AHMAD S.A data parallel strategy for aligning multiple biological sequences on multi-core computers[J].Computers in Biology and Medicine,2013,3(4):350-361.
[5] GUDY A,DEOROWICZ S.QuickProbs-A Fast Multiple Se-quence Alignment Algorithm Designed for Graphics Processors[J].PLoS One,2014,9(2):e88901.
[6] SADASIVAM,SUDHA G,BAKTAVATCHALAM G.A novel approach to Multiple Sequence Alignment using hadoop data grids[J].International Journal of Bioinformatics Research and Applications,2010,6(5):472-483.
[7] SHU N J,ARNE E.KalignP:Improved multiple sequence alignments using position specific gap penalties in Kalign2[J].Bioinformatics,2011,27(12):1702-1703.
[8] FABIAN S,DAVID D,ANDREAS W,et al.Making automated multiple alignments of very large numbers of protein sequences [J].Bioinformatics,2013,9(8):989-995.
[9] KAZUTAKA K,TOH,HIROYUKI.Parallelization of the MAFFTmultiple sequence alignment program[J].Bioinformatics,2010,6(15):1899-1900.
[10] LAN H D,CHAN Y D,XU K,et al.Parallel algorithms forlarge-scale biological sequence alignment on Xeon-Phi based clusters[J].BMC Bioinformatics,2016,17(9):267.
[11] 王洋子豪.知乎[EB/OL].https://www.zhihu.com/question/19903344 /answer/13779421.
[12] YANG C Y,ZHONG C.CPU and GPU co-ordinate parallel accelerate multiple biological sequence alignment[J].Journal of Chinese Mini-Micro Computer Systems,2016,7(12):2780-2784.(in Chinese) 杨春燕,钟诚.CPU和GPU协同并行加速多生物序列比对[J].小型微型计算机系统,2016,7(12):2780-2784.
[13] LIN J,TANG M,TONG R F.GPU accelerate biological se- quence alignment[J].Journal of Computer-Aided Design and Computer Graphics,2010,2(3):420-427.(in Chinese) 林江,唐敏,童若锋.GPU加速的生物序列比对[J].计算机辅助设计与图形学学报,2010,2(3):420-427.
[14] YE X C,LIN W,FAN D R,et al.Parallel optimization of protein sequence alignment algorithm in public nucleus structure[J].Journal of Software,2010,1(12):3094-3105.(in Chinese) 叶笑春,林伟,范东睿,等.蛋白质序列比对算法在众核结构上的并行优化[J].软件学报,2010,21(12):3094-3105.
[15] CAO Z Y,LANG X Y,LIU X,et al.Parallel optimization of super-large-scale sequence alignment calculation[J].Journal of Computer Applications,2011,1(2):32-35.(in Chinese) 曹宗雁,郎显宇,刘昕,等.超大规模序列比对计算的并行优化[J].计算机应用,2011,1(2):32-35.
[16] JIN X,LUO Z G,JIANG X Z,et al.Parallel Design and Implementation of Multi-Sequence Alignment Software T_Coffee[J].Computer Applications and Software,2008,25(4):221-223.(in Chinese) 靳新,骆志刚,蒋晓舟,等.多序列比对软件T_Coffee的并行化设计与实现[J].计算机应用与软件,2008,25(4):221-223.
[17] WEI S F, LIU y, JIANG C Y. Parallel Research on Multiple Se quence Alignment of Genetic Annealing Based on GPU [J]. Computer Engineering and Design, 2014, 35 (4): 1247-1252. (in Chinese) 韦树烽,刘羽,蒋财运.基于GPU的遗传退火多序列比对并行研究[1].计算机工程与设计,2014,35(4):1247-1252.
[18] WANG W D,TANG W,DUAN B,et al.Study on Parallel Acceleration Technique of High-throughput Gene Sequences based on HASH index[J].Journal of Computer Research and Deve-lopment,2013,50(11):2463-2471.(in Chinese) 王文迪,汤文,段勃,等.基于Hash索引的高通量基因序列比对并行加速技术研究[J].计算机研究与发展,2013,0(11):2463-2471.
[19] ZHANG L,CHAI H,WO L K,et al.Research on Biological Sequence Alignment Algorithm and Graphics Hardware Acceleration[J].Chinese Journal of Biomedical Engineering,2011,30(6):853-858.(in Chinese) 张林,柴惠,沃立科,等.生物序列比对算法与图形硬件加速研究[J].中国生物医学工程学报,2011,0(6):853-858.
[20] ZHU X Y,LI R F,LI K L,et al.Research on Parallel Processing of Biological Sequence Alignment Based on Heterogeneous System[J].Computer Science,2015,2(11):390-399.(in Chinese) 朱香元,李仁发,李肯立,等.基于异构系统的生物序列比对并行处理研究进展[J].计算机科学,2015,2(11):390-399.
[21] ABUN J M,PENA T F,PICHEL J C.PASTASpark:multiple sequence alignment meets Big Data[J].Bioinformatics,2017.
[22] ZAMBRANO-VEGA C,NEBRO A J,GARCA-NIETO J,et al.M2Align:parallel multiple sequence alignment with a multi-objective metaheuristic[J].Bioinformatics,2017.
[23] ZOU Q,HU Q,GUO M,et al.HAlign:Fast multiple similarDNA/RNA sequence alignment based on the centre star strategy[J].Bioinformatics,2015,1(15):2475-2481.
[24] LI D P,JU Y,ZOU Q.Application in tumor research of multiple sequence star alignment method based on MapReduce[J].Can-cer Progress,2016,4(6):510-513.(in Chinese) 李大鹏,鞠颖,邹权.基于MapReduce的多序列星比对方法在肿瘤研究中的应用[J].癌症进展,2016,4(6):510-513.
[25] ZOU Q,GUO M,WANG X,et al.An Algorithm for DNA Multiple Sequence Alignment Based on Center Star Method and Keyword Tree[J].Acta Electronica Sinica,2009(37):1746-1750.
[26] CHEN X,WANG C,TANG S,et al.CMSA:a heterogeneous CPU/GPU computing system for multiple similar RNA/DNA sequence alignment[J].Bmc Bioinformatics,2017,18(1):315.
[27] ZOU Q,SHAN X,JIANG Y. A Novel Center Star Multiple Se quence Alignment Algorithm Based on AffineGap Penalty and K-Band[J].Physics Proced,2012,33:322-327.
No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
[1] 雷丽晖,王静. 可能性测度下的LTL模型检测并行化研究[J]. 计算机科学, 2018, 45(4): 71 -75, 88 .
[2] 夏庆勋,庄毅. 一种基于局部性原理的远程验证机制[J]. 计算机科学, 2018, 45(4): 148 -151, 162 .
[3] 厉柏伸,李领治,孙涌,朱艳琴. 基于伪梯度提升决策树的内网防御算法[J]. 计算机科学, 2018, 45(4): 157 -162 .
[4] 王欢,张云峰,张艳. 一种基于CFDs规则的修复序列快速判定方法[J]. 计算机科学, 2018, 45(3): 311 -316 .
[5] 孙启,金燕,何琨,徐凌轩. 用于求解混合车辆路径问题的混合进化算法[J]. 计算机科学, 2018, 45(4): 76 -82 .
[6] 张佳男,肖鸣宇. 带权混合支配问题的近似算法研究[J]. 计算机科学, 2018, 45(4): 83 -88 .
[7] 伍建辉,黄中祥,李武,吴健辉,彭鑫,张生. 城市道路建设时序决策的鲁棒优化[J]. 计算机科学, 2018, 45(4): 89 -93 .
[8] 刘琴. 计算机取证过程中基于约束的数据质量问题研究[J]. 计算机科学, 2018, 45(4): 169 -172 .
[9] 钟菲,杨斌. 基于主成分分析网络的车牌检测方法[J]. 计算机科学, 2018, 45(3): 268 -273 .
[10] 史雯隽,武继刚,罗裕春. 针对移动云计算任务迁移的快速高效调度算法[J]. 计算机科学, 2018, 45(4): 94 -99, 116 .