一种寻找最近子串的快速种子集求精算法

doi:10.11896/j.issn.1002-137X.2016.05.049

计算机科学 ›› 2016, Vol. 43 ›› Issue (5): 261-264.doi: 10.11896/j.issn.1002-137X.2016.05.049

一种寻找最近子串的快速种子集求精算法

张懿璞,茹锋,王飚

长安大学电子控制与工程学院西安710064,长安大学电子控制与工程学院西安710064,长安大学电子控制与工程学院西安710064

出版日期:2018-12-01 发布日期:2018-12-01
基金资助:
本文受国家自然科学青年基金(61501058),陕西省自然科学基础研究计划项目(2014JM8328,2014JQ7269,2016JQ6075),中央高校基本业务费(310832161008)资助

Fast Seed Set Refinement Algorithm for Closest Substrings Discovery

ZHANG Yi-pu, RU Feng and WANG Biao

Online:2018-12-01 Published:2018-12-01

摘要/Abstract

摘要： 寻找序列中的最近子串对挖掘基因中特殊的功能位点和了解基因调控关系有着重要意义。提出了一种新的基于种子集求精的改进的期望最大化算法SCEM来对基因序列中的最近子串进行寻找。通过对输入序列聚类,将数据集分解为若干种子集,再使用改进的期望最大化算法对各种子集进行求精,SCEM最终可寻找到序列中的最近子串。真实数据和模拟数据实验表明,SCEM算法可以寻找到真实的最近子串,与随机投影等流行算法相比也能保证较高的性能和效率,并且可以有效解决较长的最近子串寻找问题。

关键词: 最近子串,种子集,聚类求精,期望最大化

Abstract: Finding the closest substrings in sequences is crucial for mining the specific functional sites in gene and understanding the gene regulatory relationship.This paper proposed a novel algorithm namely SCEM based on the improved expectation maximization algorithm of the seed sets refinement.SCEM divides the dataset into a series of seed sets by clustering the input sequences,then uses the improved EM algorithm to refine these seed sets and finds the closest substrings finally.Experiments using the real datasets and the simulated datasets demonstrat our algorithm can find the real closest substrings and has the high performance and efficiency comparing with the popular algorithm such as Random Projection.Moreover,SCEM also can solve the long closest substrings finding problem effectively.

Key words: Closest substring,Seed set,Cluster refinement,Expectation maximization

张懿璞,茹锋,王飚. 一种寻找最近子串的快速种子集求精算法[J]. 计算机科学, 2016, 43(5): 261-264. https://doi.org/10.11896/j.issn.1002-137X.2016.05.049

ZHANG Yi-pu, RU Feng and WANG Biao. Fast Seed Set Refinement Algorithm for Closest Substrings Discovery[J]. Computer Science, 2016, 43(5): 261-264. https://doi.org/10.11896/j.issn.1002-137X.2016.05.049

参考文献

[1] Park P J.ChIP-seq:advantages and challenges of a maturing technology[J].Nat Rev Genet,2009,10(10):669-80
[2] Zhang Yi-pu,Wang Ping.A Fast Cluster Motif Finding Algorithm for ChIP-Seq Data Sets[C]∥BioMed Research International,2015.2015
[3] Bailey T L,Elkan C.Fitting a mixture model by expectation maximization to discover motifs in bipolymers[C]∥ Second International Conference on Intelligent Systems for Molecurlar Bio-logy.1994:28-36
[4] Zambelli F,Pesole G,Pavesi G.Motif discovery and transcription factor binding sites before and after the next-generation sequencing era[J].Briefings in Bioinformatics,2013,14(2):225-237
[5] Zhang Yi-pu,Huo H,Yu Qiang.A Heuristic Cluster-based EM Algorithm for the Planted(l,d) Problem[J].Journal of Bioinformatics and Computational Biology,2013,11(4):1350009
[6] Pevzner P A,Sze S H.Combinatorial approaches to finding subtle signals in DNA sequences[C]∥ISMB.2000:269-278
[7] Reid J,Wernisch L.STEME:efficient EM to find motifs in large data sets[J].Nucleic Acids Research,2011,39(18):126
[8] Buhler J,Tompa M.Finding motifs using random projections[J].Journal of Computational Biology,2002,9(2):225-242
[9] Chen Kun,Zhang Xiao-jun.MCI Clustering Algorithm SolvingPlanted (1,d) Motif Identification[J].Journal of Henan University(Natural Science),2015,45(1):102-107(in Chinese) 陈昆,张小骏.MCL聚类算法求解植入(l,d)模体识别问题[J].河南大学学报(自然科学版),2015,45(1):102-107
[10] Sun De-cai,Wang Xiao-xia.Research on Filtering Algorithm of Approximate String Matching[J].Computer Technology and Development,2015(4):171-176(in Chinese) 孙德才,王晓霞.近似串匹配过滤算法研究[J].计算机技术与发展,2015(4):171-176
[11] Xu Yong-kang,Yang Guang-lu,Lu Song-feng,et al.Approxi-mate string matching algorithm based on compressed suffixarray[J].Computer Engineering and Applications,2015,51(23):139-142(in Chinese) 胥永康,杨光露,路松峰,等.基于压缩后缀数组的近似字符串匹配算法[J].计算机工程与应用,2015,1(23):139-142
[12] Chan H L,Lam T W,Sung W K,et al.Compressed indexes for approximate string matching[J].Algorithmica,2010,58(2):263-281
[13] Atallah M J,Grigorescu E,Wu Y.A lower-variance randomized algorithm for approximate string matching[J].Information Processing Letters,2013,113(18):690-692

Metrics

Viewed

Full text

Abstract

Cited

Shared

Discussed

一种寻找最近子串的快速种子集求精算法

Fast Seed Set Refinement Algorithm for Closest Substrings Discovery

PDF (PC)

摘要/Abstract

引用本文

使用本文

参考文献

相关文章 0

Metrics

本文评价

推荐阅读 0