一种寻找最近子串的快速种子集求精算法

doi:10.11896/j.issn.1002-137X.2016.05.049

Abstract

Abstract: Finding the closest substrings in sequences is crucial for mining the specific functional sites in gene and understanding the gene regulatory relationship.This paper proposed a novel algorithm namely SCEM based on the improved expectation maximization algorithm of the seed sets refinement.SCEM divides the dataset into a series of seed sets by clustering the input sequences,then uses the improved EM algorithm to refine these seed sets and finds the closest substrings finally.Experiments using the real datasets and the simulated datasets demonstrat our algorithm can find the real closest substrings and has the high performance and efficiency comparing with the popular algorithm such as Random Projection.Moreover,SCEM also can solve the long closest substrings finding problem effectively.

Key words: Closest substring,Seed set,Cluster refinement,Expectation maximization

ZHANG Yi-pu, RU Feng and WANG Biao. Fast Seed Set Refinement Algorithm for Closest Substrings Discovery[J].Computer Science, 2016, 43(5): 261-264.

References

[1] Park P J.ChIP-seq:advantages and challenges of a maturing technology[J].Nat Rev Genet,2009,10(10):669-80
[2] Zhang Yi-pu,Wang Ping.A Fast Cluster Motif Finding Algorithm for ChIP-Seq Data Sets[C]∥BioMed Research International,2015.2015
[3] Bailey T L,Elkan C.Fitting a mixture model by expectation maximization to discover motifs in bipolymers[C]∥ Second International Conference on Intelligent Systems for Molecurlar Bio-logy.1994:28-36
[4] Zambelli F,Pesole G,Pavesi G.Motif discovery and transcription factor binding sites before and after the next-generation sequencing era[J].Briefings in Bioinformatics,2013,14(2):225-237
[5] Zhang Yi-pu,Huo H,Yu Qiang.A Heuristic Cluster-based EM Algorithm for the Planted(l,d) Problem[J].Journal of Bioinformatics and Computational Biology,2013,11(4):1350009
[6] Pevzner P A,Sze S H.Combinatorial approaches to finding subtle signals in DNA sequences[C]∥ISMB.2000:269-278
[7] Reid J,Wernisch L.STEME:efficient EM to find motifs in large data sets[J].Nucleic Acids Research,2011,39(18):126
[8] Buhler J,Tompa M.Finding motifs using random projections[J].Journal of Computational Biology,2002,9(2):225-242
[9] Chen Kun,Zhang Xiao-jun.MCI Clustering Algorithm SolvingPlanted (1,d) Motif Identification[J].Journal of Henan University(Natural Science),2015,45(1):102-107(in Chinese) 陈昆,张小骏.MCL聚类算法求解植入(l,d)模体识别问题[J].河南大学学报(自然科学版),2015,45(1):102-107
[10] Sun De-cai,Wang Xiao-xia.Research on Filtering Algorithm of Approximate String Matching[J].Computer Technology and Development,2015(4):171-176(in Chinese) 孙德才,王晓霞.近似串匹配过滤算法研究[J].计算机技术与发展,2015(4):171-176
[11] Xu Yong-kang,Yang Guang-lu,Lu Song-feng,et al.Approxi-mate string matching algorithm based on compressed suffixarray[J].Computer Engineering and Applications,2015,51(23):139-142(in Chinese) 胥永康,杨光露,路松峰,等.基于压缩后缀数组的近似字符串匹配算法[J].计算机工程与应用,2015,1(23):139-142
[12] Chan H L,Lam T W,Sung W K,et al.Compressed indexes for approximate string matching[J].Algorithmica,2010,58(2):263-281
[13] Atallah M J,Grigorescu E,Wu Y.A lower-variance randomized algorithm for approximate string matching[J].Information Processing Letters,2013,113(18):690-692

Metrics

Viewed

Full text

Abstract

Cited

Shared

Discussed

Comments

Recommended 0

No Suggested Reading articles found!

Fast Seed Set Refinement Algorithm for Closest Substrings Discovery

PDF (PC)

Abstract

Cite this article

share this article

References

Related Articles 0

Metrics

Comments

Recommended 0