Computer Science ›› 2016, Vol. 43 ›› Issue (5): 261-264.doi: 10.11896/j.issn.1002-137X.2016.05.049

Previous Articles     Next Articles

Fast Seed Set Refinement Algorithm for Closest Substrings Discovery

ZHANG Yi-pu, RU Feng and WANG Biao   

  • Online:2018-12-01 Published:2018-12-01

Abstract: Finding the closest substrings in sequences is crucial for mining the specific functional sites in gene and understanding the gene regulatory relationship.This paper proposed a novel algorithm namely SCEM based on the improved expectation maximization algorithm of the seed sets refinement.SCEM divides the dataset into a series of seed sets by clustering the input sequences,then uses the improved EM algorithm to refine these seed sets and finds the closest substrings finally.Experiments using the real datasets and the simulated datasets demonstrat our algorithm can find the real closest substrings and has the high performance and efficiency comparing with the popular algorithm such as Random Projection.Moreover,SCEM also can solve the long closest substrings finding problem effectively.

Key words: Closest substring,Seed set,Cluster refinement,Expectation maximization

[1] Park P J.ChIP-seq:advantages and challenges of a maturing technology[J].Nat Rev Genet,2009,10(10):669-80
[2] Zhang Yi-pu,Wang Ping.A Fast Cluster Motif Finding Algorithm for ChIP-Seq Data Sets[C]∥BioMed Research International,2015.2015
[3] Bailey T L,Elkan C.Fitting a mixture model by expectation maximization to discover motifs in bipolymers[C]∥ Second International Conference on Intelligent Systems for Molecurlar Bio-logy.1994:28-36
[4] Zambelli F,Pesole G,Pavesi G.Motif discovery and transcription factor binding sites before and after the next-generation sequencing era[J].Briefings in Bioinformatics,2013,14(2):225-237
[5] Zhang Yi-pu,Huo H,Yu Qiang.A Heuristic Cluster-based EM Algorithm for the Planted(l,d) Problem[J].Journal of Bioinformatics and Computational Biology,2013,11(4):1350009
[6] Pevzner P A,Sze S H.Combinatorial approaches to finding subtle signals in DNA sequences[C]∥ISMB.2000:269-278
[7] Reid J,Wernisch L.STEME:efficient EM to find motifs in large data sets[J].Nucleic Acids Research,2011,39(18):126
[8] Buhler J,Tompa M.Finding motifs using random projections[J].Journal of Computational Biology,2002,9(2):225-242
[9] Chen Kun,Zhang Xiao-jun.MCI Clustering Algorithm SolvingPlanted (1,d) Motif Identification[J].Journal of Henan University(Natural Science),2015,45(1):102-107(in Chinese) 陈昆,张小骏.MCL聚类算法求解植入(l,d)模体识别问题[J].河南大学学报(自然科学版),2015,45(1):102-107
[10] Sun De-cai,Wang Xiao-xia.Research on Filtering Algorithm of Approximate String Matching[J].Computer Technology and Development,2015(4):171-176(in Chinese) 孙德才,王晓霞.近似串匹配过滤算法研究[J].计算机技术与发展,2015(4):171-176
[11] Xu Yong-kang,Yang Guang-lu,Lu Song-feng,et al.Approxi-mate string matching algorithm based on compressed suffixarray[J].Computer Engineering and Applications,2015,51(23):139-142(in Chinese) 胥永康,杨光露,路松峰,等.基于压缩后缀数组的近似字符串匹配算法[J].计算机工程与应用,2015,1(23):139-142
[12] Chan H L,Lam T W,Sung W K,et al.Compressed indexes for approximate string matching[J].Algorithmica,2010,58(2):263-281
[13] Atallah M J,Grigorescu E,Wu Y.A lower-variance randomized algorithm for approximate string matching[J].Information Processing Letters,2013,113(18):690-692

No related articles found!
Full text



No Suggested Reading articles found!