计算机科学 ›› 2016, Vol. 43 ›› Issue (5): 261-264.doi: 10.11896/j.issn.1002-137X.2016.05.049

• 人工智能 • 上一篇    下一篇

一种寻找最近子串的快速种子集求精算法

张懿璞,茹锋,王飚   

  1. 长安大学电子控制与工程学院 西安710064,长安大学电子控制与工程学院 西安710064,长安大学电子控制与工程学院 西安710064
  • 出版日期:2018-12-01 发布日期:2018-12-01
  • 基金资助:
    本文受国家自然科学青年基金(61501058),陕西省自然科学基础研究计划项目(2014JM8328,2014JQ7269,2016JQ6075),中央高校基本业务费(310832161008)资助

Fast Seed Set Refinement Algorithm for Closest Substrings Discovery

ZHANG Yi-pu, RU Feng and WANG Biao   

  • Online:2018-12-01 Published:2018-12-01

摘要: 寻找序列中的最近子串对挖掘基因中特殊的功能位点和了解基因调控关系有着重要意义。提出了一种新的基于种子集求精的改进的期望最大化算法SCEM来对基因序列中的最近子串进行寻找。通过对输入序列聚类,将数据集分解为若干种子集,再使用改进的期望最大化算法对各种子集进行求精,SCEM最终可寻找到序列中的最近子串。真实数据和模拟数据实验表明,SCEM算法可以寻找到真实的最近子串,与随机投影等流行算法相比也能保证较高的性能和效率,并且可以有效解决较长的最近子串寻找问题。

关键词: 最近子串,种子集,聚类求精,期望最大化

Abstract: Finding the closest substrings in sequences is crucial for mining the specific functional sites in gene and understanding the gene regulatory relationship.This paper proposed a novel algorithm namely SCEM based on the improved expectation maximization algorithm of the seed sets refinement.SCEM divides the dataset into a series of seed sets by clustering the input sequences,then uses the improved EM algorithm to refine these seed sets and finds the closest substrings finally.Experiments using the real datasets and the simulated datasets demonstrat our algorithm can find the real closest substrings and has the high performance and efficiency comparing with the popular algorithm such as Random Projection.Moreover,SCEM also can solve the long closest substrings finding problem effectively.

Key words: Closest substring,Seed set,Cluster refinement,Expectation maximization

[1] Park P J.ChIP-seq:advantages and challenges of a maturing technology[J].Nat Rev Genet,2009,10(10):669-80
[2] Zhang Yi-pu,Wang Ping.A Fast Cluster Motif Finding Algorithm for ChIP-Seq Data Sets[C]∥BioMed Research International,2015.2015
[3] Bailey T L,Elkan C.Fitting a mixture model by expectation maximization to discover motifs in bipolymers[C]∥ Second International Conference on Intelligent Systems for Molecurlar Bio-logy.1994:28-36
[4] Zambelli F,Pesole G,Pavesi G.Motif discovery and transcription factor binding sites before and after the next-generation sequencing era[J].Briefings in Bioinformatics,2013,14(2):225-237
[5] Zhang Yi-pu,Huo H,Yu Qiang.A Heuristic Cluster-based EM Algorithm for the Planted(l,d) Problem[J].Journal of Bioinformatics and Computational Biology,2013,11(4):1350009
[6] Pevzner P A,Sze S H.Combinatorial approaches to finding subtle signals in DNA sequences[C]∥ISMB.2000:269-278
[7] Reid J,Wernisch L.STEME:efficient EM to find motifs in large data sets[J].Nucleic Acids Research,2011,39(18):126
[8] Buhler J,Tompa M.Finding motifs using random projections[J].Journal of Computational Biology,2002,9(2):225-242
[9] Chen Kun,Zhang Xiao-jun.MCI Clustering Algorithm SolvingPlanted (1,d) Motif Identification[J].Journal of Henan University(Natural Science),2015,45(1):102-107(in Chinese) 陈昆,张小骏.MCL聚类算法求解植入(l,d)模体识别问题[J].河南大学学报(自然科学版),2015,45(1):102-107
[10] Sun De-cai,Wang Xiao-xia.Research on Filtering Algorithm of Approximate String Matching[J].Computer Technology and Development,2015(4):171-176(in Chinese) 孙德才,王晓霞.近似串匹配过滤算法研究[J].计算机技术与发展,2015(4):171-176
[11] Xu Yong-kang,Yang Guang-lu,Lu Song-feng,et al.Approxi-mate string matching algorithm based on compressed suffixarray[J].Computer Engineering and Applications,2015,51(23):139-142(in Chinese) 胥永康,杨光露,路松峰,等.基于压缩后缀数组的近似字符串匹配算法[J].计算机工程与应用,2015,1(23):139-142
[12] Chan H L,Lam T W,Sung W K,et al.Compressed indexes for approximate string matching[J].Algorithmica,2010,58(2):263-281
[13] Atallah M J,Grigorescu E,Wu Y.A lower-variance randomized algorithm for approximate string matching[J].Information Processing Letters,2013,113(18):690-692

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!