Computer Science ›› 2014, Vol. 41 ›› Issue (3): 249-252.

Previous Articles     Next Articles

Algorithm for Computing All NE Repeats in Sequence Based on QSA Array

Munina YUSUFU,〓Gulina YUSUFU and ZHANG Hai-jun   

  • Online:2018-11-14 Published:2018-11-14

Abstract: The sequence repeating pattern recognition and extraction algorithms have a wide range of practical applications in the fields of data mining,pattern recognition,data compression and bioinformatics.This paper proposed a new algorithm RPT for computing all NE repeats with constraints based on QSA array.The characteristics of the NE repeats are fully considered in the algorithm design procedure for creating a statistical relation between the features and the repeats detection results.Algorithm constraints include minimum period pmin and maximum gap gmax which are used to filter the qualified NE repeats,and algorithm can output all occurrences of the NE repeats in ascending order.Compared with the existing algorithm based on suffix index,the space efficiency of the algorithm is improved.The experiments on classified data sets show that the algorithm RPT is effective for recognizing and extracting the NE repeats on biological sequences,in particular DNA sequences,as well as Uyghur Web texts.

Key words: Repeats,Data mining,Statistical characteristics,Constraints,Bioinformatics,Uyghur Web texts

[1] Benson G.Tandem repeats finder:a program to analyze DNA se-quences[J].Nucleic Acids Research,1999,27(2):573-580
[2] Lander E S,Linton L M,Birren B,et al.Initial Sequencing and Analysis of the Human Genome[J].Nature,2001,409(6822):860-921
[3] Price A L,Jones N C,Pevzner P A.De novo identification of repeat families in large genomes[J].Bioinformatics,2005,21(Suppl.):351-358
[4] 霍红卫,王小武.DNA序列中基于适应性后缀树的重复体识别算法[J].计算机学报,2010,33(4):747-754
[5] 纪震,周家锐,姜来,等.DNA序列数据压缩技术综述[J].电子学报,2010,38(5):1113-1121
[6] Franek F,Smyth W F,Tang Yu-dong.Computing all repeats using suffix arrays[J].Automata,Languages and Combinatorics,2003,8(4):579-591
[7] Narisawa K,Inenaga S,Bannai H,et al.Efficient computation of substring equivalence classes with suffix arrays[C]∥Proc.18th Annual Symp.Combinatorial Pattern Matching.2007:340-351
[8] Puglisi S J,Smyth W F,Yusufu M.Fast optimal algorithms for computing all the repeats in a string[J].Mathematics in Computer Science,2010,3(4):373-389
[9] 胡吉祥,许洪波,刘悦,等.重复串特征提取算法及其在文本聚类中的应用[J].计算机工程,2007,33(2):65-67
[10] Franek F,Holub J,Smyth W F,et al.Computing quasi suffix arrays[J].J.Automata,Languages & Combinatorics,2003,8(4):593-606

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!