基于QSA数组计算序列中所有NE重复模式的算法

Abstract

Abstract: The sequence repeating pattern recognition and extraction algorithms have a wide range of practical applications in the fields of data mining,pattern recognition,data compression and bioinformatics．This paper proposed a new algorithm RPT for computing all NE repeats with constraints based on QSA array．The characteristics of the NE repeats are fully considered in the algorithm design procedure for creating a statistical relation between the features and the repeats detection results．Algorithm constraints include minimum period p_min and maximum gap g_max which are used to filter the qualified NE repeats,and algorithm can output all occurrences of the NE repeats in ascending order．Compared with the existing algorithm based on suffix index,the space efficiency of the algorithm is improved．The experiments on classified data sets show that the algorithm RPT is effective for recognizing and extracting the NE repeats on biological sequences,in particular DNA sequences,as well as Uyghur Web texts．

Key words: Repeats,Data mining,Statistical characteristics,Constraints,Bioinformatics,Uyghur Web texts

Munina YUSUFU,〓Gulina YUSUFU and ZHANG Hai-jun. Algorithm for Computing All NE Repeats in Sequence Based on QSA Array[J].Computer Science, 2014, 41(3): 249-252.

References

[1] Benson G．Tandem repeats finder:a program to analyze DNA se-quences[J]．Nucleic Acids Research,1999,27(2):573-580
[2] Lander E S,Linton L M,Birren B,et al.Initial Sequencing and Analysis of the Human Genome[J]．Nature,2001,409(6822):860-921
[3] Price A L,Jones N C,Pevzner P A．De novo identification of repeat families in large genomes[J]．Bioinformatics,2005,21(Suppl.):351-358
[4] 霍红卫,王小武．DNA序列中基于适应性后缀树的重复体识别算法[J]．计算机学报,2010,33(4):747-754
[5] 纪震,周家锐,姜来,等．DNA序列数据压缩技术综述[J]．电子学报,2010,38(5):1113-1121
[6] Franek F,Smyth W F,Tang Yu-dong．Computing all repeats using suffix arrays[J]．Automata,Languages and Combinatorics,2003,8(4):579-591
[7] Narisawa K,Inenaga S,Bannai H,et al.Efficient computation of substring equivalence classes with suffix arrays[C]∥Proc．18th Annual Symp．Combinatorial Pattern Matching．2007:340-351
[8] Puglisi S J,Smyth W F,Yusufu M．Fast optimal algorithms for computing all the repeats in a string[J]．Mathematics in Computer Science,2010,3(4):373-389
[9] 胡吉祥,许洪波,刘悦,等．重复串特征提取算法及其在文本聚类中的应用[J]．计算机工程,2007,33(2):65-67
[10] Franek F,Holub J,Smyth W F,et al.Computing quasi suffix arrays[J]．J．Automata,Languages & Combinatorics,2003,8(4):593-606

Metrics

Viewed

Full text

Abstract

Cited

Shared

Discussed

Comments

Recommended 0

No Suggested Reading articles found!

Algorithm for Computing All NE Repeats in Sequence Based on QSA Array

PDF (PC)

Abstract

Cite this article

share this article

References

Related Articles 0

Metrics

Comments

Recommended 0