基于QSA数组计算序列中所有NE重复模式的算法

计算机科学 ›› 2014, Vol. 41 ›› Issue (3): 249-252.

基于QSA数组计算序列中所有NE重复模式的算法

木妮娜·玉素甫,古丽娜·玉素甫,张海军

新疆师范大学计算机科学技术学院乌鲁木齐830054;新疆师范大学教育科学学院乌鲁木齐830054;新疆师范大学计算机科学技术学院乌鲁木齐830054

出版日期:2018-11-14 发布日期:2018-11-14
基金资助:
本文受国家自然科学基金(61263044,61163045),新疆维吾尔自治区高校科研计划重点项目(XJEDU2011I40),新疆维吾尔自治区自然科学基金(2012211A056),新疆师范大学计算机应用技术重点学科招标课题(12XSXZ0602)资助

Algorithm for Computing All NE Repeats in Sequence Based on QSA Array

Munina YUSUFU,〓Gulina YUSUFU and ZHANG Hai-jun

Online:2018-11-14 Published:2018-11-14

摘要/Abstract

摘要： 序列中重复模式的识别与提取算法在数据挖掘、模式识别、数据压缩、生物信息学等领域中具有广泛的实际应用。提出一种全新的基于QSA数组计算所有带有约束条件的NE重复模式的算法RPT。算法设计中充分考虑了NE重复模式的特征,以建立特征和重复模式检测结果之间的统计联系；算法中的约束条件包括最小周期p_min和最大间距gmax,其可用于筛选符合条件的NE重复模式,并可按照递增序输出所有NE重复模式的出现位置。与已有的基于后缀索引的算法相比,此算法的空间效率得到了提高。在分类属性数据样本集上进行的实验表明,算法RPT对生物序列尤其是DNA序列以及维吾尔语Web文本中NE重复模式的识别与提取都很有效。

关键词: 重复模式,数据挖掘,统计特征,约束条件,生物计算,维吾尔语Web文本中图法分类号TP391文献标识码A

Abstract: The sequence repeating pattern recognition and extraction algorithms have a wide range of practical applications in the fields of data mining,pattern recognition,data compression and bioinformatics．This paper proposed a new algorithm RPT for computing all NE repeats with constraints based on QSA array．The characteristics of the NE repeats are fully considered in the algorithm design procedure for creating a statistical relation between the features and the repeats detection results．Algorithm constraints include minimum period p_min and maximum gap g_max which are used to filter the qualified NE repeats,and algorithm can output all occurrences of the NE repeats in ascending order．Compared with the existing algorithm based on suffix index,the space efficiency of the algorithm is improved．The experiments on classified data sets show that the algorithm RPT is effective for recognizing and extracting the NE repeats on biological sequences,in particular DNA sequences,as well as Uyghur Web texts．

Key words: Repeats,Data mining,Statistical characteristics,Constraints,Bioinformatics,Uyghur Web texts

木妮娜·玉素甫,古丽娜·玉素甫,张海军. 基于QSA数组计算序列中所有NE重复模式的算法[J]. 计算机科学, 2014, 41(3): 249-252. https://doi.org/

Munina YUSUFU,〓Gulina YUSUFU and ZHANG Hai-jun. Algorithm for Computing All NE Repeats in Sequence Based on QSA Array[J]. Computer Science, 2014, 41(3): 249-252. https://doi.org/

参考文献

[1] Benson G．Tandem repeats finder:a program to analyze DNA se-quences[J]．Nucleic Acids Research,1999,27(2):573-580
[2] Lander E S,Linton L M,Birren B,et al.Initial Sequencing and Analysis of the Human Genome[J]．Nature,2001,409(6822):860-921
[3] Price A L,Jones N C,Pevzner P A．De novo identification of repeat families in large genomes[J]．Bioinformatics,2005,21(Suppl.):351-358
[4] 霍红卫,王小武．DNA序列中基于适应性后缀树的重复体识别算法[J]．计算机学报,2010,33(4):747-754
[5] 纪震,周家锐,姜来,等．DNA序列数据压缩技术综述[J]．电子学报,2010,38(5):1113-1121
[6] Franek F,Smyth W F,Tang Yu-dong．Computing all repeats using suffix arrays[J]．Automata,Languages and Combinatorics,2003,8(4):579-591
[7] Narisawa K,Inenaga S,Bannai H,et al.Efficient computation of substring equivalence classes with suffix arrays[C]∥Proc．18th Annual Symp．Combinatorial Pattern Matching．2007:340-351
[8] Puglisi S J,Smyth W F,Yusufu M．Fast optimal algorithms for computing all the repeats in a string[J]．Mathematics in Computer Science,2010,3(4):373-389
[9] 胡吉祥,许洪波,刘悦,等．重复串特征提取算法及其在文本聚类中的应用[J]．计算机工程,2007,33(2):65-67
[10] Franek F,Holub J,Smyth W F,et al.Computing quasi suffix arrays[J]．J．Automata,Languages & Combinatorics,2003,8(4):593-606

Metrics

Viewed

Full text

Abstract

Cited

Shared

Discussed