一种基于尾匹配q-gram的近似串匹配算法

doi:10.11896/j.issn.1002-137X.2014.06.048

摘要/Abstract

摘要： 近似串匹配是生物信息学、文本检索、信号处理等领域的一个基础问题,如何提高近似串匹配的速度一直都是研究的关键问题。提出一种新的在大文本库中快速查找近似匹配的无损过滤算法。为保证在大文本库中的匹配速度,本算法使用了查询速度较快的q-gram索引。为通过提高过滤算法的过滤效率达到提升算法整体性能的目的,详细分析了含有匹配串的文本区域,提取了一些基于尾匹配q-gram特征的新过滤条件,然后用这些特征优化了过滤算法的过滤标准。实验数据表明,新过滤条件有效地提高了算法的过滤效率,提升了算法的整体性能。结果显示新算法适合各种匹配错误率下的近似匹配,算法的通用性较强。

关键词: 近似串匹配,过滤算法,q-gram过滤,q元语法中图法分类号TP391.3文献标识码A

Abstract: Approximate string matching is a basic problem in computational biology,text retrieval and signal proces-sing,etc.,and how to improve the matching speed is a key issue up to now．Here,a new lossless q-gram filter was proposed to enhance the performance of finding true matches in a large text database for a given query．Q-gram index is employed as the index structure of large text database for its fast searching speed．To improve new filter’s performance by enhancing its filtration efficiency,some new features based on tail matched q-gram were extracted from text that includes true matches．In new filter,more irrelevant text is eliminated in filtration phase by using a filtration criterion which is enhanced by new extracted features,and the unfiltered text is verified by smith-waterman algorithm to locate the positions of all true matches．The experimental results demonstrate that the proposed filter’s filtration efficiency and performance are both enhanced．As a result,new filter algorithm has strong commonality and is suitable for approxi-mate string matching on condition of various matching error ratio.

Key words: Approximate string matching,Filter algorithm,q-gram filter,q-gram

孙德才,王晓霞. 一种基于尾匹配q-gram的近似串匹配算法[J]. 计算机科学, 2014, 41(6): 243-249. https://doi.org/10.11896/j.issn.1002-137X.2014.06.048

SUN De-cai and WANG Xiao-xia. Approximate String Matching Using Tail Matched q-gram[J]. Computer Science, 2014, 41(6): 243-249. https://doi.org/10.11896/j.issn.1002-137X.2014.06.048

参考文献

[1] Dorneles C F,Goncalves R,Mello R D．Approximate data instance matching:a survey[J]．Knowledge and Information Systems,2011,7(1):1-21
[2] Navarro G．A guided tour to approximate string matching[J]．ACM Computing Surveys,2001,33(1):31-88
[3] Levenshtein V．Binary codes capable of correcting deletions,insertions,and reversals[J]．Soviet Physics Doklady,1966,10(8):707-710
[4] Burkhardt S．Filter algorithms for approximate string matching[D]．Saarland:Department of Computer Science,Saarland University,2002
[5] Tian Y,Tata S,Patel J M,et al．Practical methods for constructing suffix trees[J]．VLDB Journal,2005,14(3):281-99
[6] Manber U,Myers G．Suffix arrays:A new method for on-linestring searches[J]．SIAM Journal on Computing,1993,22(5):935-948
[7] Navarro G,Baeza-yates R．A practical q-gram index for text retrieval allowing errors[J]．CLEI Electronic Journal,1998,1(2):1-16
[8] Kim M S,Whang K Y,Lee J G．n-Gram/2L-approximation:Atwo-level n-gram inverted index structure for approximate string matching[J]．Computer Systems Science and Engineering,200722(6):365-379
[9] Kim Y,Park H,Shim K．Efficient processing of substring match queries with inverted variable-length gram indexes[J]．Information Sciences,2013,244(0):119-141
[10] Wu S,Manber U．Fast text searching allowing errors[J]．Communications of the ACM,1992,35(10):83-91
[11] Chang Y I,Chen J R,Hsu M T．A hash trie filter method for approximate string matching in genomic databases[J]．Applied Intelligence,2010,33(1):21-38
[12] Burkhardt S,Crauser A,Ferragina P,et al．Q-gram based database searching using a suffix array[C]∥Proceedings of the Annual International Conference on Computational Molecular Bio-logy,RECOMB 99．New York,USA:ACM,1999:77-83
[13] Jokinen P,Ukkonen E．Two algorithms for approximate string matching in static texts[C]∥16th International Symposium Proceedings on Mathematical Foundations of Computer Science．Berlin,Germany:Springer-Verlag,1991:240-248
[14] Rasmussen K R,Stoye J,Myers E W．Efficient q-gram filters for finding all epsilon-matches over a given length[J]．Journal of Computational Biology,2006,13(2):296-308
[15] Lu C W,Lu C L,Lee R C T．A new filtration method and a hybrid strategy for approximate string matching[J]．Theoretical Computer Science,2013,481:9-17
[16] Altschul S F,Gish W,Miller W,et al．Basic local alignmentsearch tool[J]．Journal of Molecular Biology,1990,215(3):403-410
[17] Altschul S F,Madden T L,Alejandro A S,et al．Gapped BLAST and PSI-BLAST:a new generation of protein database search programs[J]．Nucleic Acids Research,1997,25(17):3389-3402
[18] Sutinen E,Tarhio J．Filtration with q-samples in approximatestring matching[C]∥Proceedings of 7th Annual Symposium on Combinatorial Pattern Matching,CPM 96．Berlin,Germany:Springer-Verlag,1996:50-63
[19] Ma B,Tromp J,Li M,et al．PatternHunter:faster and more sensitive homology search[J]．Bioinformatics,2002,18(3):440-445
[20] Egidi L,Manzini G．Better spaced seeds using Quadratic Residues[J]．Journal of Computer and System Sciences,2013,79(7):1144-1155
[21] Baeza-Yates R,Navarro G．Faster approximate string matching[J]．Algorithmica,1999,23(2):127-158
[22] Myers E W．A sublinear algorithm for approximate keywordsearching[J]．Algorithmica,1994,12(4/5):345-374
[23] Sutinen E,Szpankowski W．On the collapse of the q-gram filtra-tion[C]∥Proceedings of the International Conference on FUN with Algorithms．Waterloo,Canada:Carleton Scientific,1998:178-193
[24] Puglisi1S J,Smyth W F,Turpin A．Inverted files versus suffix arrays for locating patterns in primary memory[C]∥Procee-dings of the 13th Symposium on String Processing and Information Retrieval．Berlin,Germany:Springer Verlag,2006:122-133
[25] Karp R M,Rabin M O．Efficient randomized pattern-matchingalgorithms[J]．IBM Journal of Research and Development,1987,31(2):249-260
[26] Smith T F,Waterman M S．Identification of common molecular subsequences[J]．Journal of Molecular Biology,1981,147(1):195-197
[27] NCBI．UniGene Build #223,Homo sapiens[DB/OL]．ftp.ncbi.nih.gov//repository/ UniGene/Homo_sapiens/Hs.seq.uniq.gz,2010-04-28

Metrics

Viewed

Full text

Abstract

Cited

Shared

Discussed