Computer Science ›› 2014, Vol. 41 ›› Issue (6): 243-249.doi: 10.11896/j.issn.1002-137X.2014.06.048

Previous Articles     Next Articles

Approximate String Matching Using Tail Matched q-gram

SUN De-cai and WANG Xiao-xia   

  • Online:2018-11-14 Published:2018-11-14

Abstract: Approximate string matching is a basic problem in computational biology,text retrieval and signal proces-sing,etc.,and how to improve the matching speed is a key issue up to now.Here,a new lossless q-gram filter was proposed to enhance the performance of finding true matches in a large text database for a given query.Q-gram index is employed as the index structure of large text database for its fast searching speed.To improve new filter’s performance by enhancing its filtration efficiency,some new features based on tail matched q-gram were extracted from text that includes true matches.In new filter,more irrelevant text is eliminated in filtration phase by using a filtration criterion which is enhanced by new extracted features,and the unfiltered text is verified by smith-waterman algorithm to locate the positions of all true matches.The experimental results demonstrate that the proposed filter’s filtration efficiency and performance are both enhanced.As a result,new filter algorithm has strong commonality and is suitable for approxi-mate string matching on condition of various matching error ratio.

Key words: Approximate string matching,Filter algorithm,q-gram filter,q-gram

[1] Dorneles C F,Goncalves R,Mello R D.Approximate data instance matching:a survey[J].Knowledge and Information Systems,2011,7(1):1-21
[2] Navarro G.A guided tour to approximate string matching[J].ACM Computing Surveys,2001,33(1):31-88
[3] Levenshtein V.Binary codes capable of correcting deletions,insertions,and reversals[J].Soviet Physics Doklady,1966,10(8):707-710
[4] Burkhardt S.Filter algorithms for approximate string matching[D].Saarland:Department of Computer Science,Saarland University,2002
[5] Tian Y,Tata S,Patel J M,et al.Practical methods for constructing suffix trees[J].VLDB Journal,2005,14(3):281-99
[6] Manber U,Myers G.Suffix arrays:A new method for on-linestring searches[J].SIAM Journal on Computing,1993,22(5):935-948
[7] Navarro G,Baeza-yates R.A practical q-gram index for text retrieval allowing errors[J].CLEI Electronic Journal,1998,1(2):1-16
[8] Kim M S,Whang K Y,Lee J G.n-Gram/2L-approximation:Atwo-level n-gram inverted index structure for approximate string matching[J].Computer Systems Science and Engineering,200722(6):365-379
[9] Kim Y,Park H,Shim K.Efficient processing of substring match queries with inverted variable-length gram indexes[J].Information Sciences,2013,244(0):119-141
[10] Wu S,Manber U.Fast text searching allowing errors[J].Communications of the ACM,1992,35(10):83-91
[11] Chang Y I,Chen J R,Hsu M T.A hash trie filter method for approximate string matching in genomic databases[J].Applied Intelligence,2010,33(1):21-38
[12] Burkhardt S,Crauser A,Ferragina P,et al.Q-gram based database searching using a suffix array[C]∥Proceedings of the Annual International Conference on Computational Molecular Bio-logy,RECOMB 99.New York,USA:ACM,1999:77-83
[13] Jokinen P,Ukkonen E.Two algorithms for approximate string matching in static texts[C]∥16th International Symposium Proceedings on Mathematical Foundations of Computer Science.Berlin,Germany:Springer-Verlag,1991:240-248
[14] Rasmussen K R,Stoye J,Myers E W.Efficient q-gram filters for finding all epsilon-matches over a given length[J].Journal of Computational Biology,2006,13(2):296-308
[15] Lu C W,Lu C L,Lee R C T.A new filtration method and a hybrid strategy for approximate string matching[J].Theoretical Computer Science,2013,481:9-17
[16] Altschul S F,Gish W,Miller W,et al.Basic local alignmentsearch tool[J].Journal of Molecular Biology,1990,215(3):403-410
[17] Altschul S F,Madden T L,Alejandro A S,et al.Gapped BLAST and PSI-BLAST:a new generation of protein database search programs[J].Nucleic Acids Research,1997,25(17):3389-3402
[18] Sutinen E,Tarhio J.Filtration with q-samples in approximatestring matching[C]∥Proceedings of 7th Annual Symposium on Combinatorial Pattern Matching,CPM 96.Berlin,Germany:Springer-Verlag,1996:50-63
[19] Ma B,Tromp J,Li M,et al.PatternHunter:faster and more sensitive homology search[J].Bioinformatics,2002,18(3):440-445
[20] Egidi L,Manzini G.Better spaced seeds using Quadratic Residues[J].Journal of Computer and System Sciences,2013,79(7):1144-1155
[21] Baeza-Yates R,Navarro G.Faster approximate string matching[J].Algorithmica,1999,23(2):127-158
[22] Myers E W.A sublinear algorithm for approximate keywordsearching[J].Algorithmica,1994,12(4/5):345-374
[23] Sutinen E,Szpankowski W.On the collapse of the q-gram filtra-tion[C]∥Proceedings of the International Conference on FUN with Algorithms.Waterloo,Canada:Carleton Scientific,1998:178-193
[24] Puglisi1S J,Smyth W F,Turpin A.Inverted files versus suffix arrays for locating patterns in primary memory[C]∥Procee-dings of the 13th Symposium on String Processing and Information Retrieval.Berlin,Germany:Springer Verlag,2006:122-133
[25] Karp R M,Rabin M O.Efficient randomized pattern-matchingalgorithms[J].IBM Journal of Research and Development,1987,31(2):249-260
[26] Smith T F,Waterman M S.Identification of common molecular subsequences[J].Journal of Molecular Biology,1981,147(1):195-197
[27] NCBI.UniGene Build #223,Homo sapiens[DB/OL].ftp.ncbi.nih.gov//repository/ UniGene/Homo_sapiens/Hs.seq.uniq.gz,2010-04-28

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!