Computer Science ›› 2014, Vol. 41 ›› Issue (9): 279-284.doi: 10.11896/j.issn.1002-137X.2014.09.053

Previous Articles     Next Articles

Q-gram Index for Approximate String Matching with Multi-seeds

SUN De-cai and WANG Xiao-xia   

  • Online:2018-11-14 Published:2018-11-14

Abstract: How to find out all approximate strings of a given string from a big text database quickly is a key issue in the age of big data.Approximate string matching algorithm based on multi-seeds is researched for its fast searching speed.But it is difficult to process large text database due to its huge memory consumption.Here,a new q-gram index was proposed to solve the problem of multi-seeds used in approximate string matching algorithms.In the proposed index,the addresses of consecutive seed with arbitrary length can be computed out quickly.The experimental results demon-strate that the space consumption is decreased largely.As a result,the proposed index is of great practicality to deal with large database in the age of big data.

Key words: Big data,Approximate string matching,Seed,Q-gram index,Multi-seeds index

[1] Dorneles C F,Goncalves R,Mello R D.Approximate data instance matching:a survey[J].Knowledge and Information Systems,2011,7(1):1-21
[2] Navarro G.A guided tour to approximate string matching[J].ACM Computing Surveys,2001,33(1):31-88
[3] Lu C W,Lu C L,Lee R C T.A new filtration method and a hybrid strategy for approximate string matching[J]. Springer Berlin Heidelberg,2013,20:143-155
[4] Levenshtein V.Binary codes capable of correcting deletions,insertions,and reversals[J].Soviet Physics Doklady,1966,10(8):707-710
[5] Burkhardt S.Filter algorithms for approximate string matching[D].Saarland:Department of Computer Science,Saarland University,2002
[6] Tian Y,Tata S,Patel J M,et al.Practical methods for constructing suffix trees[J].VLDB Journal,2005,14(3):281-99
[7] Manber U,Myers G.Suffix arrays:A new method for on-linestring searches[J].SIAM Journal on Computing,1993,22(5):935-948
[8] Navarro G,Baeza-yates R.A practical q-gram index for text retrieval allowing errors[J].CLEI Electronic Journal,1998,1(2):1-16
[9] Kim M S,Whang K Y,Lee J G.n-Gram/2L- approximation:A two-level n-gram inverted index structure for approximate string matching[J].Computer Systems Science and Engineering,2007,22(6):365-379
[10] Puglisi S J,Smyth W F,Turpin A,Inverted files versus suffix arrays for locating patterns in primary memory[C]∥Procee-dings of the 13th Symposium on String Processing and Information Retrieval.Berlin,Germany:Springer Verlag,2006:122-133
[11] Altschul S F,Gish W,Miller W,et al.Basic local alignment search tool[J].Journal of Molecular Biology,1990,215(3):403-410
[12] Altschul S F,Madden T L,Alejandro A S,et al.Gapped BLAST and PSI-BLAST:a new generation of protein database search programs[J].Nucleic Acids Research,1997,25(17):3389-3402
[13] Wu S,Manber U.Fast text searching allowing errors[J].Communications of the ACM,1992,35(10):83-91
[14] Chang Y I,Chen J R,Hsu M T.A hash trie filter method for approximate string matching in genomic databases[J].Applied Intelligence,2010,33(1):21-38
[15] Burkhardt S,Crauser A,Ferragina P,et al.Q-gram based database searching using a suffix array[C]∥Proceedings of the Annual International Conference on Computational Molecular Bio-logy,RECOMB 99.New York,USA:ACM,1999:77-83
[16] Jokinen P,Ukkonen E.Two algorithms for approximate string matching in static texts[C]∥16th International Symposium Proceedings on Mathematical Foundations of Computer Science.Berlin,Germany:Springer-Verlag,1991:240-248
[17] Rasmussen KR,Stoye J,Myers EW.Efficient q-gram filters for finding all epsilon-matches over a given length[J].Journal of Computational Biology,2006,13(2):296-308
[18] Sutinen E,Tarhio J.Filtration with q-samples in approximatestring matching[C]∥Proceedings of 7th Annual Symposium on Combinatorial Pattern Matching,CPM 96.Berlin,Germany:Springer-Verlag,1996:50-63
[19] Ma B,Tromp J,Li M,et al.PatternHunter:faster and more sensitive homology search[J].BIOINFORMATICS,2002,18(3):440-445
[20] Egidi L,Manzini G.Better spaced seeds using Quadratic Residues[J].Journal of Computer and System Sciences,2013,79(7):1144-1155
[21] Baeza-Yates R,Navarro G.Faster approximate string matching[J].Algorithmica,1999,23(2):127-158
[22] Myers E W.A sublinear algorithm for approximate keywordsearching[J].Algorithmica,1994,12(4/5):345-374
[23] Ilie S.Efficient computation of spaced seeds[J].BMC Research Notes,2012,5(1):1-7
[24] Kim Y,Park H,Shim K.Efficient processing of substring match queries with inverted variable-length gram indexes[J].Information Sciences,2013,244:119-141
[25] Yang X,Wang B,Li C.Cost-based variable-length-gram selection for string collections to support approximate queries efficiently[C]∥Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data.New York,USA:ACM,2008:353-364
[26] Karp R M,Rabin M O.Efficient randomized pattern-matchingalgorithms[J].IBM Journal of Research and Development,1987,31(2):249-260
[27] NCBI.UniGene Build #223,Homo sapiens[DB/OL].(2010-1-27)[2010-04-28].ftp.ncbi.nih.gov//repository/ UniGene/Homo_sapiens/Hs.seq.uniq.gz.

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!