一种支持多种子近似串匹配的q-gram索引

doi:10.11896/j.issn.1002-137X.2014.09.053

Abstract

Abstract: How to find out all approximate strings of a given string from a big text database quickly is a key issue in the age of big data．Approximate string matching algorithm based on multi-seeds is researched for its fast searching speed．But it is difficult to process large text database due to its huge memory consumption．Here,a new q-gram index was proposed to solve the problem of multi-seeds used in approximate string matching algorithms．In the proposed index,the addresses of consecutive seed with arbitrary length can be computed out quickly．The experimental results demon-strate that the space consumption is decreased largely．As a result,the proposed index is of great practicality to deal with large database in the age of big data.

Key words: Big data,Approximate string matching,Seed,Q-gram index,Multi-seeds index

SUN De-cai and WANG Xiao-xia. Q-gram Index for Approximate String Matching with Multi-seeds[J].Computer Science, 2014, 41(9): 279-284.

References

[1] Dorneles C F,Goncalves R,Mello R D．Approximate data instance matching:a survey[J]．Knowledge and Information Systems,2011,7(1):1-21
[2] Navarro G．A guided tour to approximate string matching[J]．ACM Computing Surveys,2001,33(1):31-88
[3] Lu C W,Lu C L,Lee R C T．A new filtration method and a hybrid strategy for approximate string matching[J]． Springer Berlin Heidelberg,2013,20:143-155
[4] Levenshtein V．Binary codes capable of correcting deletions,insertions,and reversals[J]．Soviet Physics Doklady,1966,10(8):707-710
[5] Burkhardt S．Filter algorithms for approximate string matching[D]．Saarland:Department of Computer Science,Saarland University,2002
[6] Tian Y,Tata S,Patel J M,et al．Practical methods for constructing suffix trees[J]．VLDB Journal,2005,14(3):281-99
[7] Manber U,Myers G．Suffix arrays:A new method for on-linestring searches[J]．SIAM Journal on Computing,1993,22(5):935-948
[8] Navarro G,Baeza-yates R．A practical q-gram index for text retrieval allowing errors[J]．CLEI Electronic Journal,1998,1(2):1-16
[9] Kim M S,Whang K Y,Lee J G．n-Gram/2L- approximation:A two-level n-gram inverted index structure for approximate string matching[J]．Computer Systems Science and Engineering,2007,22(6):365-379
[10] Puglisi S J,Smyth W F,Turpin A,Inverted files versus suffix arrays for locating patterns in primary memory[C]∥Procee-dings of the 13th Symposium on String Processing and Information Retrieval．Berlin,Germany:Springer Verlag,2006:122-133
[11] Altschul S F,Gish W,Miller W,et al．Basic local alignment search tool[J]．Journal of Molecular Biology,1990,215(3):403-410
[12] Altschul S F,Madden T L,Alejandro A S,et al．Gapped BLAST and PSI-BLAST:a new generation of protein database search programs[J]．Nucleic Acids Research,1997,25(17):3389-3402
[13] Wu S,Manber U．Fast text searching allowing errors[J]．Communications of the ACM,1992,35(10):83-91
[14] Chang Y I,Chen J R,Hsu M T．A hash trie filter method for approximate string matching in genomic databases[J]．Applied Intelligence,2010,33(1):21-38
[15] Burkhardt S,Crauser A,Ferragina P,et al．Q-gram based database searching using a suffix array[C]∥Proceedings of the Annual International Conference on Computational Molecular Bio-logy,RECOMB 99．New York,USA:ACM,1999:77-83
[16] Jokinen P,Ukkonen E．Two algorithms for approximate string matching in static texts[C]∥16th International Symposium Proceedings on Mathematical Foundations of Computer Science．Berlin,Germany:Springer-Verlag,1991:240-248
[17] Rasmussen KR,Stoye J,Myers EW．Efficient q-gram filters for finding all epsilon-matches over a given length[J]．Journal of Computational Biology,2006,13(2):296-308
[18] Sutinen E,Tarhio J．Filtration with q-samples in approximatestring matching[C]∥Proceedings of 7th Annual Symposium on Combinatorial Pattern Matching,CPM 96．Berlin,Germany:Springer-Verlag,1996:50-63
[19] Ma B,Tromp J,Li M,et al．PatternHunter:faster and more sensitive homology search[J]．BIOINFORMATICS,2002,18(3):440-445
[20] Egidi L,Manzini G．Better spaced seeds using Quadratic Residues[J]．Journal of Computer and System Sciences,2013,79(7):1144-1155
[21] Baeza-Yates R,Navarro G．Faster approximate string matching[J]．Algorithmica,1999,23(2):127-158
[22] Myers E W．A sublinear algorithm for approximate keywordsearching[J]．Algorithmica,1994,12(4/5):345-374
[23] Ilie S．Efficient computation of spaced seeds[J]．BMC Research Notes,2012,5(1):1-7
[24] Kim Y,Park H,Shim K．Efficient processing of substring match queries with inverted variable-length gram indexes[J]．Information Sciences,2013,244:119-141
[25] Yang X,Wang B,Li C．Cost-based variable-length-gram selection for string collections to support approximate queries efficiently[C]∥Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data．New York,USA:ACM,2008:353-364
[26] Karp R M,Rabin M O．Efficient randomized pattern-matchingalgorithms[J]．IBM Journal of Research and Development,1987,31(2):249-260
[27] NCBI．UniGene Build #223,Homo sapiens[DB/OL]．(2010-1-27)[2010-04-28]．ftp.ncbi.nih.gov//repository/ UniGene/Homo_sapiens/Hs.seq.uniq.gz．

Metrics

Viewed

Full text

Abstract

Cited

Shared

Discussed

Comments

Recommended 0

No Suggested Reading articles found!

Q-gram Index for Approximate String Matching with Multi-seeds

PDF (PC)

Abstract

Cite this article

share this article

References

Related Articles 0

Metrics

Comments

Recommended 0