计算机科学 ›› 2014, Vol. 41 ›› Issue (9): 279-284.doi: 10.11896/j.issn.1002-137X.2014.09.053
孙德才,王晓霞
SUN De-cai and WANG Xiao-xia
摘要: 如何在大型文本库中快速找出给定串的近似串是大数据时代要解决的关键问题。基于多种子的近似串匹配算法因匹配速度快而得到众多学者的青睐,但巨大的索引空间消耗也使其难以处理大型文本库。提出了一种支持多种子的q-gram索引结构,通过该索引能够快速地计算出给定任意长度连续种子的地址集合,解决了多种子近似串匹配算法中种子的数目和长度受存储空间限制的问题。实验数据显示,新索引方案成倍地减少了存储空间的消耗。实验结果表明,提出的索引方案在大数据环境下的多种子近似匹配中具有一定的优势。
[1] Dorneles C F,Goncalves R,Mello R D.Approximate data instance matching:a survey[J].Knowledge and Information Systems,2011,7(1):1-21 [2] Navarro G.A guided tour to approximate string matching[J].ACM Computing Surveys,2001,33(1):31-88 [3] Lu C W,Lu C L,Lee R C T.A new filtration method and a hybrid strategy for approximate string matching[J]. Springer Berlin Heidelberg,2013,20:143-155 [4] Levenshtein V.Binary codes capable of correcting deletions,insertions,and reversals[J].Soviet Physics Doklady,1966,10(8):707-710 [5] Burkhardt S.Filter algorithms for approximate string matching[D].Saarland:Department of Computer Science,Saarland University,2002 [6] Tian Y,Tata S,Patel J M,et al.Practical methods for constructing suffix trees[J].VLDB Journal,2005,14(3):281-99 [7] Manber U,Myers G.Suffix arrays:A new method for on-linestring searches[J].SIAM Journal on Computing,1993,22(5):935-948 [8] Navarro G,Baeza-yates R.A practical q-gram index for text retrieval allowing errors[J].CLEI Electronic Journal,1998,1(2):1-16 [9] Kim M S,Whang K Y,Lee J G.n-Gram/2L- approximation:A two-level n-gram inverted index structure for approximate string matching[J].Computer Systems Science and Engineering,2007,22(6):365-379 [10] Puglisi S J,Smyth W F,Turpin A,Inverted files versus suffix arrays for locating patterns in primary memory[C]∥Procee-dings of the 13th Symposium on String Processing and Information Retrieval.Berlin,Germany:Springer Verlag,2006:122-133 [11] Altschul S F,Gish W,Miller W,et al.Basic local alignment search tool[J].Journal of Molecular Biology,1990,215(3):403-410 [12] Altschul S F,Madden T L,Alejandro A S,et al.Gapped BLAST and PSI-BLAST:a new generation of protein database search programs[J].Nucleic Acids Research,1997,25(17):3389-3402 [13] Wu S,Manber U.Fast text searching allowing errors[J].Communications of the ACM,1992,35(10):83-91 [14] Chang Y I,Chen J R,Hsu M T.A hash trie filter method for approximate string matching in genomic databases[J].Applied Intelligence,2010,33(1):21-38 [15] Burkhardt S,Crauser A,Ferragina P,et al.Q-gram based database searching using a suffix array[C]∥Proceedings of the Annual International Conference on Computational Molecular Bio-logy,RECOMB 99.New York,USA:ACM,1999:77-83 [16] Jokinen P,Ukkonen E.Two algorithms for approximate string matching in static texts[C]∥16th International Symposium Proceedings on Mathematical Foundations of Computer Science.Berlin,Germany:Springer-Verlag,1991:240-248 [17] Rasmussen KR,Stoye J,Myers EW.Efficient q-gram filters for finding all epsilon-matches over a given length[J].Journal of Computational Biology,2006,13(2):296-308 [18] Sutinen E,Tarhio J.Filtration with q-samples in approximatestring matching[C]∥Proceedings of 7th Annual Symposium on Combinatorial Pattern Matching,CPM 96.Berlin,Germany:Springer-Verlag,1996:50-63 [19] Ma B,Tromp J,Li M,et al.PatternHunter:faster and more sensitive homology search[J].BIOINFORMATICS,2002,18(3):440-445 [20] Egidi L,Manzini G.Better spaced seeds using Quadratic Residues[J].Journal of Computer and System Sciences,2013,79(7):1144-1155 [21] Baeza-Yates R,Navarro G.Faster approximate string matching[J].Algorithmica,1999,23(2):127-158 [22] Myers E W.A sublinear algorithm for approximate keywordsearching[J].Algorithmica,1994,12(4/5):345-374 [23] Ilie S.Efficient computation of spaced seeds[J].BMC Research Notes,2012,5(1):1-7 [24] Kim Y,Park H,Shim K.Efficient processing of substring match queries with inverted variable-length gram indexes[J].Information Sciences,2013,244:119-141 [25] Yang X,Wang B,Li C.Cost-based variable-length-gram selection for string collections to support approximate queries efficiently[C]∥Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data.New York,USA:ACM,2008:353-364 [26] Karp R M,Rabin M O.Efficient randomized pattern-matchingalgorithms[J].IBM Journal of Research and Development,1987,31(2):249-260 [27] NCBI.UniGene Build #223,Homo sapiens[DB/OL].(2010-1-27)[2010-04-28].ftp.ncbi.nih.gov//repository/ UniGene/Homo_sapiens/Hs.seq.uniq.gz. |
No related articles found! |
|