一种支持多种子近似串匹配的q-gram索引

doi:10.11896/j.issn.1002-137X.2014.09.053

计算机科学 ›› 2014, Vol. 41 ›› Issue (9): 279-284.doi: 10.11896/j.issn.1002-137X.2014.09.053

一种支持多种子近似串匹配的q-gram索引

孙德才,王晓霞

渤海大学信息科学与技术学院锦州121013;渤海大学大学计算机教研部锦州121013

出版日期:2018-11-14 发布日期:2018-11-14
基金资助:
本文受2014年辽宁省博士科研启动基金计划(20141138),辽宁省社科联2014年度辽宁经济社会发展立项重点课题(2014lslktzdian-04),国家自然科学基金项目(61173142,6,61202462,1,61173136),辽宁省教育厅一般项目(L2013422)资助

Q-gram Index for Approximate String Matching with Multi-seeds

SUN De-cai and WANG Xiao-xia

Online:2018-11-14 Published:2018-11-14

摘要/Abstract

摘要： 如何在大型文本库中快速找出给定串的近似串是大数据时代要解决的关键问题。基于多种子的近似串匹配算法因匹配速度快而得到众多学者的青睐,但巨大的索引空间消耗也使其难以处理大型文本库。提出了一种支持多种子的q-gram索引结构,通过该索引能够快速地计算出给定任意长度连续种子的地址集合,解决了多种子近似串匹配算法中种子的数目和长度受存储空间限制的问题。实验数据显示,新索引方案成倍地减少了存储空间的消耗。实验结果表明,提出的索引方案在大数据环境下的多种子近似匹配中具有一定的优势。

关键词: 大数据,近似串匹配,种子,q-gram索引,多种子索引

Abstract: How to find out all approximate strings of a given string from a big text database quickly is a key issue in the age of big data．Approximate string matching algorithm based on multi-seeds is researched for its fast searching speed．But it is difficult to process large text database due to its huge memory consumption．Here,a new q-gram index was proposed to solve the problem of multi-seeds used in approximate string matching algorithms．In the proposed index,the addresses of consecutive seed with arbitrary length can be computed out quickly．The experimental results demon-strate that the space consumption is decreased largely．As a result,the proposed index is of great practicality to deal with large database in the age of big data.

Key words: Big data,Approximate string matching,Seed,Q-gram index,Multi-seeds index

孙德才,王晓霞. 一种支持多种子近似串匹配的q-gram索引[J]. 计算机科学, 2014, 41(9): 279-284. https://doi.org/10.11896/j.issn.1002-137X.2014.09.053

SUN De-cai and WANG Xiao-xia. Q-gram Index for Approximate String Matching with Multi-seeds[J]. Computer Science, 2014, 41(9): 279-284. https://doi.org/10.11896/j.issn.1002-137X.2014.09.053

参考文献

[1] Dorneles C F,Goncalves R,Mello R D．Approximate data instance matching:a survey[J]．Knowledge and Information Systems,2011,7(1):1-21
[2] Navarro G．A guided tour to approximate string matching[J]．ACM Computing Surveys,2001,33(1):31-88
[3] Lu C W,Lu C L,Lee R C T．A new filtration method and a hybrid strategy for approximate string matching[J]． Springer Berlin Heidelberg,2013,20:143-155
[4] Levenshtein V．Binary codes capable of correcting deletions,insertions,and reversals[J]．Soviet Physics Doklady,1966,10(8):707-710
[5] Burkhardt S．Filter algorithms for approximate string matching[D]．Saarland:Department of Computer Science,Saarland University,2002
[6] Tian Y,Tata S,Patel J M,et al．Practical methods for constructing suffix trees[J]．VLDB Journal,2005,14(3):281-99
[7] Manber U,Myers G．Suffix arrays:A new method for on-linestring searches[J]．SIAM Journal on Computing,1993,22(5):935-948
[8] Navarro G,Baeza-yates R．A practical q-gram index for text retrieval allowing errors[J]．CLEI Electronic Journal,1998,1(2):1-16
[9] Kim M S,Whang K Y,Lee J G．n-Gram/2L- approximation:A two-level n-gram inverted index structure for approximate string matching[J]．Computer Systems Science and Engineering,2007,22(6):365-379
[10] Puglisi S J,Smyth W F,Turpin A,Inverted files versus suffix arrays for locating patterns in primary memory[C]∥Procee-dings of the 13th Symposium on String Processing and Information Retrieval．Berlin,Germany:Springer Verlag,2006:122-133
[11] Altschul S F,Gish W,Miller W,et al．Basic local alignment search tool[J]．Journal of Molecular Biology,1990,215(3):403-410
[12] Altschul S F,Madden T L,Alejandro A S,et al．Gapped BLAST and PSI-BLAST:a new generation of protein database search programs[J]．Nucleic Acids Research,1997,25(17):3389-3402
[13] Wu S,Manber U．Fast text searching allowing errors[J]．Communications of the ACM,1992,35(10):83-91
[14] Chang Y I,Chen J R,Hsu M T．A hash trie filter method for approximate string matching in genomic databases[J]．Applied Intelligence,2010,33(1):21-38
[15] Burkhardt S,Crauser A,Ferragina P,et al．Q-gram based database searching using a suffix array[C]∥Proceedings of the Annual International Conference on Computational Molecular Bio-logy,RECOMB 99．New York,USA:ACM,1999:77-83
[16] Jokinen P,Ukkonen E．Two algorithms for approximate string matching in static texts[C]∥16th International Symposium Proceedings on Mathematical Foundations of Computer Science．Berlin,Germany:Springer-Verlag,1991:240-248
[17] Rasmussen KR,Stoye J,Myers EW．Efficient q-gram filters for finding all epsilon-matches over a given length[J]．Journal of Computational Biology,2006,13(2):296-308
[18] Sutinen E,Tarhio J．Filtration with q-samples in approximatestring matching[C]∥Proceedings of 7th Annual Symposium on Combinatorial Pattern Matching,CPM 96．Berlin,Germany:Springer-Verlag,1996:50-63
[19] Ma B,Tromp J,Li M,et al．PatternHunter:faster and more sensitive homology search[J]．BIOINFORMATICS,2002,18(3):440-445
[20] Egidi L,Manzini G．Better spaced seeds using Quadratic Residues[J]．Journal of Computer and System Sciences,2013,79(7):1144-1155
[21] Baeza-Yates R,Navarro G．Faster approximate string matching[J]．Algorithmica,1999,23(2):127-158
[22] Myers E W．A sublinear algorithm for approximate keywordsearching[J]．Algorithmica,1994,12(4/5):345-374
[23] Ilie S．Efficient computation of spaced seeds[J]．BMC Research Notes,2012,5(1):1-7
[24] Kim Y,Park H,Shim K．Efficient processing of substring match queries with inverted variable-length gram indexes[J]．Information Sciences,2013,244:119-141
[25] Yang X,Wang B,Li C．Cost-based variable-length-gram selection for string collections to support approximate queries efficiently[C]∥Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data．New York,USA:ACM,2008:353-364
[26] Karp R M,Rabin M O．Efficient randomized pattern-matchingalgorithms[J]．IBM Journal of Research and Development,1987,31(2):249-260
[27] NCBI．UniGene Build #223,Homo sapiens[DB/OL]．(2010-1-27)[2010-04-28]．ftp.ncbi.nih.gov//repository/ UniGene/Homo_sapiens/Hs.seq.uniq.gz．

Metrics

Viewed

Full text

Abstract

Cited

Shared

Discussed

一种支持多种子近似串匹配的q-gram索引

Q-gram Index for Approximate String Matching with Multi-seeds

PDF (PC)

摘要/Abstract

引用本文

使用本文

参考文献

相关文章 0

Metrics

本文评价

推荐阅读 0