计算机科学 ›› 2014, Vol. 41 ›› Issue (6): 243-249.doi: 10.11896/j.issn.1002-137X.2014.06.048

• 人工智能 • 上一篇    下一篇

一种基于尾匹配q-gram的近似串匹配算法

孙德才,王晓霞   

  1. 渤海大学信息科学与技术学院 锦州121013;渤海大学大学计算机教研部 锦州121013
  • 出版日期:2018-11-14 发布日期:2018-11-14
  • 基金资助:
    本文受辽宁省社科联2014年度辽宁经济社会发展立项重点课题(2014lslktzdian-04), 国家自然科学基金项目(61173142,2),辽宁省教育厅一般项目(L2013422,L2012397),辽宁省“百千万人才工程”项目(2012921058)资助

Approximate String Matching Using Tail Matched q-gram

SUN De-cai and WANG Xiao-xia   

  • Online:2018-11-14 Published:2018-11-14

摘要: 近似串匹配是生物信息学、文本检索、信号处理等领域的一个基础问题,如何提高近似串匹配的速度一直都是研究的关键问题。提出一种新的在大文本库中快速查找近似匹配的无损过滤算法。为保证在大文本库中的匹配速度,本算法使用了查询速度较快的q-gram索引。为通过提高过滤算法的过滤效率达到提升算法整体性能的目的,详细分析了含有匹配串的文本区域,提取了一些基于尾匹配q-gram特征的新过滤条件,然后用这些特征优化了过滤算法的过滤标准。实验数据表明,新过滤条件有效地提高了算法的过滤效率,提升了算法的整体性能。结果显示新算法适合各种匹配错误率下的近似匹配,算法的通用性较强。

关键词: 近似串匹配,过滤算法,q-gram过滤,q元语法 中图法分类号TP391.3文献标识码A

Abstract: Approximate string matching is a basic problem in computational biology,text retrieval and signal proces-sing,etc.,and how to improve the matching speed is a key issue up to now.Here,a new lossless q-gram filter was proposed to enhance the performance of finding true matches in a large text database for a given query.Q-gram index is employed as the index structure of large text database for its fast searching speed.To improve new filter’s performance by enhancing its filtration efficiency,some new features based on tail matched q-gram were extracted from text that includes true matches.In new filter,more irrelevant text is eliminated in filtration phase by using a filtration criterion which is enhanced by new extracted features,and the unfiltered text is verified by smith-waterman algorithm to locate the positions of all true matches.The experimental results demonstrate that the proposed filter’s filtration efficiency and performance are both enhanced.As a result,new filter algorithm has strong commonality and is suitable for approxi-mate string matching on condition of various matching error ratio.

Key words: Approximate string matching,Filter algorithm,q-gram filter,q-gram

[1] Dorneles C F,Goncalves R,Mello R D.Approximate data instance matching:a survey[J].Knowledge and Information Systems,2011,7(1):1-21
[2] Navarro G.A guided tour to approximate string matching[J].ACM Computing Surveys,2001,33(1):31-88
[3] Levenshtein V.Binary codes capable of correcting deletions,insertions,and reversals[J].Soviet Physics Doklady,1966,10(8):707-710
[4] Burkhardt S.Filter algorithms for approximate string matching[D].Saarland:Department of Computer Science,Saarland University,2002
[5] Tian Y,Tata S,Patel J M,et al.Practical methods for constructing suffix trees[J].VLDB Journal,2005,14(3):281-99
[6] Manber U,Myers G.Suffix arrays:A new method for on-linestring searches[J].SIAM Journal on Computing,1993,22(5):935-948
[7] Navarro G,Baeza-yates R.A practical q-gram index for text retrieval allowing errors[J].CLEI Electronic Journal,1998,1(2):1-16
[8] Kim M S,Whang K Y,Lee J G.n-Gram/2L-approximation:Atwo-level n-gram inverted index structure for approximate string matching[J].Computer Systems Science and Engineering,200722(6):365-379
[9] Kim Y,Park H,Shim K.Efficient processing of substring match queries with inverted variable-length gram indexes[J].Information Sciences,2013,244(0):119-141
[10] Wu S,Manber U.Fast text searching allowing errors[J].Communications of the ACM,1992,35(10):83-91
[11] Chang Y I,Chen J R,Hsu M T.A hash trie filter method for approximate string matching in genomic databases[J].Applied Intelligence,2010,33(1):21-38
[12] Burkhardt S,Crauser A,Ferragina P,et al.Q-gram based database searching using a suffix array[C]∥Proceedings of the Annual International Conference on Computational Molecular Bio-logy,RECOMB 99.New York,USA:ACM,1999:77-83
[13] Jokinen P,Ukkonen E.Two algorithms for approximate string matching in static texts[C]∥16th International Symposium Proceedings on Mathematical Foundations of Computer Science.Berlin,Germany:Springer-Verlag,1991:240-248
[14] Rasmussen K R,Stoye J,Myers E W.Efficient q-gram filters for finding all epsilon-matches over a given length[J].Journal of Computational Biology,2006,13(2):296-308
[15] Lu C W,Lu C L,Lee R C T.A new filtration method and a hybrid strategy for approximate string matching[J].Theoretical Computer Science,2013,481:9-17
[16] Altschul S F,Gish W,Miller W,et al.Basic local alignmentsearch tool[J].Journal of Molecular Biology,1990,215(3):403-410
[17] Altschul S F,Madden T L,Alejandro A S,et al.Gapped BLAST and PSI-BLAST:a new generation of protein database search programs[J].Nucleic Acids Research,1997,25(17):3389-3402
[18] Sutinen E,Tarhio J.Filtration with q-samples in approximatestring matching[C]∥Proceedings of 7th Annual Symposium on Combinatorial Pattern Matching,CPM 96.Berlin,Germany:Springer-Verlag,1996:50-63
[19] Ma B,Tromp J,Li M,et al.PatternHunter:faster and more sensitive homology search[J].Bioinformatics,2002,18(3):440-445
[20] Egidi L,Manzini G.Better spaced seeds using Quadratic Residues[J].Journal of Computer and System Sciences,2013,79(7):1144-1155
[21] Baeza-Yates R,Navarro G.Faster approximate string matching[J].Algorithmica,1999,23(2):127-158
[22] Myers E W.A sublinear algorithm for approximate keywordsearching[J].Algorithmica,1994,12(4/5):345-374
[23] Sutinen E,Szpankowski W.On the collapse of the q-gram filtra-tion[C]∥Proceedings of the International Conference on FUN with Algorithms.Waterloo,Canada:Carleton Scientific,1998:178-193
[24] Puglisi1S J,Smyth W F,Turpin A.Inverted files versus suffix arrays for locating patterns in primary memory[C]∥Procee-dings of the 13th Symposium on String Processing and Information Retrieval.Berlin,Germany:Springer Verlag,2006:122-133
[25] Karp R M,Rabin M O.Efficient randomized pattern-matchingalgorithms[J].IBM Journal of Research and Development,1987,31(2):249-260
[26] Smith T F,Waterman M S.Identification of common molecular subsequences[J].Journal of Molecular Biology,1981,147(1):195-197
[27] NCBI.UniGene Build #223,Homo sapiens[DB/OL].ftp.ncbi.nih.gov//repository/ UniGene/Homo_sapiens/Hs.seq.uniq.gz,2010-04-28

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!