计算机科学 ›› 2017, Vol. 44 ›› Issue (2): 270-274.doi: 10.11896/j.issn.1002-137X.2017.02.045

• 人工智能 • 上一篇    下一篇

基于格代数的最长公共子序列近似求解

孙焘,朱晓明   

  1. 大连理工大学创新创业学院 辽宁116024,大连理工大学创新创业学院 辽宁116024
  • 出版日期:2018-11-13 发布日期:2018-11-13

Computing Longest Common Subsequences Approximately Based on Lattice

SUN Tao and ZHU Xiao-ming   

  • Online:2018-11-13 Published:2018-11-13

摘要: 多条序列的最长公共子序列可以代表多条序列的公共信息,其在诸多领域里有着重要的应用,如信息检索、基因序列匹配等。求解多条序列的最长公共子序列是著名的NP难问题,本质为多解问题。一些近似算法虽然时间复杂度较低,但只能求出单解,对于有多解的序列集合,求得的结果信息量损失较大。因此提出一个新的近似算法来解决最长公共子序列问题。算法引入了代数结构“格”,通过动态规划求解出两条序列的公共格,并递归求解当前格与当前序列的公共格。公共格中的路径保存了多条公共子序列使得最终求解出的最长公共子序列为多个。对算法的相关定理给出了理论证明,并通过实验验证了算法的正确性。

关键词: 最长公共子序列,格,近似算法,贪心算法

Abstract: The longest common subsequences can represent the common information of a sequence set,and it has many important applications in many fields,such as computational genomics,information retrieval and so on.Computing longest common subsequences is a famous NP-hard problem with multiple solutions.Some approximate algorithms have a low time complexity,but the result set only has one subsequence.For the sequence set with many longest common subsequences,the information loss is overmuch.In this paper,we presented a new approximate algorithm for this problem.Our algorithm employs a specific mathematical structure called lattice.Firstly,the common lattice of two sequences is computed through dynamic programming and then the common lattice of the current lattice and the current sequence recursively combined with the greedy algorithm are also computed.As the paths in a common lattice contain many common subsequences,the result set contains many longest common subsequences of the original sequence set.And the validity of our algorithm is proved through theory and experiment.

Key words: Longest common subsequences (LCS),Lattice,Approximate algorithm,Greedy algorithm

[1] MAIER D.The complexity of some problem on subsequencesand superseuqences[J].Journal of the ACM (JACM), 1978,25 (2):322-366.
[2] BERGROTH L,HAKONEN H,RAITA T.A survey of longest common subsequence algorithms[C]∥Processings.Seventh International Symposium on String Processing and Information Retrieval,2000(SPIRE 2000).IEEE,2000:39-48.
[3] ATTWOOD T,FINDLAY J.Fingerprinting g-protein-coupledreceptors[J].Protein engineering,1994,7(2):195-203.
[4] SANKOFF D,BLANCHETTE M.Phylogenetic invariants forgenome rearrangements[J].Journal of Computational Biology,1999,6(3/4):431-445.
[5] SHUKLA A,AGARWAL S.A relative position based algorithm to find out the longest common subsequence from multiple biological sequences[C]∥Proceedings of the 2010 International Conference on Computer and Communication Technology (ICCCT).2010:496-502.
[6] RIZVI S,AGARWAL P.A new index-based parallel algorithm for finding longest common subsequence in multiple dna sequences[C]∥International Conference in Cognitive Systems.Citeseer,2005.
[7] HAKATA K,IMAI H.Algorithms for the longest commonsubsequence problem for multiple strings based on geometric maxima[J].Optimization Methods and Software,1998,10(2):233-260.
[8] BOURQUE G,PEVZNER P A.Genome-scale evolution:reconstructing gene orders in the ancestral species[J].Genome research,2002,12(1):26-36.
[9] SHERIDAN R P,VENKATARAGHAVAN R.A systematicsearch for protein signature sequences[J].Proteins-Structure Function and Bioinformatics,1992,14(1):16-28.

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!