Computer Science ›› 2015, Vol. 42 ›› Issue (5): 114-118.doi: 10.11896/j.issn.1002-137X.2015.05.023

Previous Articles     Next Articles

Hierarchical Clustering of Categorical Sequences by Similarity Normalization

ZHANG Hao, CHEN Li-fei and GUO Gong-de   

  • Online:2018-11-14 Published:2018-11-14

Abstract: A categorical sequence is composed of finite symbols which are arranged in a certain order.Nowadays,categorical sequences,such as gene sequences,protein sequences,and speech sequences,etc.,widely exist in many application domains of data mining.As a major method for sequence data mining,sequence clustering has a great value in identifying the intrinsic structural of sequence data,while it is also an open problem due to the difficulties in measuring the similarity between sequences.This paper proposed a new similarity measure for categorical sequences,and introduced a length-normalization factor to address the problem that the existing methods are sensitive to the sequences length,and to improve the effectiveness of measuring sequences similarity.Based on the new similarity measure,a new clustering method was proposed,where directed acyclic graphs are constructed according to the similarity between samples and a hierarchical clustering of categorical sequences is performed by graph partitioning.Experimental results on real-world datasets show that the new methods based on the normalized similarity measure are able to improve the clustering accuracy significantly.

Key words: Categorical sequence,Clustering,Similarity,Normalized variant

[1] Xiong T,Wang S,Jiang Q,et al.A new Markov model for clustering categorical sequences[C]∥Proceedings of the International Conference on Data Mining (ICDM).2011:854-863
[2] Dong Guo-zhu,Pei Jian.Sequence Data Mining[M].New York:Springer-Verlag New York Inc.,2007:1-65
[3] 孙吉贵,刘杰,赵连宇.聚类算法研究[J].软件学报,2008,19(1):48-61
[4] Kondrak G.N-gram similarity and distance[C]∥String Proces-sing and Information Retrieval.2005:115-126
[5] Ron D,Singer Y,Tishby N.The power of amnesia:Learningprobabilistic automata with variable memory length[J].Machine learning,1996,25(2/3):117-149
[6] Kelil A,Wang S,Brzezinski R,et al.CLUSS:Clustering of protein sequences based on a new similarity measure[J].BMC bioinformatics,2007,8(1):286
[7] Kelil A,Wang S.SCS:A new similarity measure for categorical sequences[C]∥International Conference on Data Mining.2008:343-352
[8] Alpaydin E.机器学习导论[M].范明,等译.北京:机械工业出版社,2009:95-96
[9] Grossi R,Vitter J.Compressed suffix arrays and suffix treeswith applications to text indexing and string matching[C]∥Proc.of ACM STOC.2000:397-406
[10] Gusfield D.Algorithms on strings,trees,and sequences[J].ACM SIGACT News,1997,28(4):41-60
[11] Ukkonen E.On-line construction of suffix trees[J].Algorithmica,1995,14(3):249-260
[12] Yang J,Wang W.CLUSEQ:Efficient and effective sequenceclustering[C]∥International Conference on Data Engineering.IEEE,2003:101-112
[13] Hirschberg D S.Pattern Matching Algorithms[M].OxfordUniv.Press,London,1997:123-142
[14] Karlin S,Ghandour G.Comparative statistics for DNA and protein sequences:single sequence analysis[J].Proceedings of the National Academy of Sciences,1985,82(17):5800-5804
[15] Melamed I D.Bitext maps and alignment via pattern recognition[J].Computational Linguistics,1999,25(1):107-130
[16] Wei D,Jiang Q,Wei Y,et al.A novel hierarchical clustering algorithm for gene sequences[J].BMC bioinformatics,2012,13(1):174
[17] Halkidi M,Batistakis Y,Vazrgiannis M.On clustering validation techniques[ J].Intelligent In formation Systems,2001,7(2/3):107-145
[18] Larsen B,Aone C.Fast and effective text mining using linear-time document clustering[C]∥Proceedings of the fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.ACM,1999:16-22

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!