计算机科学 ›› 2015, Vol. 42 ›› Issue (5): 114-118.doi: 10.11896/j.issn.1002-137X.2015.05.023

• 2014' 数据挖掘会议 • 上一篇    下一篇

规范化相似度的符号序列层次聚类

张 豪,陈黎飞,郭躬德   

  1. 福建师范大学数学与计算机科学学院福建省网络安全与密码技术重点 实验室 福州350007,福建师范大学数学与计算机科学学院福建省网络安全与密码技术重点 实验室 福州350007,福建师范大学数学与计算机科学学院福建省网络安全与密码技术重点 实验室 福州350007
  • 出版日期:2018-11-14 发布日期:2018-11-14
  • 基金资助:
    本文受国家自然科学基金(61175123),深圳市基础研究(重点)项目(JCYJ20120617120716224)资助

Hierarchical Clustering of Categorical Sequences by Similarity Normalization

ZHANG Hao, CHEN Li-fei and GUO Gong-de   

  • Online:2018-11-14 Published:2018-11-14

摘要: 符号序列由有限个符号按一定顺序排列而成,广泛存在于数据挖掘的许多应用领域,如基因序列、蛋白质序列和语音序列等。作为序列挖掘的一种主要方法,序列聚类分析在识别序列数据内在结构等方面具有重要的应用价值;同时,由于符号序列间相似性度量较为困难,序列聚类也是当前的一项开放性难题。首先提出一种新的符号序列相似度度量,引入长度规范因子解决现有度量对序列长度敏感的问题,从而提高了符号序列相似度度量的有效性。在此基础上,提出一种新的聚类方法,根据样本相似度构建无回路连通图,通过图划分进行符号序列的层次聚类。在多个实际数据集上的实验结果表明,采用规范化度量的新方法可以有效提高符号序列的聚类精度。

关键词: 符号序列,聚类,相似度,规范化因子

Abstract: A categorical sequence is composed of finite symbols which are arranged in a certain order.Nowadays,categorical sequences,such as gene sequences,protein sequences,and speech sequences,etc.,widely exist in many application domains of data mining.As a major method for sequence data mining,sequence clustering has a great value in identifying the intrinsic structural of sequence data,while it is also an open problem due to the difficulties in measuring the similarity between sequences.This paper proposed a new similarity measure for categorical sequences,and introduced a length-normalization factor to address the problem that the existing methods are sensitive to the sequences length,and to improve the effectiveness of measuring sequences similarity.Based on the new similarity measure,a new clustering method was proposed,where directed acyclic graphs are constructed according to the similarity between samples and a hierarchical clustering of categorical sequences is performed by graph partitioning.Experimental results on real-world datasets show that the new methods based on the normalized similarity measure are able to improve the clustering accuracy significantly.

Key words: Categorical sequence,Clustering,Similarity,Normalized variant

[1] Xiong T,Wang S,Jiang Q,et al.A new Markov model for clustering categorical sequences[C]∥Proceedings of the International Conference on Data Mining (ICDM).2011:854-863
[2] Dong Guo-zhu,Pei Jian.Sequence Data Mining[M].New York:Springer-Verlag New York Inc.,2007:1-65
[3] 孙吉贵,刘杰,赵连宇.聚类算法研究[J].软件学报,2008,19(1):48-61
[4] Kondrak G.N-gram similarity and distance[C]∥String Proces-sing and Information Retrieval.2005:115-126
[5] Ron D,Singer Y,Tishby N.The power of amnesia:Learningprobabilistic automata with variable memory length[J].Machine learning,1996,25(2/3):117-149
[6] Kelil A,Wang S,Brzezinski R,et al.CLUSS:Clustering of protein sequences based on a new similarity measure[J].BMC bioinformatics,2007,8(1):286
[7] Kelil A,Wang S.SCS:A new similarity measure for categorical sequences[C]∥International Conference on Data Mining.2008:343-352
[8] Alpaydin E.机器学习导论[M].范明,等译.北京:机械工业出版社,2009:95-96
[9] Grossi R,Vitter J.Compressed suffix arrays and suffix treeswith applications to text indexing and string matching[C]∥Proc.of ACM STOC.2000:397-406
[10] Gusfield D.Algorithms on strings,trees,and sequences[J].ACM SIGACT News,1997,28(4):41-60
[11] Ukkonen E.On-line construction of suffix trees[J].Algorithmica,1995,14(3):249-260
[12] Yang J,Wang W.CLUSEQ:Efficient and effective sequenceclustering[C]∥International Conference on Data Engineering.IEEE,2003:101-112
[13] Hirschberg D S.Pattern Matching Algorithms[M].OxfordUniv.Press,London,1997:123-142
[14] Karlin S,Ghandour G.Comparative statistics for DNA and protein sequences:single sequence analysis[J].Proceedings of the National Academy of Sciences,1985,82(17):5800-5804
[15] Melamed I D.Bitext maps and alignment via pattern recognition[J].Computational Linguistics,1999,25(1):107-130
[16] Wei D,Jiang Q,Wei Y,et al.A novel hierarchical clustering algorithm for gene sequences[J].BMC bioinformatics,2012,13(1):174
[17] Halkidi M,Batistakis Y,Vazrgiannis M.On clustering validation techniques[ J].Intelligent In formation Systems,2001,7(2/3):107-145
[18] Larsen B,Aone C.Fast and effective text mining using linear-time document clustering[C]∥Proceedings of the fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.ACM,1999:16-22

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!