计算机科学 ›› 2015, Vol. 42 ›› Issue (5): 114-118.doi: 10.11896/j.issn.1002-137X.2015.05.023
张 豪,陈黎飞,郭躬德
ZHANG Hao, CHEN Li-fei and GUO Gong-de
摘要: 符号序列由有限个符号按一定顺序排列而成,广泛存在于数据挖掘的许多应用领域,如基因序列、蛋白质序列和语音序列等。作为序列挖掘的一种主要方法,序列聚类分析在识别序列数据内在结构等方面具有重要的应用价值;同时,由于符号序列间相似性度量较为困难,序列聚类也是当前的一项开放性难题。首先提出一种新的符号序列相似度度量,引入长度规范因子解决现有度量对序列长度敏感的问题,从而提高了符号序列相似度度量的有效性。在此基础上,提出一种新的聚类方法,根据样本相似度构建无回路连通图,通过图划分进行符号序列的层次聚类。在多个实际数据集上的实验结果表明,采用规范化度量的新方法可以有效提高符号序列的聚类精度。
[1] Xiong T,Wang S,Jiang Q,et al.A new Markov model for clustering categorical sequences[C]∥Proceedings of the International Conference on Data Mining (ICDM).2011:854-863 [2] Dong Guo-zhu,Pei Jian.Sequence Data Mining[M].New York:Springer-Verlag New York Inc.,2007:1-65 [3] 孙吉贵,刘杰,赵连宇.聚类算法研究[J].软件学报,2008,19(1):48-61 [4] Kondrak G.N-gram similarity and distance[C]∥String Proces-sing and Information Retrieval.2005:115-126 [5] Ron D,Singer Y,Tishby N.The power of amnesia:Learningprobabilistic automata with variable memory length[J].Machine learning,1996,25(2/3):117-149 [6] Kelil A,Wang S,Brzezinski R,et al.CLUSS:Clustering of protein sequences based on a new similarity measure[J].BMC bioinformatics,2007,8(1):286 [7] Kelil A,Wang S.SCS:A new similarity measure for categorical sequences[C]∥International Conference on Data Mining.2008:343-352 [8] Alpaydin E.机器学习导论[M].范明,等译.北京:机械工业出版社,2009:95-96 [9] Grossi R,Vitter J.Compressed suffix arrays and suffix treeswith applications to text indexing and string matching[C]∥Proc.of ACM STOC.2000:397-406 [10] Gusfield D.Algorithms on strings,trees,and sequences[J].ACM SIGACT News,1997,28(4):41-60 [11] Ukkonen E.On-line construction of suffix trees[J].Algorithmica,1995,14(3):249-260 [12] Yang J,Wang W.CLUSEQ:Efficient and effective sequenceclustering[C]∥International Conference on Data Engineering.IEEE,2003:101-112 [13] Hirschberg D S.Pattern Matching Algorithms[M].OxfordUniv.Press,London,1997:123-142 [14] Karlin S,Ghandour G.Comparative statistics for DNA and protein sequences:single sequence analysis[J].Proceedings of the National Academy of Sciences,1985,82(17):5800-5804 [15] Melamed I D.Bitext maps and alignment via pattern recognition[J].Computational Linguistics,1999,25(1):107-130 [16] Wei D,Jiang Q,Wei Y,et al.A novel hierarchical clustering algorithm for gene sequences[J].BMC bioinformatics,2012,13(1):174 [17] Halkidi M,Batistakis Y,Vazrgiannis M.On clustering validation techniques[ J].Intelligent In formation Systems,2001,7(2/3):107-145 [18] Larsen B,Aone C.Fast and effective text mining using linear-time document clustering[C]∥Proceedings of the fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.ACM,1999:16-22 |
No related articles found! |
|