计算机科学 ›› 2022, Vol. 49 ›› Issue (7): 40-49.doi: 10.11896/jsjkx.210700226

所属专题: 大数据&数据科学 虚拟专题

• 数据库&大数据&数据科学* 上一篇    下一篇

嵌入典型时间序列特征的随机Shapelet森林算法

高振卓, 王志海, 刘海洋   

  1. 北京交通大学计算机与信息技术学院 北京100044
    交通数据分析与挖掘北京市重点实验室 北京100044
  • 收稿日期:2021-07-23 修回日期:2022-02-28 出版日期:2022-07-15 发布日期:2022-07-12
  • 通讯作者: 刘海洋(haiyangliu@bjtu.edu.cn)
  • 作者简介:(19120349@bjtu.edu.cn)
  • 基金资助:
    国家自然科学基金(61771058);北京市自然科学基金(4214067)

Random Shapelet Forest Algorithm Embedded with Canonical Time Series Features

GAO Zhen-zhuo, WANG Zhi-hai, LIU Hai-yang   

  1. School of Computer and Information Technology,Beijing Jiaotong University,Beijing 100044,China
    Beijing Key Laboratory of Traffic Data Analysis and Mining,Beijing 100044,China
  • Received:2021-07-23 Revised:2022-02-28 Online:2022-07-15 Published:2022-07-12
  • About author:GAO Zhen-zhuo,born in 1997,master,is a member of China Computer Federation.His main research interests include data mining and machine learning.
    LIU Hai-yang,born in 1987,Ph.D,lecturer,is a member of China Computer Federation.His main research interests include data mining and so on.
  • Supported by:
    National Natural Science Foundation of China(61771058)and Beijing Natural Science Foundation of China(4214067).

摘要: 近年来,时间序列分类问题的研究受到了广泛关注。先进的时间序列分类方法通常建立在良好的特征表示的基础之上。Shapelet是时间序列中具备鉴别性的子序列,可有效表达时间序列的局部形状特征。然而,高昂的计算成本大大限制了基于Shapelet的时间序列分类方法的实用性。除此之外,传统的Shapelet仅能描述欧氏距离度量下子序列的形状特征,因此极易受到噪声干扰并难以挖掘子序列中蕴含的其他类型的鉴别性信息。为应对上述问题,提出了一种新的时间序列分类算法——嵌入典型时间序列特征的随机Shapelet森林。该算法基于以下3个关键策略:1)随机选取Shapelet并限制Shapelet的作用范围以提高效率;2)在Shapelet中嵌入多个典型时间序列特征以提高算法对不同分类问题的适应性,并弥补随机选取Shapelet带来的精度损失;3)在新的特征表示的基础上构建随机森林分类器以确保算法的泛化能力。112个UCR时间序列数据集上的实验结果表明,本文算法的准确性超越了基于Shapelet精确搜索和Shapelet转换技术的STC算法,以及多个其他类型的先进时间序列分类算法。此外,广泛的实验对比验证了本文算法在效率上的显著优势。

关键词: Shapelet, 典型时间序列特征, 分类, 时间序列, 随机森林

Abstract: In recent years,the research on the classification of time series has attracted more and more attention.Advanced time series classification methods are usually based on great feature representations.Shapelet refers to the discriminative subsequences in time series,which can effectively express the local shape characteristics of time series.However,the high computational cost greatly limits the practicability of the Shapelet-based time series classification methods.In addition,traditional Shapelet can only describe the overall shape characteristics of the subsequence under Euclidean distance metric,so it is easy to be disturbed by noise and is difficult to mine other types of discriminative information contained in the subsequence.To deal with the aforementioned problems,a new time series classification algorithm,named random Shapelet forest embedded with canonical time series features,is proposed in this paper.The proposed algorithm is based on the following three key strategies:1)randomly select Shapelet and limit the scope of Shapelet to improve efficiency;2)embed multiple canonical time series features in Shapelet to improve the adaptability of the algorithm to different classification problems and make up for the accuracy loss caused by the random selection of Shapelet;3)build a random forest classifier based on the new feature representations to ensure the generalization ability of the algorithm.Experimental results on 112 UCR time series datasets show that the proposed algorithm is more accurate than the STC algorithm which is based on Shapelet exact search and the Shapelet transform technique,as well as many other types of state-of-the-art time series classification algorithms.Moreover,extensive experimental comparisons verify the significant advantages of the proposed algorithm in terms of efficiency.

Key words: Canonical time series features, Classification, Random forest, Shapelet, Time series

中图分类号: 

  • TP181
[1]GRABOCKA J,SCHILLING N,WISTUBA M,et al.Learningtime-series shapelets[C]//Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.ACM,2014:392-401.
[2]YAN W H,LI G L.Research on time series classification based on shapelet[J].Computer Science,2019,46(1):29-35.
[3]MIDDLEHURST M,LARGE J,BAGNALL A.The canonical interval forest(CIF) classifier for time series classification[J].arXiv:2008.09172,2020.
[4]LEE W,PARK S,JOO W,et al.Diagnosis prediction via medical context attention networks using deep generative modeling[C]//2018 IEEE International Conference on Data Mining.IEEE,2018:1104-1109.
[5]LUCAS B,SHIFAZ A,PELLETIER C,et al.Proximity forest:an effective and scalable distance-based classifier for time series[J].Data Mining and Knowledge Discovery,2019,33(3):607-635.
[6]FAWAZ H I,FORESTIER G,WEBER J,et al.Deep learningfor time series classification:a review[J].Data Mining and Knowledge Discovery,2019,33(4):917-963.
[7]DAU H A,KEOGH E,KAMGAR K,et al.The UCR Time Series Classification Archive[EB/OL].https://www.cs.ucr.edu/~eamonn/time_series_data_2018.
[8]YE L,KEOGH E.Time series shapelets:a novel technique that allows accurate,interpretable and fast classification[J].Data Mining and Knowledge Discovery,2011,22(1):149-182.
[9]RAKTHANMANON T,KEOGH E.Fast shapelets:a scalablealgorithm for discovering time series shapelets[C]//Proceedings of the 13th SIAM International Conference on Data Mining.SIAM,2013:668-676.
[10]KARLSSON I,PAPAPETROU P,BOSTRÖM H.Generalizedrandom shapelet forests[J].Data Mining and Knowledge Discovery,2016,30(5):1053-1085.
[11]LI G L,YAN W H,WU Z D.Discovering shapelets with key points in time series classification[J].Expert Systems with Applications,2019,132:76-86.
[12]ZHANG Z G,ZHANG H W,WEN Y L,et al.Discriminative extraction of features from time series[J].Neurocomputing,2018,275:2317-2328.
[13]MUEEN A,KEOGH E,YOUNG N.Logical-shapelets:an expressive primitive for time series classification[C]//Proceedings of the 17th ACM SIGKDD International Conference on Know-ledge Discovery and Data Mining.ACM,2011:1154-1162.
[14]DENG H,RUNGER G,TUV E,et al.A time series forest for classification and feature extraction[J].Information Sciences,2013,239:142-153.
[15]LUBBA C H,SETHI S S,KNAUTE P,et al.Catch22:canonical time-series characteristics[J].Data Mining and Knowledge Discovery,2019,33(6):1821-1852.
[16]HILLS J,LINES J,BARANAUSKAS E,et al.Classification of time series by shapelet transformation[J].Data Mining and Knowledge Discovery,2014,28(4):851-881.
[17]BOSTROM A,BAGNALL A.Binary shapelet transform formulticlass time series classification[J/OL].Transactions on Large-Scale Data-and Knowledge-Centered Systems XXXII.Springer,2017:24-46.https://doi.org/10.1007/978-662-55608-5_2.
[18]BAGNALL A,LINES J,VICKERS W,et al.The UEA & UCR Time Series Classification Repository [EB/OL].http://www.timeseriesclassification.com.
[19]WANG Z,YAN W,OATES T.Time series classification from scratch with deep neural networks:a strong baseline[C]//2017 International Joint Conference on Neural Networks.IEEE,2017:1578-1585.
[20]LIN J,KEOGH E,WEI L,et al.Experiencing SAX:a novelsymbolic representation of time series[J].Data Mining and Knowledge Discovery,2007,15(2):107-144.
[21]LINES J,TAYLOR S,BAGNALL A.Time series classification with HIVE-COTE:the hierarchical vote collective of transformation-based ensembles[J].ACM Transactions on Knowledge Discovery from Data,2018,12(5):1-35.
[22]LIN J,KHADE R,LI Y.Rotation-invariant similarity in time series using bag-of-patterns representation[J].Journal of Intelligent Information Systems,2012,39(2):287-315.
[23]SENIN P,MALINCHIK S.SAX-VSM:interpretable time series classification using SAX and vector space model[C]//2013 IEEE 13rd International Conference on Data Mining.IEEE,2013:1175-1180.
[24]SCHÄFER P.The BOSS is concerned with time series classification in the presence of noise[J].Data Mining and Knowledge Discovery,2015,29(6):1505-1530.
[25]SCHÄFER P,LESER U.Fast and accurate time series classification with WEASEL[C]//Proceedings of the 2017 ACM Conference on Information and Knowledge Management.ACM,2017:637-646.
[26]SCHÄFER P,HÖGQVIST M.SFA:a symbolic fourier approximation and index for similarity search in high dimensional datasets[C]//Proceedings of the 15th International Conference on Extending Database Technology.Springer,2012:516-527.
[27]MIDDLEHURST M,VICKERS W,BAGNALL A.Scalable dictionary classifiers for time series classification[C]//Internatio-nal Conference on Intelligent Data Engineering and Automated Learning.Springer,2019:11-19.
[28]ZHANG W,WANG Z H,YUAN J D,et al.Time series discri-minative feature dictionary construction algorithm[J].Ruan Jian Xue Bao:Journal of Software,2020,31(10):3216-3237.
[29]YUAN J D,LIN Q H,ZHANG W,et al.Locally slope-based dynamic time warping for time series classification[C]//Procee-dings of the 28th ACM International Conference on Information and Knowledge Management.ACM,2019:1713-1722.
[30]BAGNALL A,LINES J,BOSTROM A,et al.The great time series classification bake off:A review and experimental evaluation of recent algorithmic advances[J].Data Mining and Knowledge Discovery,2017,31(3):606-660.
[31]LINES J,BAGNALL A.Time series classification with ensembles of elastic distance measures[J].Data Mining and Know-ledge Discovery,2015,29(3):565-592.
[32]BAGNALL A,LINES J,HILLS J,et al.Time-series classification with COTE:the collective of transformation-based ensembles[J].IEEE Transactions on Knowledge and Data Enginee-ring,2015,27(9):2522-2535.
[33]FULCHER B D,JONES N S.Hctsa:a computational framework for automated time-series phenotyping using massive feature extraction[J].Cell Systems,2017,5(5):527-531.
[34]DEMŠAR J.Statistical comparisons of classifiers over multiple datasets[J].Journal of Machine Learning Research,2006,7(1):1-30.
[1] 陈志强, 韩萌, 李慕航, 武红鑫, 张喜龙.
数据流概念漂移处理方法研究综述
Survey of Concept Drift Handling Methods in Data Streams
计算机科学, 2022, 49(9): 14-32. https://doi.org/10.11896/jsjkx.210700112
[2] 周旭, 钱胜胜, 李章明, 方全, 徐常胜.
基于对偶变分多模态注意力网络的不完备社会事件分类方法
Dual Variational Multi-modal Attention Network for Incomplete Social Event Classification
计算机科学, 2022, 49(9): 132-138. https://doi.org/10.11896/jsjkx.220600022
[3] 郝志荣, 陈龙, 黄嘉成.
面向文本分类的类别区分式通用对抗攻击方法
Class Discriminative Universal Adversarial Attack for Text Classification
计算机科学, 2022, 49(8): 323-329. https://doi.org/10.11896/jsjkx.220200077
[4] 武红鑫, 韩萌, 陈志强, 张喜龙, 李慕航.
监督和半监督学习下的多标签分类综述
Survey of Multi-label Classification Based on Supervised and Semi-supervised Learning
计算机科学, 2022, 49(8): 12-25. https://doi.org/10.11896/jsjkx.210700111
[5] 檀莹莹, 王俊丽, 张超波.
基于图卷积神经网络的文本分类方法研究综述
Review of Text Classification Methods Based on Graph Convolutional Network
计算机科学, 2022, 49(8): 205-216. https://doi.org/10.11896/jsjkx.210800064
[6] 闫佳丹, 贾彩燕.
基于双图神经网络信息融合的文本分类方法
Text Classification Method Based on Information Fusion of Dual-graph Neural Network
计算机科学, 2022, 49(8): 230-236. https://doi.org/10.11896/jsjkx.210600042
[7] 杨炳新, 郭艳蓉, 郝世杰, 洪日昌.
基于数据增广和模型集成策略的图神经网络在抑郁症识别上的应用
Application of Graph Neural Network Based on Data Augmentation and Model Ensemble in Depression Recognition
计算机科学, 2022, 49(7): 57-63. https://doi.org/10.11896/jsjkx.210800070
[8] 胡艳羽, 赵龙, 董祥军.
一种用于癌症分类的两阶段深度特征选择提取算法
Two-stage Deep Feature Selection Extraction Algorithm for Cancer Classification
计算机科学, 2022, 49(7): 73-78. https://doi.org/10.11896/jsjkx.210500092
[9] 张洪博, 董力嘉, 潘玉彪, 萧宗志, 张惠臻, 杜吉祥.
视频理解中的动作质量评估方法综述
Survey on Action Quality Assessment Methods in Video Understanding
计算机科学, 2022, 49(7): 79-88. https://doi.org/10.11896/jsjkx.210600028
[10] 杜丽君, 唐玺璐, 周娇, 陈玉兰, 程建.
基于注意力机制和多任务学习的阿尔茨海默症分类
Alzheimer's Disease Classification Method Based on Attention Mechanism and Multi-task Learning
计算机科学, 2022, 49(6A): 60-65. https://doi.org/10.11896/jsjkx.201200072
[11] 李小伟, 舒辉, 光焱, 翟懿, 杨资集.
自然语言处理在简历分析中的应用研究综述
Survey of the Application of Natural Language Processing for Resume Analysis
计算机科学, 2022, 49(6A): 66-73. https://doi.org/10.11896/jsjkx.210600134
[12] 邓凯, 杨频, 李益洲, 杨星, 曾凡瑞, 张振毓.
一种可快速迁移的领域知识图谱构建方法
Fast and Transmissible Domain Knowledge Graph Construction Method
计算机科学, 2022, 49(6A): 100-108. https://doi.org/10.11896/jsjkx.210900018
[13] 黄少滨, 孙雪薇, 李熔盛.
基于跨句上下文信息的神经网络关系分类方法
Relation Classification Method Based on Cross-sentence Contextual Information for Neural Network
计算机科学, 2022, 49(6A): 119-124. https://doi.org/10.11896/jsjkx.210600150
[14] 林夕, 陈孜卓, 王中卿.
基于不平衡数据与集成学习的属性级情感分类
Aspect-level Sentiment Classification Based on Imbalanced Data and Ensemble Learning
计算机科学, 2022, 49(6A): 144-149. https://doi.org/10.11896/jsjkx.210500205
[15] 康雁, 吴志伟, 寇勇奇, 张兰, 谢思宇, 李浩.
融合Bert和图卷积的深度集成学习软件需求分类
Deep Integrated Learning Software Requirement Classification Fusing Bert and Graph Convolution
计算机科学, 2022, 49(6A): 150-158. https://doi.org/10.11896/jsjkx.210500065
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!