计算机科学 ›› 2019, Vol. 46 ›› Issue (12): 26-30.doi: 10.11896/jsjkx.181202289

• 大数据与数据科学 • 上一篇    下一篇

基于密度约束的对比模式挖掘

柴欣, 高一寒, 武优西, 刘靖宇   

  1. (河北工业大学人工智能与数据科学学院 天津300401);
    (河北省大数据重点实验室 天津300401)
  • 收稿日期:2018-12-11 出版日期:2019-12-15 发布日期:2019-12-17
  • 通讯作者: 武优西(1974-),男,博士,教授,CCF高级会员,主要研究方向为数据挖掘、智能计算,E-mail:wuc567@163.com。
  • 作者简介:柴欣(1962-),男,博士,教授,主要研究方向为数据挖掘、智能计算,E-mail:ch212@126.com;高一寒(1993-),男,硕士生,主要研究方向为数据挖掘;刘靖宇(1976-),男,博士,副教授,主要研究方向为数据挖掘、绿色存储。
  • 基金资助:
    本文受国家自然科学基金项目(61702157,61571180)资助。

Distinguishing Patterns Mining Based on Density Constraint

CHAI Xin, GAO Yi-han, WU You-xi, LIU Jing-yu   

  1. (School of Artificial Intelligence,Hebei University of Technology,Tianjin 300401,China);
    (Hebei Province Key Laboratory of Big Data Calculation,Tianjin 300401,China)
  • Received:2018-12-11 Online:2019-12-15 Published:2019-12-17

摘要: 序列模式挖掘是从序列数据中发现用户感兴趣的模式。对比模式挖掘是其中的一类挖掘方法,其特点是在两类或多类别的序列库中找到特征信息,在实际的生活和生产中应用十分广泛。随着数据规模的不断增加,算法的挖掘效率显得尤为重要,但是当前对比模式挖掘仍存在挖掘速度太慢的问题。为了快速挖掘满足密度约束和间隙约束的对比模式,文中提出了一种近似求解算法ADMD(Approximately Distinguishing Patterns Mining Based on Density Constraint),该算法在模式的挖掘过程中允许存在小部分的模式丢失,从而换取挖掘速度的大幅提升。该算法采用网树的特殊结构来计算模式的支持数;采用模式拼接的方式来生成候选模式;采用预判式剪枝策略对模式进行剪枝,以避免大量冗余模式的生成。但由于在剪枝过程中可能会剪掉一部分非冗余模式,造成挖掘结果并非完备,因此该算法是一种近似求解算法。在ADMD算法的基础上,通过在剪枝策略中设定参数k的方式来得到ADMD-k算法,该算法可以通过设定k的取值来调整剪枝程度,从而在挖掘效率和准确率方面取得平衡。最后在真实的蛋白质数据集上将所提算法与其他算法从挖掘的对比模式数量和挖掘速度方面进行对比实验。实验结果表明,在k=1.5的情况下,所提算法仅用不到原来13%的时间,就可以挖掘到99%以上的模式,具有近似度高、速度快的特点。

关键词: 对比模式, 剪枝策略, 密度约束, 挖掘速度, 网树

Abstract: Sequential patterns mining is to find interest patterns from sequential data.Distinguishing patterns mining is one of the mining methods,which is characterized by finding feature information in two or more categories of sequence databases.It is widely used in real life and production.With the increasing size of data,the efficiency of algorithm mi-ning is particularly important.However,the mining speed of distinguishing patterns mining is too slow at present.In order to quickly mine the distinguishing patterns that satisfy density constraint and gap constraint,this paper proposed an approximate solution algorithm ADMD (Approximately Distinguishing Patterns Mining Based on Density Constraint).This algorithm allows a small number of patterns to be lost in the process of patterns mining in exchange for a large increase in mining speed.In this algorithm,the support of the pattern is calculated by the special structure of the Net tree,the candidate patterns are generated by patterns growth approach,and the patterns are pruned by the prejudgment pruning strategy to avoid the generation of a large number of redundant patterns.However,some non-redundant patterns may be pruned in the pruning process,resulting in incomplete mining results,so the algorithm is an approximate algorithm.Based on ADMD,the ADMD-k algorithm was proposed by setting the parameter k in the pruning strategy.The algorithm can adjust the pruning degree by setting k,to achieve a balance between mining efficiency and accuracy.Finally,in real protein datasets,the number of mining patterns and mining speed are compared with other algorithms.The experimental results verify that when k is 1.5,the proposed algorithm costs no more than 13% of the time,but can find up more than 99% of patterns.Therefore,the proposed algorithm is very effective with high approximation rate and high speed.

Key words: Density constraint, Distinguishing patterns, Mining speed, Net tree, Pruning strategy

中图分类号: 

  • TP311
[1]AGRAWAL R,SRIKANT R.Mining sequential patterns[C]//Proceedings of 11th International Conference on Data Enginee-ring.1995:3-14.
[2]EXARCHOS T P,TSIPOURAS M G,PAPALOUKAS C,et al.A two-stage methodology for sequence classification based on sequential pattern mining and optimization [J].Data & Know-ledge Engineering,2008,66(3):467-487.
[3]DING B,LO D,HAN J,et al.Efficient mining of closed repetitive gapped subsequences from a sequence database[C]//IEEE International Conference on Data Engineering.Shanghai,China:IEEE Computer Society,2009:1024-1035.
[4]WANG H,DING S F.Research and development of sequential pattern mining [J].Computer Science,2009,36(12):14-17.
[5]MAO G J,HU D J,XIE S Y.Models and algorithms for classi- fying big data based on distributed data streams [J].Chinese Journal of Computers,2017,40(1):161-175.
[6]WU Y X,SHEN C,JIANG H,et al.Strict pattern matching under non-overlapping condition [J].Science China Information Sciences,2017,60(1):012101.
[7]TAN C D,MIN F,WANG M,et al.Discovering patterns with weak-wildcard gaps [J].IEEE Access,2016,4(1):4922-4932.
[8]MIN F,ZHANG Z H,ZHAI W J,et al.Frequent pattern disco- very with tri-partition alphabets [J].Information Sciences.doi:10.1016/j.ins.2018.04.013.
[9]GONG Y S,XU T T,DONG X J,et al.e-NSPFI:Efficient mi- ning negative sequential pattern from both frequent and infrequent positive sequential patterns [J].International Journal of Pattern Recognition and Artificial Intelligence,2017,31(2):1-20.
[10]DONG X J,GONG Y S,CAO L B.F-NSP+:A fast negative sequential patterns mining method with self-adaptive data storage [J].Pattern Recognition,2018(84):13-27.
[11]WU Y X,TONG Y,ZHU X Q,et al.NOSEP:Nonoverlapping sequence pattern mining with map constraints [J].IEEE Transactions on Cybernetics,2018,48(10):2809-2822.
[12]WANG H F,DUAN L,ZUO J,et al.Efficient mining of distinguishing sequential patterns without a predefined gap constraint[J].Chinese Journal of Computers,2016,39(10):1979-1991.
[13]YANG H,DUAN L,HU B,et al.Mining top-k distinguishing sequential patterns with gap constraint[J].Journal of Software,2015,26(11):2994-3009.
[14]WANG X M,DUAN L,DONG G Z,et al.Efficient mining of density-aware distinguishing sequential patterns with gap constraints[C]//International Conference on Database Systems for Advanced Applications.2014:372-387.
[15]GHOSH S,FENG M,NGUYEN H,et al.Risk prediction for acute hypotensive patients by using gap constrained sequential contrast patterns [C]//AMIA Annual Symposium Proceedings American Medical Informatics Association.2014:1748-1757.
[16]ZHENG Z G,WEI W,LIU C M,et al.An effective contrast sequential pattern mining approach to taxpayer behavior analysis [J].World Wide Web-Internet & Web Information Systems,2016,19(4):633-651.
[17]WEI Q S,WU Y X,LIU J Y,et al.Distinguishing sequence patterns mining based on density on density and gap constraints [J].Computer Science,2018,45(4):252-256.
[18]WU Y X,WANG L L,REN J D,et al.Mining sequential patterns with periodic wildcard gaps [J].Applied Intelligence,2014,41(1):99-116.
[1] 魏芹双,武优西,刘靖宇,朱怀忠.
基于密度约束和间隙约束的对比模式挖掘
Distinguishing Sequence Patterns Mining Based on Density and Gap Constraints
计算机科学, 2018, 45(4): 252-256. https://doi.org/10.11896/j.issn.1002-137X.2018.04.042
[2] 董雷刚,刘国华,崔晓微.
PPQ:一种基于区域划分的c-skyline查询算法
PPQ:Finding Combinatorial Skyline Based on Partition
计算机科学, 2018, 45(1): 267-272. https://doi.org/10.11896/j.issn.1002-137X.2018.01.047
[3] 季辉,丁泽军.
双人博弈问题中的蒙特卡洛树搜索算法的改进
Improvement of Monte Carlo Tree Search Algorithm in Two-person Game Problem
计算机科学, 2018, 45(1): 140-143. https://doi.org/10.11896/j.issn.1002-137X.2018.01.023
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!