Computer Science ›› 2019, Vol. 46 ›› Issue (4): 57-65. doi: 10.11896/j.issn.1002-137X.2019.04.009

• Big Data & Data Science •

Data Scaling Method for Multi-scale Data Mining

ZHANG Fang (张昉), ZHAO Shu-liang (赵书良), WU Yong-liang (武永亮)

  1. College of Mathematics & Information Science, Hebei Normal University, Shijiazhuang 050024, China
    Hebei Key Laboratory of Computational Mathematics & Applications, Hebei Normal University, Shijiazhuang 050024, China
  • Received: 2018-09-04 Online: 2019-04-15 Published: 2019-04-23
  • Corresponding author: ZHAO Shu-liang (born 1967), male, Ph.D., professor, doctoral supervisor, CCF member. His main research interests are data mining and intelligent information processing. E-mail: zhaoshuliang@sina.com
  • About the authors: ZHANG Fang (born 1993), female, master's student. Her main research interests are data mining and intelligent information processing. E-mail: zhangfangapril@outlook.com. WU Yong-liang (born 1986), male, doctoral student, CCF member. His main research interests are data mining and intelligent information processing.
  • Supported by:
    the National Natural Science Foundation of China (71271067), the Major Program of the National Social Science Foundation of China (13&ZD091, 18ZDA200), and the Master's Fund Project of Hebei Normal University (CXZZSS2017048).

Abstract: Multi-scale mining has been applied in graphics and imaging, geographic information, signal analysis, and data mining, and multi-scale data mining has been studied and applied in association rule, clustering, and classification mining. However, how to partition a dataset into scales in a generally applicable way and how to construct multi-scale datasets have not been studied in depth. Starting from the task of multi-scale data mining, this paper defines the concept of scale and gives a multi-scale dataset model and a benchmark scale scoring model. Based on a discretization method that uses probability density estimation, it proposes a multi-scale partition algorithm that extends the data types for which scales can be partitioned, produces partitions closer to the multi-scale characteristics of the data, and has low time complexity. The paper also proposes a method for multi-scaling a dataset, an algorithm for constructing multi-scale datasets, and a benchmark scale selection algorithm, with multi-scale entropy and information entropy as the evaluation measures. While extending the multi-scaling method, these effectively weaken the scale effect caused by scale derivation in multi-scale data mining, and the time complexity of the algorithms remains controllable. The proposed algorithms and models were validated and analyzed experimentally on a real population dataset from H province, UCI public datasets, and the IBM T10I4D100K dataset. The results show that the multi-scale partition algorithm and the multi-scaling method are feasible, and that the multi-scaling method and the benchmark scale scoring model are effective. Applying the multi-scale partition, multi-scale dataset construction, and benchmark scale selection methods improves, on average, coverage by 1.6%, F1-measure by 2.1%, and accuracy by 3.7% during scale derivation, with a low average support error.

Key words: Construction of multi-scale datasets, Discretization, Information entropy, Multi-scale data mining, Multi-scale entropy, Multi-scale scaling, Reference scale selection

CLC Number:

  • TP391
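
The abstract describes partitioning a numeric attribute into scales with a discretization method based on probability density estimation, and evaluating the result with information entropy. The paper's actual algorithm is not reproduced on this page, so the following is only a minimal sketch of the general idea under stated assumptions: a fixed-bandwidth Gaussian kernel density estimate, cut points placed at local minima of the estimated density, and Shannon entropy computed over the resulting intervals. The function names (`gaussian_kde`, `density_cut_points`, `assign_intervals`, `shannon_entropy`), the bandwidth, and the sample data are all hypothetical, and multi-scale entropy is not implemented here.

```python
import math
from collections import Counter


def gaussian_kde(values, bandwidth):
    """Return a function that estimates the density of `values`
    using Gaussian kernels of a fixed bandwidth (assumed, not from the paper)."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density


def density_cut_points(values, bandwidth, grid_size=200):
    """Place cut points at local minima of the estimated density,
    evaluated on a regular grid between min(values) and max(values)."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (grid_size - 1)
    grid = [lo + i * step for i in range(grid_size)]
    density = gaussian_kde(values, bandwidth)
    dens = [density(x) for x in grid]
    return [
        grid[i]
        for i in range(1, grid_size - 1)
        if dens[i] < dens[i - 1] and dens[i] <= dens[i + 1]
    ]


def assign_intervals(values, cuts):
    """Map each value to the index of the interval it falls in
    (the number of cut points strictly below it)."""
    return [sum(v > c for c in cuts) for v in values]


def shannon_entropy(labels):
    """Shannon (information) entropy, in bits, of a list of discrete labels."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical sample: ages forming three well-separated groups.
ages = [1, 2, 3, 20, 21, 22, 50, 51, 52]
cuts = density_cut_points(ages, bandwidth=2.0)
labels = assign_intervals(ages, cuts)
entropy = shannon_entropy(labels)
```

On this sample the density has two valleys, so two cut points separate the three groups, and the entropy of three equally sized intervals is log2(3) bits. The bandwidth controls how coarse the partition is: a larger value merges nearby groups into one interval, which is one plausible way to obtain coarser scales from the same attribute.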