High-dimensional Data Discretization Method Based on Improved LLE

Abstract

Abstract: Discretization of continuous features plays an important role in data mining, machine learning, and pattern recognition. Existing discretization methods mainly handle low-dimensional data, whereas real-world data are often high-dimensional and nonlinear. To address this, the paper proposes ILLE-HD3, a high-dimensional data discretization method based on improved locally linear embedding (LLE). First, LLE is improved by taking the class information of the data into account, so that dimensionality is reduced effectively and discretization can be carried out in a low-dimensional space. Second, on the reduced data, a discretization algorithm for continuous features based on the difference-similitude set (DSS) is proposed. The algorithm uses the interdependency between classes and features to decide where cut points are placed in the continuous domain, and defines a classification error criterion through DSS theory to control the information loss caused by partitioning the continuous domain. Finally, performance is evaluated with the decision tree classifiers C4.5 and C5.0. The results show that ILLE-HD3 performs well on high-dimensional nonlinear data and achieves higher classification accuracy than existing methods.
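The abstract does not spell out how class-feature interdependency selects cut points, so the following is only a minimal sketch of the general idea, using the well-known CAIM criterion of Kurgan and Cios as a stand-in. It is not the DSS-based algorithm of ILLE-HD3, and the function names `caim_score` and `best_single_cut` are hypothetical names chosen for illustration.

```python
from bisect import bisect_right
from collections import Counter


def caim_score(values, labels, cut_points):
    """CAIM-style interdependency score for one feature.

    Assumption: interval r holds the values v with cuts[r-1] <= v < cuts[r].
    The score averages (dominant class count)^2 / (interval size) over all
    intervals, so intervals dominated by a single class score higher.
    """
    cuts = sorted(cut_points)
    intervals = [Counter() for _ in range(len(cuts) + 1)]
    for v, y in zip(values, labels):
        # bisect_right gives the index of the interval that v falls into
        intervals[bisect_right(cuts, v)][y] += 1
    total = 0.0
    for counts in intervals:
        size = sum(counts.values())
        if size:  # empty intervals contribute nothing
            total += max(counts.values()) ** 2 / size
    return total / len(intervals)


def best_single_cut(values, labels):
    """Return the midpoint between adjacent distinct values that maximizes the score."""
    xs = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    return max(candidates, key=lambda c: caim_score(values, labels, [c]))


if __name__ == "__main__":
    values = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
    labels = ["a", "a", "a", "b", "b", "b"]
    print(best_single_cut(values, labels))
```

On the toy data above, the cut that separates the two classes cleanly receives the highest score. A full discretizer would add cut points greedily and stop when the criterion no longer improves; the paper additionally bounds information loss with a DSS-based classification error criterion, which this sketch omits.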

Key words: High-dimensional data, Locally linear embedding (LLE), Discretization, Class-feature interdependency, Difference-similitude set (DSS)

XU Tong-de. High-dimensional Data Discretization Method Based on Improved LLE[J]. Computer Science, 2015, 42(Z6): 146-150. https://doi.org/

