Computer Science ›› 2016, Vol. 43 ›› Issue (4): 264-269. doi: 10.11896/j.issn.1002-137X.2016.04.054

• Artificial Intelligence •

Nonlinear Normalization for Non-uniformly Distributed Data

LIANG Lu, LI Jian, HUO Ying-xiang and TENG Shao-hua

  1. School of Computers, Guangdong University of Technology, Guangzhou 510006
  • Published: 2018-12-01 Online: 2018-12-01
  • Supported by:
    This work was supported by the National 863 Program Major Project (2013AA01A212) and the National Natural Science Foundation of China.

Abstract: Traditional data normalization usually applies a linear transformation. On non-uniformly distributed datasets, this can leave the data points within a local interval too closely spaced, which makes the results of subsequent data mining, especially distance-based mining, less accurate. This paper proposes a nonlinear normalization method for non-uniformly distributed data based on data fitting. Without changing the overall distribution pattern of the data, the method derives a nonlinear transformation function from the data statistics and uses it to rescale each data point nonlinearly, expanding intervals where the data are dense and compressing intervals where the data are sparse, so that mining results become more accurate. Experiments applied three classical classification algorithms, the BP (Back Propagation) neural network, the Support Vector Machine (SVM), and K-Nearest Neighbor (KNN), to different datasets. The results show that the classification error rate decreases to varying degrees and the F1 measure increases.

Key words: Non-uniform distribution, Nonlinear normalization, Data preprocessing
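
The idea of expanding dense intervals and compressing sparse ones can be illustrated with a short sketch. This is not the authors' fitting procedure, which the abstract does not specify. It assumes, purely for illustration, that the nonlinear transformation is a monotone piecewise-linear fit to the empirical cumulative distribution of the training data. The names `fit_nonlinear_normalizer`, `min_max`, and `n_knots` are hypothetical.

```python
import numpy as np


def fit_nonlinear_normalizer(train_values, n_knots=32):
    """Fit a monotone piecewise-linear map from raw values to [0, 1].

    Assumption for illustration: the map approximates the empirical CDF,
    so regions with many points are stretched and regions with few points
    are squeezed. The paper's own fitting function may differ.
    """
    x = np.asarray(train_values, dtype=float)
    # Evenly spaced probabilities give knots that crowd into dense regions.
    probs = np.linspace(0.0, 1.0, n_knots)
    knots = np.quantile(x, probs)
    # np.interp needs increasing x-coordinates, so drop repeated knots.
    knots, first_idx = np.unique(knots, return_index=True)
    probs = probs[first_idx]

    def transform(values):
        # Values outside the training range are clipped to 0 or 1 by np.interp.
        return np.interp(np.asarray(values, dtype=float), knots, probs)

    return transform


def min_max(values, lo, hi):
    """Conventional linear (min-max) normalization, for comparison."""
    return (np.asarray(values, dtype=float) - lo) / (hi - lo)
```

On a dataset with most points packed into a narrow range and a few spread far away, the gap between two nearby dense points is much larger after `transform` than after `min_max`, which is the effect the abstract attributes to nonlinear normalization. The order of the values is preserved in both cases.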

