Computer Science (计算机科学) ›› 2017, Vol. 44 ›› Issue (2): 98-102. doi: 10.11896/j.issn.1002-137X.2017.02.013
• 2016 13th National Conference on Web Information Systems and Applications •
王俊陆,王玲,王妍,宋宝燕
WANG Jun-lu, WANG Ling, WANG Yan and SONG Bao-yan
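Abstract: With the development of the Internet and information technology, problems such as missing and corrupted data are increasingly common. In particular, as data collection shifts from manual to machine-based work, unstable storage media and omissions during network transmission make data loss more severe. Large numbers of missing values in a database not only seriously degrade the quality of user queries but also compromise the correctness of data mining and data analysis results, which in turn misleads decision-making. At present there is no broadly general method for imputing missing data; most strategies handle only one particular type of missing value. Therefore, for the complex case in which different missing types occur together in incomplete data, an incomplete data imputation method based on tuple similarity (IATS) is proposed. Data mining is used to extract weighted association rules from the incomplete dataset, and regular missing data are filled according to these rules. For abnormal missing data in the dataset, a data recommendation algorithm is introduced: a recommendation filtering strategy computes tuple similarity and performs the corresponding imputation, which greatly improves the effective utilization of the data and the quality of user query results. Experiments show that the IATS strategy achieves better accuracy while maintaining the fill rate.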
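The abstract does not define the similarity measure or the recommendation filtering strategy, so the following is only a minimal sketch of the general idea of filling a missing value from similar tuples. It is not the IATS algorithm. The names `MISSING`, `tuple_similarity`, and `impute`, the match-ratio similarity, and the parameter `k` are all assumptions made for illustration.

```python
from collections import Counter

MISSING = None  # assumed marker for a missing attribute value


def tuple_similarity(a, b):
    """Fraction of attributes observed in both tuples on which they agree.

    This simple match ratio is an assumed stand-in; the paper's own
    similarity measure is not given in the abstract.
    """
    shared = [(x, y) for x, y in zip(a, b)
              if x is not MISSING and y is not MISSING]
    if not shared:
        return 0.0
    return sum(x == y for x, y in shared) / len(shared)


def impute(rows, k=3):
    """Fill each missing value by majority vote among the k most similar donors.

    A donor is any other tuple that has the target attribute observed.
    Values with no available donor are left missing.
    """
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not MISSING:
                continue
            donors = [r for n, r in enumerate(rows)
                      if n != i and r[j] is not MISSING]
            # Rank donors by similarity to the incomplete tuple, best first.
            donors.sort(key=lambda r: tuple_similarity(row, r), reverse=True)
            votes = Counter(r[j] for r in donors[:k])
            if votes:
                filled[i][j] = votes.most_common(1)[0][0]
    return filled


rows = [
    ("a", "x", 1),
    ("a", "x", 1),
    ("b", "y", 2),
    ("a", "x", MISSING),
]
result = impute(rows, k=2)
```

In this toy input the last tuple agrees with the first two on every observed attribute, so with `k=2` those two act as donors and the missing value is filled from them. A fuller treatment in the spirit of the abstract would first apply mined association rules and fall back to similarity only for the cases the rules do not cover.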