Computer Science (计算机科学), 2017, Vol. 44, Issue (2): 98-102. doi: 10.11896/j.issn.1002-137X.2017.02.013

• 2016 13th National Conference on Web Information Systems and Applications •

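Missing Data Imputation Approach Based on Tuple Similarity

WANG Jun-lu, WANG Ling, WANG Yan, SONG Bao-yan

  1. School of Information, Liaoning University, Shenyang 110036
  2. School of Information and Engineering, Northeastern University, Shenyang 110819
  • Online: 2018-11-13   Published: 2018-11-13
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61472169, 61472072), the National Science and Technology Support Program (2012BAF13B08), the Preliminary Research Special Project of the National "973" Key Basic Research Development Program (2014CB360509), the Liaoning Province Science Public Welfare Research Fund (2015003003), and the Liaoning University Research Fund (Science and Technology) (LDQN2015001).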

Abstract: With the development of the Internet and information technology, missing and corrupted data have become increasingly common. As data collection shifts from manual to machine-based processes, unstable storage media and omissions during network transmission make the problem more severe. A large number of missing values in a database not only degrades the quality of user queries but also undermines the correctness of data mining and data analysis results, which in turn can mislead decision-making. At present there is no general-purpose method for imputing missing data; most strategies handle only one type of missing value. To address the complex case in which different missing types occur together in incomplete data, this paper proposes an incomplete data imputation approach based on tuple similarity (IATS). Weighted association rules are extracted from the incomplete data set by data mining and used to impute normal missing data. For abnormal missing data, a data recommendation algorithm is introduced: a recommendation screening strategy computes tuple similarity and performs the corresponding imputation, which substantially improves the effective utilization of the data and the quality of user query results. Experiments show that IATS achieves better accuracy while maintaining the filling rate.

Key words: Massive data, Missing type, Weighted association rules, Tuple similarity
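
The abstract does not specify how tuple similarity is computed, so the following is only a rough Python sketch of similarity-based imputation in general, not the IATS algorithm itself. It assumes that similarity is the fraction of commonly known attributes on which two tuples agree, and that a missing value is filled by majority vote among the `k` most similar tuples. The names `similarity`, `impute`, `MISSING`, and `k`, as well as the toy rows, are hypothetical.

```python
from collections import Counter

MISSING = None  # assumed marker for a missing attribute value


def similarity(a, b):
    """Fraction of attributes known in both tuples on which they agree."""
    shared = [(x, y) for x, y in zip(a, b)
              if x is not MISSING and y is not MISSING]
    if not shared:
        return 0.0
    return sum(x == y for x, y in shared) / len(shared)


def impute(target, candidates, k=3):
    """Fill each missing attribute of `target` from its k most similar candidates."""
    filled = list(target)
    for i, value in enumerate(target):
        if value is not MISSING:
            continue
        # Only tuples that have a value for attribute i can donate one.
        donors = [c for c in candidates if c[i] is not MISSING]
        donors.sort(key=lambda c: similarity(target, c), reverse=True)
        # Majority vote among the k most similar donors.
        votes = Counter(c[i] for c in donors[:k])
        if votes:
            filled[i] = votes.most_common(1)[0][0]
    return tuple(filled)


# Toy data, invented purely for illustration.
rows = [
    ("a", "x", "high"),
    ("a", "x", "high"),
    ("b", "y", "low"),
]
result = impute(("a", "x", MISSING), rows, k=2)
print(result)
```

In this sketch a tuple with no usable donors keeps its missing value, which trades filling rate for accuracy; the paper's actual screening strategy may balance the two differently.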

