Computer Science, 2017, Vol. 44, Issue (2): 98-102, 106. doi: 10.11896/j.issn.1002-137X.2017.02.013


Missing Data Imputation Approach Based on Tuple Similarity

WANG Jun-lu, WANG Ling, WANG Yan and SONG Bao-yan   

  • Online:2018-11-13 Published:2018-11-13

Abstract: With the development of the Internet and information technology, data loss, corruption, and related problems have become increasingly common. In particular, as data collection has shifted from manual to machine-based methods, unstable storage media, transmission omissions, and similar causes have made missing data more severe. A large number of missing values in a database not only degrades query quality but also reduces the accuracy of data mining and data analysis results. At present there is no general method for handling missing data; most strategies target a single type of missing value. To address the more complex case in which different missing types occur together in the same incomplete data set, this paper proposes a missing data imputation approach based on tuple similarity (IATS). Weighted association rules are mined from the incomplete data set and used to impute normal missing data. For abnormal missing data, the paper introduces a data recommendation algorithm that computes tuple similarity, screens the recommended candidates, and fills in the corresponding values. This greatly improves the effective utilization of the data and the quality of user query results. Experimental results show that IATS achieves better accuracy while maintaining the filling ratio.
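
The idea of filling a missing value from a similar tuple can be illustrated with a minimal sketch. It is not the authors' IATS implementation: the abstract gives neither the similarity formula nor the screening strategy. The sketch therefore assumes a simple similarity (the fraction of co-observed attributes on which two tuples agree) and a hypothetical threshold `min_sim`; the function names `tuple_similarity` and `impute_by_similarity` are also invented for illustration.

```python
from typing import Optional, Sequence

Row = Sequence[Optional[str]]


def tuple_similarity(a: Row, b: Row) -> float:
    """Assumed measure: share of attributes observed in both tuples that agree."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return 0.0
    return sum(x == y for x, y in shared) / len(shared)


def impute_by_similarity(rows: list[Row], min_sim: float = 0.5) -> list[list[Optional[str]]]:
    """Fill each missing cell from the most similar row that has that cell.

    A cell stays missing (None) when no candidate reaches min_sim
    (a hypothetical screening threshold, not taken from the paper).
    """
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            best_sim, best_value = 0.0, None
            for k, other in enumerate(rows):
                # Skip the row itself and rows that also lack attribute j.
                if k == i or other[j] is None:
                    continue
                sim = tuple_similarity(row, other)
                if sim > best_sim:
                    best_sim, best_value = sim, other[j]
            if best_sim >= min_sim:
                filled[i][j] = best_value
    return filled


rows = [
    ["a", "x", "1"],
    ["a", "x", None],   # agrees with the first row on both observed attributes
    ["b", "y", "2"],
    [None, "z", "3"],   # no row is similar enough, so the gap is left unfilled
]
result = impute_by_similarity(rows)
```

Leaving a cell unfilled when no candidate passes the threshold reflects the trade-off the abstract mentions between accuracy and filling ratio: a lower `min_sim` fills more cells but accepts weaker matches.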

Key words: Massive data, Missing type, Weighted association rules, Tuple similarity

