Computer Science, 2022, Vol. 49, Issue (8): 56-63. doi: 10.11896/jsjkx.210600180
李霞, 马茜, 白梅, 王习特, 李冠宇, 宁博
LI Xia, MA Qian, BAI Mei, WANG Xi-te, LI Guan-yu, NING Bo
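Abstract: As data sources continue to diversify, data has become easier to obtain, but its quality is hard to guarantee. As a result, missing values are pervasive and difficult to avoid in real-world datasets, and missing value imputation has become one of the classic problems in data quality management. At present, most missing value imputation algorithms are designed for static data and are not suitable for dynamic data streams that arrive at high speed; moreover, most existing algorithms do not consider data sparsity and heterogeneity at the same time. To address this, this paper proposes RIIM, a new online missing value imputation algorithm based on individual models. The algorithm accounts for both the sparsity and the heterogeneity of the data, and combines the basic ideas of nearest-neighbor imputation and regression imputation to fill in missing values effectively. First, to handle the dynamic, real-time nature of the data, an efficient incremental update algorithm for the imputation models is proposed. Second, to address the high time cost of nearest-neighbor search and the difficulty of choosing the number of neighbors, an adaptive periodic update strategy for the optimal neighbors is proposed. Finally, extensive experiments on real datasets verify the effectiveness of the proposed algorithm.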
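The abstract names two ingredients, incremental model updates and a combination of nearest-neighbor and regression imputation, without giving details. The following is a minimal sketch of those two ideas only and is not the RIIM algorithm itself. It assumes a single observed attribute `x` and a single attribute `y` that may be missing; the class name `OnlineImputer`, the parameters `k` and `window`, and the equal-weight blend of the two estimates are all hypothetical choices made for illustration.

```python
from collections import deque


class OnlineImputer:
    """Toy online imputer for a target y given one observed attribute x.

    Illustrative only (not RIIM): blends a regression estimate with a
    k-nearest-neighbor estimate, both maintained incrementally.
    """

    def __init__(self, k=3, window=100):
        self.k = k
        # Sliding window of recent complete (x, y) tuples used for k-NN.
        self.recent = deque(maxlen=window)
        # Running sums that fully determine the least-squares line y = a*x + b.
        self.n = 0
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def update(self, x, y):
        """Absorb one complete tuple in O(1), without refitting from scratch."""
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y
        self.recent.append((x, y))

    def _regression(self, x):
        """Regression estimate, or None if the line is not yet determined."""
        denom = self.n * self.sxx - self.sx ** 2
        if self.n < 2 or denom == 0:
            return None
        a = (self.n * self.sxy - self.sx * self.sy) / denom
        b = (self.sy - a * self.sx) / self.n
        return a * x + b

    def _knn(self, x):
        """Mean y of the k window tuples closest to x, or None if empty."""
        if not self.recent:
            return None
        nearest = sorted(self.recent, key=lambda t: abs(t[0] - x))[: self.k]
        return sum(y for _, y in nearest) / len(nearest)

    def impute(self, x):
        """Average whichever of the two estimates are currently available."""
        estimates = [e for e in (self._regression(x), self._knn(x)) if e is not None]
        if not estimates:
            raise ValueError("no complete tuples seen yet")
        return sum(estimates) / len(estimates)
```

In a streaming setting, each arriving tuple with `y` present would be passed to `update(x, y)`, and each tuple with `y` missing would be filled with `impute(x)`. The fixed `k` here is the simplification that the paper's adaptive periodic neighbor update strategy is meant to avoid.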