Computer Science, 2022, Vol. 49, Issue (8): 56-63. doi: 10.11896/jsjkx.210600180

• Database & Big Data & Data Science

RIIM: Real-Time Imputation Based on Individual Models

LI Xia, MA Qian, BAI Mei, WANG Xi-te, LI Guan-yu, NING Bo

  1. School of Information Science & Technology, Dalian Maritime University, Dalian, Liaoning 116026, China
  • Received: 2021-06-28 Revised: 2021-10-19 Published: 2022-08-02
  • Corresponding author: MA Qian (maqian@dlmu.edu.cn)
  • About author: LI Xia (lixia_email@163.com), born in 1997, postgraduate, is a member of China Computer Federation. Her main research interests include data cleaning and sensory data management.
    MA Qian, born in 1988, Ph.D., is a member of China Computer Federation. Her main research interests include data cleaning and sensory data management.
  • Supported by:
    National Natural Science Foundation of China (62002039, 61602076, 61702072, 61976032), China Postdoctoral Science Foundation Funded Projects (2017M611211, 2017M621122, 2019M661077), Natural Science Foundation of Liaoning Province (20180540003), CERNET Innovation Project (NGII20190902) and Fundamental Research Funds for Central Universities (3132021239).

Abstract: As data sources multiply, data become easier to obtain but their quality is harder to guarantee, so missing values (MVs) are ubiquitous in real-world datasets and hard to avoid. MV imputation has consequently become one of the classical problems in the field of data quality management. However, most existing MV imputation approaches are designed for static data and cannot handle dynamic data streams arriving at high speed. Moreover, most of them do not consider data sparsity and heterogeneity simultaneously. To address this, a novel online MV imputation approach, real-time imputation based on individual models (RIIM), is proposed. RIIM takes both the sparsity and the heterogeneity of data into account and fills MVs by combining the basic ideas of neighbor-based imputation and regression-based imputation. First, to cope with the dynamic, real-time nature of data streams, an efficient algorithm for incrementally updating the imputation model is proposed. Second, to address the high time cost of neighbor search and the difficulty of determining the number of neighbors, an adaptive and periodic updating strategy for the optimal neighbors is proposed. Finally, the effectiveness of RIIM is verified through extensive experiments on real-world datasets.

Key words: Data streams, Heterogeneity, Missing value, Real-time imputation, Sparsity
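
The abstract describes combining neighbor-based and regression-based imputation with incremental model updates. The following is a minimal sketch of that general idea, not the RIIM algorithm itself: each complete tuple keeps its own small linear model, new arrivals update the models of their nearest neighbors through running sums instead of a full refit, and a missing value is filled by averaging the predictions of the nearest tuples' models. The names `IncrementalLinearModel` and `StreamImputer`, the single-feature setting, the fixed `k`, and the linear-scan neighbor search are all assumptions made for illustration; the adaptive, periodic neighbor update mentioned in the abstract is not implemented here.

```python
from dataclasses import dataclass


@dataclass
class IncrementalLinearModel:
    """Least-squares line y = a + b*x, stored as running sums (hypothetical helper)."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def update(self, x: float, y: float) -> None:
        # O(1) incremental update: no past tuples need to be revisited.
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def predict(self, x: float) -> float:
        if self.n == 0:
            raise ValueError("model has seen no data")
        denom = self.n * self.sxx - self.sx * self.sx
        if abs(denom) < 1e-12:
            # Fewer than two distinct x values: fall back to the mean of y.
            return self.sy / self.n
        b = (self.n * self.sxy - self.sx * self.sy) / denom
        a = (self.sy - b * self.sx) / self.n
        return a + b * x


class StreamImputer:
    """One individual model per complete tuple; fixed k is an assumption."""

    def __init__(self, k: int = 3) -> None:
        self.k = k
        self.points: list[tuple[float, float]] = []
        self.models: list[IncrementalLinearModel] = []

    def _nearest(self, x: float) -> list[int]:
        # Plain linear scan over stored tuples, sorted by distance on x.
        order = sorted(range(len(self.points)),
                       key=lambda i: abs(self.points[i][0] - x))
        return order[: self.k]

    def add_complete(self, x: float, y: float) -> None:
        neighbours = self._nearest(x)
        model = IncrementalLinearModel()
        model.update(x, y)
        for i in neighbours:
            model.update(*self.points[i])  # seed the new tuple's own model
            self.models[i].update(x, y)    # update neighbours incrementally
        self.points.append((x, y))
        self.models.append(model)

    def impute(self, x: float) -> float:
        neighbours = self._nearest(x)
        if not neighbours:
            raise ValueError("no complete tuples seen yet")
        preds = [self.models[i].predict(x) for i in neighbours]
        return sum(preds) / len(preds)


if __name__ == "__main__":
    imputer = StreamImputer(k=3)
    for x in range(6):
        imputer.add_complete(float(x), 2.0 * x + 1.0)  # tuples on y = 2x + 1
    print(imputer.impute(2.5))  # close to 6.0 for this noise-free stream
```

Keeping only running sums per model is what makes the update constant-time per arriving tuple; the cost that remains is the neighbor search, which is the part the abstract says RIIM addresses with its adaptive periodic strategy.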

CLC Number:

  • TP3-05