Computer Science, 2019, Vol. 46, Issue (2): 30-34. doi: 10.11896/j.issn.1002-137X.2019.02.005

• Big Data & Data Science •

Improved ROUSTIDA Algorithm for Missing Data Imputation with Key Attribute in Repetitive Data

FAN Zhe-ning, YANG Qiu-hui, ZHAI Yu-peng, WAN Ying, WANG Shuai

  1. College of Computer Science (Software Engineering), Sichuan University, Chengdu 610065, China
  • Received: 2017-12-05 Online: 2019-02-25 Published: 2019-02-25
  • Corresponding author: YANG Qiu-hui (born 1970), female, associate professor, CCF member. Her main research interests are software testing, empirical software engineering, and database systems and their applications. E-mail: yangqiuhui@scu.edu.cn
  • About the authors: FAN Zhe-ning (born 1994), female, master's degree; main research interests: software analysis and testing. E-mail: fanzheningchn@163.com. ZHAI Yu-peng (born 1992), male, master's degree; main research interest: automated software testing. WAN Ying (born 1993), female, master's degree; main research interests: software analysis and testing. WANG Shuai (born 1992), male, master's degree; main research interest: data mining.

Abstract: As data analysis has grown in prominence, data pre-processing has received increasing attention, and the imputation of missing data in particular has become more important. Building on the ROUSTIDA data-completion algorithm and targeting the characteristics of repetitive data that has a key attribute, this paper proposes an improved ROUSTIDA algorithm called Key&Rpt_RS. Key&Rpt_RS inherits the advantages of ROUSTIDA, takes the repetitiveness of the target data into account, and analyzes the influence of the key attribute on the imputation effect. Experiments on alarm data from a communication network show that Key&Rpt_RS outperforms the traditional ROUSTIDA algorithm in imputing missing data.

Key words: Data pre-processing, Missing data imputation, Repeated data, ROUSTIDA algorithm
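
For orientation, the basic idea behind ROUSTIDA-style completion can be sketched in a few lines. This is a simplified illustration of the general rough-set approach (fill a missing value from records that no known attribute distinguishes from the incomplete one, and leave it empty when those records disagree). It is not the Key&Rpt_RS algorithm proposed in the paper, and the names `tolerant`, `roustida_fill`, and the sample alarm-like records are hypothetical.

```python
from typing import List, Optional

Row = List[Optional[str]]  # None marks a missing attribute value


def tolerant(a: Row, b: Row) -> bool:
    """True when no attribute distinguishes a from b (None matches anything)."""
    return all(x is None or y is None or x == y for x, y in zip(a, b))


def roustida_fill(rows: List[Row]) -> List[Row]:
    """Simplified ROUSTIDA-style fill; returns a completed copy of rows."""
    data = [list(r) for r in rows]
    changed = True
    while changed:  # repeat until a full pass fills nothing new
        changed = False
        snapshot = [list(r) for r in data]  # decide from the state at pass start
        for i, row in enumerate(snapshot):
            neighbours = [
                other for j, other in enumerate(snapshot)
                if j != i and tolerant(row, other)
            ]
            for k, value in enumerate(row):
                if value is not None:
                    continue
                # Known values that indistinguishable records carry for attribute k
                candidates = {n[k] for n in neighbours if n[k] is not None}
                if len(candidates) == 1:  # unanimous: safe to copy
                    data[i][k] = candidates.pop()
                    changed = True
                # zero or conflicting candidates: leave the value missing
    return data


# Hypothetical records: [source id, alarm type, severity]
records = [
    ["A1", "link_down", "major"],
    ["A1", None, "major"],
    ["B2", "power_fail", None],
    ["B2", "power_fail", "critical"],
    ["C3", None, None],
]
filled = roustida_fill(records)
```

In this sketch the second and third records are completed from their repeated counterparts, while the `C3` record stays incomplete because no other record is indistinguishable from it. Values left missing in this way would need a separate fallback strategy.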

CLC Number: TP391