Computer Science (计算机科学) ›› 2012, Vol. 39 ›› Issue (Z11): 207-211.

• Software Engineering •

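Data Cleaning and Its General System Framework

Cao Jianjun (曹建军), Diao Xingchun (刁兴春), Chen Shuang (陈爽), Shao Yanzhen (邵衍振)

  1. (The 63rd Research Institute of the General Staff Headquarters, Nanjing 210007); (Institute of Command Automation, PLA University of Science and Technology, Nanjing 210007); (Unit 71435 of the Chinese People's Liberation Army, Zibo 255000)
  • Online: 2018-11-16 Published: 2018-11-16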

Abstract: Data cleaning is one of the important means of improving data quality. This paper studies data cleaning and its system framework by comparing data products with traditional physical products and software products. Data cleaning is the starting point of data quality research; its status and role are identified from the perspective of the development of data quality, and it is likened to fault diagnosis and maintenance for other forms of product. Ten explanatory points about data cleaning are given to further clarify its basic meaning. Data cleaning is compared with data integration, and the two are shown to be coequal concepts in data quality. A general system framework for data cleaning is proposed. It consists of five phases: preparation, detection, location, modification, and validation. The framework allows the process to stop at multiple points in order to accomplish different data cleaning tasks, and it is flexible, extensible, interactive, and loosely coupled.

Key words: Data quality, Data cleaning, Approximate duplicate records, Incomplete records, Framework
