Computer Science, 2018, Vol. 45, Issue (4): 1-10. doi: 10.11896/j.issn.1002-137X.2018.04.001

History and Development Tendency of Data Quality

CAI Li, LIANG Yu, ZHU Yang-yong and HE Jing   

  • Online: 2018-04-15  Published: 2018-05-11

Abstract: In the Internet age, data has become a new factor of production, a basic and strategic resource, and an important productive force. Big data services have been widely deployed in China, and data exchanges have been established. Data quality has now become a key issue restricting the development of the data industry. This paper divides research on data quality into three stages in chronological order and summarizes the representative results of each stage, including methodologies, techniques, models, tools and frameworks. It then analyzes the challenges and opportunities that data quality research faces in the new environment of big data, the Internet of Things and cloud computing. Finally, it outlines the research focuses and development trends of data quality from six aspects: data quality models, quality management of big data, related quality techniques for big data, crowdsourcing, the Internet of Things and data sharing.

Key words: Data quality, History, Development trend, Big data

