Confidence Interval Method for Classification Usability Evaluation of Data Sets

Computer Science, 2019, Vol. 46, Issue (1): 78-85. doi: 10.11896/j.issn.1002-137X.2019.01.012

• The 7th China Conference on Data Mining (2018) •

TAN Xun-tao, GU Yi-yi, RUAN Tong, YUAN Yu-bo

(Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai 200237, China)

Received: 2018-06-08 Online: 2019-01-15 Published: 2019-02-25
About the authors: TAN Xun-tao (born 1994), male, master's student; his main research interests are data quality assessment, data mining, and machine learning. GU Yi-yi (born 1994), female, master's student; her main research interests are data quality assessment, data mining, and machine learning. RUAN Tong (born 1973), female, Ph.D., professor; her main research interests include natural language processing and data quality assessment. YUAN Yu-bo (born 1976), male, Ph.D., associate professor; his main research interests include data quality assessment, data mining, and machine learning; E-mail: ybyuan@ecust.edu.cn (corresponding author).
Funding:
Supported by the National Natural Science Foundation of China (61772201) and the Science and Technology Commission of Shanghai Municipality (16511101000, 17DZ11011003).

Abstract

Abstract: Evaluating the usability of training data sets effectively has long been a difficult problem that hinders the application of intelligent classification systems. For data classification problems in machine learning, this paper proposes a method, based on interval analysis and information granulation, for evaluating the classification usability of a data set, that is, its degree of separability. The method defines the data set under evaluation as a classification information system, introduces the concept of a classification confidence interval, and performs information granulation through interval analysis. Under this granulation strategy, a mathematical model of classification usability is defined, and calculation methods are given for the classification usability of a single attribute and of the whole data set. Eighteen UCI standard data sets are selected as evaluation objects, and the classification usability results for some of them are reported; three classifiers are then used to run classification experiments on the selected data sets. Analysis of these experimental results verifies the effectiveness and feasibility of the evaluation method.

Key words: Classification system, Classification usability, Data usability, Information granulation, Interval analysis
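The abstract does not state the paper's mathematical model, so the following is only an illustrative sketch of the general idea of interval-based granulation, not the authors' method. It assumes, hypothetically, that the interval of a class on one attribute is the class mean plus or minus z standard deviations, that an attribute's score is the fraction of samples falling inside their own class interval and no other, and that the data set score is the average over attributes. The function names and the parameter `z` are invented for this example.

```python
from statistics import mean, pstdev


def class_intervals(values, labels, z=2.0):
    """Return {label: (low, high)} with bounds mean -/+ z * std per class.

    Assumption: this interval form is illustrative; the paper's own
    definition of a classification confidence interval is not given here.
    """
    by_class = {}
    for v, y in zip(values, labels):
        by_class.setdefault(y, []).append(v)
    intervals = {}
    for y, vs in by_class.items():
        m, s = mean(vs), pstdev(vs)
        intervals[y] = (m - z * s, m + z * s)
    return intervals


def attribute_usability(values, labels, z=2.0):
    """Fraction of samples covered by their own class interval only."""
    intervals = class_intervals(values, labels, z)
    hits = 0
    for v, y in zip(values, labels):
        covering = [c for c, (lo, hi) in intervals.items() if lo <= v <= hi]
        if covering == [y]:
            hits += 1
    return hits / len(values)


def dataset_usability(rows, labels, z=2.0):
    """Average the per-attribute scores over all columns of `rows`."""
    columns = list(zip(*rows))
    return mean(attribute_usability(col, labels, z) for col in columns)
```

Under these assumptions, an attribute whose class intervals do not overlap scores close to 1, while an attribute whose class intervals coincide scores 0, which matches the intuition of using the score as a rough measure of separability.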

CLC Number: TP391

TAN Xun-tao, GU Yi-yi, RUAN Tong, YUAN Yu-bo. Confidence Interval Method for Classification Usability Evaluation of Data Sets[J]. Computer Science, 2019, 46(1): 78-85. https://doi.org/10.11896/j.issn.1002-137X.2019.01.012


