计算机科学 (Computer Science) ›› 2019, Vol. 46 ›› Issue (1): 78-85. doi: 10.11896/j.issn.1002-137X.2019.01.012
谈询滔, 顾依依, 阮彤, 袁玉波
TAN Xun-tao, GU Yi-yi, RUAN Tong, YUAN Yu-bo
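Abstract: How to effectively evaluate the usability of a training dataset has long been a difficult problem that hampers the application of intelligent classification systems. For data classification problems in machine learning, this paper proposes a method, based on interval analysis and information granulation, for evaluating the classification usability of a dataset, that is, the degree to which the dataset is separable. The method defines the dataset under evaluation as a classification information system, introduces the concept of a classification confidence interval, and performs information granulation through interval analysis. Under this granulation strategy, a mathematical model of classification usability is defined, and methods are given for computing the classification usability of a single attribute and of the dataset as a whole. Eighteen UCI benchmark datasets are selected as evaluation objects, and classification usability results are reported for some of them. Three classifiers are then used to run classification experiments on the selected datasets, and analysis of these experimental results demonstrates the effectiveness and feasibility of the proposed evaluation method.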
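The abstract does not state the formulas behind the classification confidence interval or the usability model, so the following is only an illustrative sketch of the general idea, not the authors' definition. It assumes (hypothetically) that each class's interval on an attribute is `mean ± k * std`, that an attribute's usability is the fraction of samples whose value falls in their own class's interval and in no other, and that the dataset score is the plain average over attributes. The function names and the parameter `k` are invented for this example.

```python
from statistics import mean, stdev


def class_intervals(values, labels, k=2.0):
    """Per-class interval [mean - k*std, mean + k*std] for one attribute.

    ASSUMPTION: this interval form is a stand-in; the paper's actual
    classification confidence interval is not given in the abstract.
    """
    by_class = {}
    for v, y in zip(values, labels):
        by_class.setdefault(y, []).append(v)
    intervals = {}
    for y, vs in by_class.items():
        m = mean(vs)
        s = stdev(vs) if len(vs) > 1 else 0.0  # stdev needs >= 2 points
        intervals[y] = (m - k * s, m + k * s)
    return intervals


def attribute_usability(values, labels, k=2.0):
    """Fraction of samples covered only by their own class's interval.

    ASSUMPTION: a hypothetical score, used here to illustrate how
    interval-based granules can expose class overlap on one attribute.
    """
    intervals = class_intervals(values, labels, k)
    hits = 0
    for v, y in zip(values, labels):
        covering = [c for c, (lo, hi) in intervals.items() if lo <= v <= hi]
        if covering == [y]:
            hits += 1
    return hits / len(values)


def dataset_usability(columns, labels, k=2.0):
    """Average of the per-attribute scores (an assumed aggregation)."""
    scores = [attribute_usability(col, labels, k) for col in columns]
    return sum(scores) / len(scores)


labels = ["a", "a", "a", "b", "b", "b"]
separable = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]  # classes occupy disjoint ranges
mixed = [1.0, 5.0, 1.2, 4.8, 0.8, 5.2]      # classes overlap completely

print(attribute_usability(separable, labels))
print(attribute_usability(mixed, labels))
print(dataset_usability([separable, mixed], labels))
```

Under these assumptions, an attribute whose class intervals do not overlap scores high, while an attribute on which the classes are interleaved scores low; the paper's own model may weight or combine attributes differently.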