Computer Science ›› 2019, Vol. 46 ›› Issue (1): 78-85. doi: 10.11896/j.issn.1002-137X.2019.01.012

• CCDM2018

Confidence Interval Method for Classification Usability Evaluation of Data Sets

TAN Xun-tao, GU Yi-yi, RUAN Tong, YUAN Yu-bo   

  1. (Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai 200237, China)
  • Received: 2018-06-08 Online: 2019-01-15 Published: 2019-02-25

Abstract: Evaluating the usability of training data sets effectively has long been a difficult problem, and it hinders the application of intelligent classification systems. Addressing data classification in machine learning, this paper proposes a method, based on interval analysis and information granulation, for evaluating the classification usability of a data set, that is, for measuring its separability. In this method, a data set is defined as a classification information system, the concept of a classification confidence interval is introduced, and information granulation is carried out by interval analysis. Under this granulation strategy, the paper defines a mathematical model of classification usability and gives methods for computing the classification usability of a single attribute and of the whole data set. Eighteen standard UCI data sets were selected as evaluation objects and their classification usability was evaluated; three classifiers were then used to classify the same data sets. Analysis of the experimental results verifies the effectiveness and feasibility of the evaluation method.
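
The abstract does not give the paper's formulas, so the following is only a minimal sketch of the general idea under loudly stated assumptions: each class gets an interval of mean ± k standard deviations on an attribute (an assumed form of "classification confidence interval"), the per-attribute score is the fraction of samples covered by their own class interval and by no other, and the data-set score is the plain average over attributes. The names `class_intervals`, `attribute_usability`, `dataset_usability` and the width `k` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from statistics import mean, pstdev


def class_intervals(values, labels, k=2.0):
    """Assumed interval per class: [mean - k*std, mean + k*std] on one attribute."""
    by_class = defaultdict(list)
    for v, y in zip(values, labels):
        by_class[y].append(v)
    return {
        y: (mean(vs) - k * pstdev(vs), mean(vs) + k * pstdev(vs))
        for y, vs in by_class.items()
    }


def attribute_usability(values, labels, k=2.0):
    """Fraction of samples covered by their own class interval and no other.

    This scoring rule is an illustrative assumption, not the paper's model.
    """
    intervals = class_intervals(values, labels, k)
    hits = 0
    for v, y in zip(values, labels):
        covering = [c for c, (lo, hi) in intervals.items() if lo <= v <= hi]
        if covering == [y]:
            hits += 1
    return hits / len(values)


def dataset_usability(rows, labels, k=2.0):
    """Average of the per-attribute scores (an assumed aggregation)."""
    columns = list(zip(*rows))
    scores = [attribute_usability(col, labels, k) for col in columns]
    return sum(scores) / len(scores)


# Toy data: attribute 0 separates the classes, attribute 1 does not.
rows = [(1.0, 5.0), (1.2, 5.1), (0.8, 4.9), (3.0, 5.0), (3.2, 5.1), (2.8, 4.9)]
labels = ["a", "a", "a", "b", "b", "b"]
col0, col1 = zip(*rows)
print(attribute_usability(col0, labels))  # separable attribute
print(attribute_usability(col1, labels))  # fully overlapping attribute
print(dataset_usability(rows, labels))
```

On this toy data the class intervals on the first attribute do not overlap, while on the second attribute they coincide, so the two attributes score at opposite ends of the range. Other interval widths or aggregation rules are equally plausible readings of the abstract.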

Key words: Classification system, Classification usability, Data usability, Information granulation, Interval analysis

CLC Number: 

  • TP391