Computer Science ›› 2022, Vol. 49 ›› Issue (1): 80-88. doi: 10.11896/jsjkx.210200124

Special Issue: Big Data & Data Science

• Database & Big Data & Data Science •

Imbalanced Data Classification: A Survey and Experiments in Medical Domain

JIANG Hao-chen1, WEI Zi-qi1, LIU Lin1, CHEN Jun2   

  1 School of Software, Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China
  2 Baidu Corp., Beijing 100094, China
  • Received: 2021-02-20  Revised: 2021-05-20  Online: 2022-01-15  Published: 2022-01-18
  • About author: JIANG Hao-chen, born in 1996, postgraduate, is a student member of China Computer Federation. His main research interests include big data techniques in health care and imbalanced data analysis.
    LIU Lin, born in 1973, Ph.D., is a member of China Computer Federation. Her main research interests include software requirements engineering and health data analytics.
  • Supported by:
    Baidu-Tsinghua Collaborated Medical AI Project.

Abstract: In recent years, AI technology has been widely adopted in many application domains, among which intelligent medical applications such as clinical decision support systems have attracted much attention. However, since the current wave of AI applications is based on predictive models crystallized from historical data, the characteristics and quality of the data directly affect the performance of these applications. Medical data are inherently imbalanced: rare disease cases are always scarce in existing case archives, yet they are considered more important. The “data imbalance problem” is still considered a difficult research problem in machine learning. This paper conducts a literature review of research efforts on techniques for handling “imbalanced data” in general as well as in the intelligent medical area. We then use research publications from the SIGKDD conference, which is dedicated to knowledge discovery and data mining, as a sample pool to find the preferred approaches for addressing the “imbalanced data” problem in a given domain. Finally, based on the approaches identified in the survey, we conduct experiments on two typical medical predictive model learning scenarios to validate the know-how acquired in this study.

Key words: Data analysis, Imbalanced datasets, Intelligent medical, Over-sampling

CLC Number: 

  • TP311.5