Computer Science (计算机科学) ›› 2022, Vol. 49 ›› Issue (1): 80-88. doi: 10.11896/jsjkx.210200124

Special topic: Big Data & Data Science (virtual special issue)

• Database & Big Data & Data Science •

A Survey of Classic Methods for Imbalanced Data Classification and an Experimental Analysis for the Medical Domain

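JIANG Hao-chen1, WEI Zi-qi1, LIU Lin1, CHEN Jun2

  1. 1 School of Software, Tsinghua University; Beijing National Research Center for Information Science and Technology, Beijing 100084
    2 Baidu Corp., Beijing 100094
  • Received: 2021-02-20 Revised: 2021-05-20 Online: 2022-01-15 Published: 2022-01-18
  • Corresponding author: LIU Lin (linliu@tsinghua.edu.cn)
  • About author: jhc18@mails.tsinghua.edu.cn
  • Funding:
    Baidu-Tsinghua Collaborated Medical AI Project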

Imbalanced Data Classification: A Survey and Experiments in Medical Domain

JIANG Hao-chen1, WEI Zi-qi1, LIU Lin1, CHEN Jun2   

  1. 1 School of Software, Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China
    2 Baidu Corp., Beijing 100094, China
  • Received: 2021-02-20 Revised: 2021-05-20 Online: 2022-01-15 Published: 2022-01-18
  • About author: JIANG Hao-chen, born in 1996, postgraduate, is a student member of China Computer Federation. His main research interests include big data techniques in health care and imbalanced data analysis.
    LIU Lin, born in 1973, Ph.D., is a member of China Computer Federation. Her main research interests include software requirements engineering and health data analytics.
  • Supported by:
    Baidu-Tsinghua Collaborated Medical AI Project.

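Abstract: In recent years, artificial intelligence technology has been widely applied in many fields. Among them, smart healthcare scenarios have received broad attention and have produced many practical applications for clinical decision support and treatment recommendation. However, because AI technology essentially works by extracting patterns from large amounts of real data in order to predict unknown cases, the characteristics and quality of the real data directly affect how well AI applications perform. Compared with other intelligent application domains, medical data are naturally imbalanced because patients with rare diseases always make up a very small fraction of the population, and highly imbalanced data are considered difficult to learn from in machine learning. In response to this situation, this paper first conducts a literature survey on the “data imbalance” problem, seeking general solutions that can guide applications in smart healthcare settings. Then, taking recent work involving imbalanced datasets from the data mining conference SIGKDD (ACM SIGKDD Conference on Knowledge Discovery and Data Mining) as the analysis sample, it compiles statistics on which handling methods people tend to choose for the “data imbalance” problem in specific domains. Finally, the knowledge and methods obtained from the survey are applied experimentally in two typical medical data analysis scenarios, thereby verifying the usability of the methods derived from the survey and the statistical analysis.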

Keywords: Imbalanced datasets, Over-sampling, Data analysis, Smart healthcare

Abstract: In recent years, AI technology has been widely adopted in many application domains, among which intelligent medical applications such as clinical decision support systems have attracted much attention. However, since the current wave of AI applications is based on predictive models crystallized from historical data, the features and quality of the data directly affect the performance of AI applications. Medical data are inherently imbalanced, as rare disease cases are always scarce in existing case archives even though they are considered more important. The “data imbalance problem” is still considered a difficult research problem in machine learning. This paper conducts a literature review of research efforts on techniques for handling “imbalanced data” in general as well as in the intelligent medical area. We then use research publications from the SIGKDD conference, dedicated to knowledge discovery and data mining, as a sample pool to find people's preferred approaches to the “imbalanced data” problem in a given domain. Finally, based on the approaches identified in the survey, we conduct experiments on two typical medical predictive model learning scenarios to validate the know-how acquired in this study.

Key words: Data analysis, Imbalanced datasets, Intelligent medical, Over-sampling

CLC Number: 

  • TP311.5