Computer Science ›› 2022, Vol. 49 ›› Issue (7): 73-78. doi: 10.11896/jsjkx.210500092

• Database & Big Data & Data Science

Two-stage Deep Feature Selection Extraction Algorithm for Cancer Classification

HU Yan-yu, ZHAO Long, DONG Xiang-jun

  1. College of Computer Science and Technology, Qilu University of Technology, Jinan 250353, China
  • Received: 2021-05-03 Revised: 2021-09-09 Online: 2022-07-15 Published: 2022-07-12
  • Corresponding author: ZHAO Long (zhaolong@qlu.edu.cn)
  • About author: (1043119207@stu.qlu.edu.cn) HU Yan-yu, born in 1996, master. Her main research interests include deep feature selection.
    ZHAO Long, born in 1984, Ph.D, lecturer, master supervisor. His main research interests include image processing, machine learning and knowledge discovery.
  • Supported by:
    National Natural Science Foundation of China (62076143, 61806105) and Natural Science Foundation of Shandong Province (ZR2017LF020).

Abstract: Cancer is one of the deadliest diseases in the world. Applying machine learning to gene microarray data plays an important role in assisting the early diagnosis of cancer. In microarray datasets, however, the number of gene features far exceeds the number of samples, which leaves the data imbalanced and degrades both the efficiency and the accuracy of classification, so feature selection on gene array data is particularly important. Most existing feature selection algorithms select features under a single criterion, seldom consider feature extraction, and mostly rely on long-established neural networks, so their classification accuracy is low. This paper therefore proposes a two-stage deep feature selection (TSDFS) algorithm. The first stage integrates three feature selection algorithms to perform comprehensive feature selection and obtain a feature subset. The second stage uses an unsupervised neural network to obtain the best representation of that feature subset and thereby improve the final classification accuracy. The effectiveness of TSDFS is analyzed by comparing classification performance before and after feature selection and by comparing it with other feature selection algorithms. Experimental results show that TSDFS reduces the number of features while maintaining or improving classification accuracy.

Key words: Deep learning, Feature selection, Microarray data, Random forest, Variational auto-encoder
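
The two-stage idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and every concrete choice in it is an assumption: the abstract does not name the three stage-one selectors, so the sketch uses an ANOVA F-score, a basic Relief (one nearest hit and one nearest miss, a simplified relative of ReliefF), and absolute Pearson correlation with the label as a stand-in for a model-based importance score. Taking the union of each selector's top `k` features is likewise an assumed aggregation rule. For stage two, a plain PCA projection stands in for the unsupervised neural network, since a neural encoder would need a deep learning library. The function names `select_union` and `pca_project` are hypothetical.

```python
import numpy as np


def anova_f_scores(X, y):
    """One-way ANOVA F statistic of every feature column against the labels."""
    classes = np.unique(y)
    n, k = X.shape[0], len(classes)
    grand_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        between += len(Xc) * (Xc.mean(axis=0) - grand_mean) ** 2
        within += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    return (between / (k - 1)) / (within / (n - k) + 1e-12)


def relief_scores(X, y):
    """Basic Relief with one nearest hit and one nearest miss per sample.

    Assumption: a simplified relative of ReliefF; needs >= 2 samples per class.
    """
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0
    Z = (X - lo) / span  # scale every feature to [0, 1]
    weights = np.zeros(X.shape[1])
    for i in range(len(Z)):
        dist = np.abs(Z - Z[i]).sum(axis=1)
        dist[i] = np.inf  # never pick the sample itself
        same = y == y[i]
        hit = np.argmin(np.where(same, dist, np.inf))
        miss = np.argmin(np.where(~same, dist, np.inf))
        weights += np.abs(Z[i] - Z[miss]) - np.abs(Z[i] - Z[hit])
    return weights / len(Z)


def abs_correlation_scores(X, y):
    """Absolute Pearson correlation of each feature with the label (stand-in ranker)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
    return np.abs(Xc.T @ yc) / denom


def select_union(X, y, k=5):
    """Stage 1 (assumed rule): union of the top-k features of each ranker."""
    chosen = set()
    for ranker in (anova_f_scores, relief_scores, abs_correlation_scores):
        top = np.argsort(ranker(X, y))[::-1][:k]
        chosen.update(int(j) for j in top)
    return sorted(chosen)


def pca_project(X, n_components=2):
    """Stage 2 stand-in: linear PCA projection instead of a neural encoder."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T
```

Calling `select_union(X, y, k)` on a samples-by-genes matrix returns the indices of the retained features, and `pca_project(X[:, selected])` yields a compact representation that could then be passed to any classifier. Swapping the PCA step for a trained encoder changes only the second function.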

CLC Number: TP302