Computer Science, 2021, Vol. 48, Issue (8): 53-59. doi: 10.11896/jsjkx.200700211

• Database & Big Data & Data Science •


Structure Preserving Unsupervised Feature Selection Based on Autoencoder and Manifold Regularization

YANG Lei, JIANG Ai-lian, QIANG Yan   

  1. College of Information and Computer, Taiyuan University of Technology, Jinzhong, Shanxi 030600, China
  • Received: 2020-07-31  Revised: 2020-09-03  Published: 2021-08-10
  • Corresponding author: JIANG Ai-lian (ailianjiang@126.com)
  • About author: YANG Lei, born in 1996, postgraduate. Her main research interests include machine learning and feature selection. (yanglei_l@163.com) JIANG Ai-lian, born in 1969, Ph.D, associate professor, is a member of China Computer Federation. Her main research interests include big data analysis and processing, feature selection, artificial intelligence and computer vision.
  • Supported by:
    National Natural Science Foundation of China (61872261) and Scientific Research Funding Project for Returned Overseas Scholars in Shanxi Province (2017-051).

Abstract: High-dimensional data contain a large number of redundant and irrelevant features, which seriously affect the efficiency and quality of data mining and the generalization performance of machine learning algorithms, so feature selection has become an important research direction in computer science. This paper proposes an unsupervised feature selection algorithm that exploits the non-linear learning ability of the autoencoder. First, based on the reconstruction error of the autoencoder, a subset of features is selected in which each single feature contributes strongly to the reconstruction of the data. Second, the feature weights of a single-layer autoencoder are used to select the features that contribute most to the reconstruction of the other features; manifold regularization is introduced to preserve the local and non-local structure of the original data space, and an L2,1 sparse regularizer is imposed on the feature weights to improve their sparsity so that more discriminative features are selected. Finally, a new objective function is constructed and optimized with a gradient descent algorithm. Experiments are conducted on six typical data sets of different types, and the proposed algorithm is compared with five commonly used unsupervised feature selection algorithms. The experimental results verify that the proposed algorithm can effectively select important features and significantly improves classification and clustering accuracy.
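
To make the pipeline described in the abstract concrete, the following is a minimal sketch (not the authors' code) of a single-layer autoencoder whose encoder weight matrix doubles as the feature-selection weights: it is trained with a reconstruction loss, a graph-Laplacian manifold regularizer and an L2,1 row-sparsity penalty, and features are then ranked by the L2 norms of their weight rows. The RBF neighborhood graph, network size, and hyper-parameters (alpha, beta, learning rate, epochs) are illustrative assumptions, not the settings used in the paper.

```python
# Sketch of the described objective under the assumptions stated above:
# reconstruction error + manifold (graph Laplacian) regularization + L2,1 sparsity
# on the encoder weights, optimized by plain gradient descent.
import torch

def rbf_laplacian(X, sigma=1.0):
    """Unnormalized graph Laplacian L = D - S built from an RBF affinity matrix."""
    dist2 = torch.cdist(X, X) ** 2
    S = torch.exp(-dist2 / (2.0 * sigma ** 2))
    return torch.diag(S.sum(dim=1)) - S

def select_features(X, n_hidden=64, n_selected=20, alpha=0.1, beta=0.01,
                    lr=1e-3, epochs=500):
    n, d = X.shape
    W = (0.01 * torch.randn(d, n_hidden)).requires_grad_()   # encoder / feature weights
    V = (0.01 * torch.randn(n_hidden, d)).requires_grad_()   # decoder weights
    L = rbf_laplacian(X)
    opt = torch.optim.SGD([W, V], lr=lr)                     # plain gradient descent
    for _ in range(epochs):
        H = torch.relu(X @ W)                                # hidden (latent) representation
        X_hat = H @ V                                        # reconstruction
        rec = ((X - X_hat) ** 2).mean()                      # reconstruction error
        manifold = torch.trace(H.T @ L @ H) / n              # structure-preserving term tr(H^T L H)
        l21 = torch.norm(W, dim=1).sum()                     # L2,1 norm: sum of row L2 norms
        loss = rec + alpha * manifold + beta * l21
        opt.zero_grad()
        loss.backward()
        opt.step()
    scores = torch.norm(W.detach(), dim=1)                   # importance score of each original feature
    return torch.topk(scores, n_selected).indices

# Toy usage: rank 100 synthetic features and keep the top 20.
X = torch.randn(200, 100)
selected = select_features(X)
print(selected)
```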

Key words: Autoencoder, Feature selection, Manifold regularization, Structure preservation, Subspace learning

CLC Number: TP181