计算机科学 ›› 2023, Vol. 50 ›› Issue (6): 159-166.doi: 10.11896/jsjkx.220500169

• 数据库&大数据&数据科学 • 上一篇    下一篇

基于持续同调的过滤式特征选择算法

殷杏子, 彭宁宁, 詹学燕   

  1. 武汉理工大学理学院 武汉 430070
  • 收稿日期:2022-05-18 修回日期:2022-08-11 出版日期:2023-06-15 发布日期:2023-06-06
  • 通讯作者: 彭宁宁(pengn@whut.edu.cn)
  • 作者简介:(305849@whut.edu.cn)
  • 基金资助:
    国家自然科学基金(11701438)

Filtered Feature Selection Algorithm Based on Persistent Homology

YIN Xingzi, PENG Ningning, ZHAN Xueyan   

  1. School of Science,Wuhan University of Technology,Wuhan 430070,China
  • Received:2022-05-18 Revised:2022-08-11 Online:2023-06-15 Published:2023-06-06
  • About author:YIN Xingzi,born in 1998,postgraduate.Her main research interest is topology data analysis.PENG Ningning,born in 1985,associate professor,Ph.D.His main research interests include mathematical logic,computability theory and topology data analysis.
  • Supported by:
    National Natural Science Foundation of China(11701438).

摘要: 现有的过滤式特征选择算法忽略了特征之间的关联性。鉴于此,提出了一种新的过滤式特征选择算法——基于持续同调的特征选择算法(Rel-Betti算法),该算法能够识别特征之间的关联性以及组合效果。通过提出相关贝蒂数概念,筛选出数据集中重要的拓扑特征信息。该算法对数据集进行预处理后,根据类标签将数据集分类,计算不同类中的相关贝蒂数,获得数据信息的特征均值,按特征均值差值大小对特征进行重要性排序。利用UCI数据集中的8个数据,将该算法与其他常见算法在决策树、随机森林、K近邻和支持向量机这4种学习模型下进行比较实验。结果表明,该算法是一种有效的特征选择算法,其能够提高分类的准确率和F1值,并且不依赖于特定的机器学习模型。

关键词: 特征选择, 持续同调, 条形码, 贝蒂数, 机器学习

Abstract: The existing filtering feature selection algorithm ignores the correlation between features.This paper proposes a new filtering feature selection algorithm —the feature selection algorithm based on persistent homology(Rel-Betti algorithm),which can consider the correlation between features and the combined effect.This paper gives a new definition named by relevant Betti numbers,which can filter out the important topological features in the dataset.The algorithm first preprocesses the data set,classifies the data set according to the class labels,calculates the relevant Betti numbers in different classes,obtains the feature mean of the data information,and uses the feature mean difference to sort the importance of the features.Using eight data in UCI,the algorithm is compared with other common algorithms under four learning models:decision tree,random forest,K-nearest neighbor and support vector machine.Experimental results show that the Rel-Betti algorithm is an effective method that can improve classification accuracy and F1 value,and does not depend on a specific machine learning model.

Key words: Feature selection, Persistent homology, Barcode, Betti number, Machine learning

中图分类号: 

  • O189.22
[1]SHI Q J,PAN F,LONG F H,et al.A Review of Research on Feature Selection Methods[J].Microelectronics and Computers,2022,39(3):1-8.
[2]YU L,LIU H.Efficient feature selection via analysis of rele-vance and redundancy[J].The Journal of Machine Learning Research,2004,5:1205-1224.
[3]DROTÁR P,GAZDA J,SMÉKAL Z.An experimental comparison of feature selection methods on two-class biomedical datasets[J].Computers in Biology and Medicine,2015,66:1-10.
[4]BOMMERT A,SUN X,BISCHL B,et al.Benchmark for filter methods for feature selection in high-dimensional classification data[J].Computational Statistics & Data Analysis,2020,143:106839.
[5]JI Z W,HU M.A Double Filtering Feature Selection Algorithm [J].Computer Engineering and Applications,2011,47(19):190-193,206.
[6]XU Y,HU X G,LI P P.A Filtered Feature Selection Algorithm Based on Group Policy[J].Application Research of Computers,2016,33(5):1322-1326.
[7]SINGH N,SINGH P.A hybrid ensemble-filter wrapper feature selection approach for medical data classification[J].Chemome-trics and Intelligent Laboratory Systems,2021,217:104-131.
[8]PRISCILLA C V,PRABHA D P.A two-phase feature selection technique using mutual information and XGB-RFE for credit card fraud detection[J].International Journal of Advanced Technology and Engineering Exploration,2021,8(85):1656.
[9]PASHAEI E,PASHAEI E.Hybrid binary arithmetic optimization algorithm with simulated annealing for feature selection in high-dimensional biomedical data[J].The Journal of Supercomputing,2022:78(13):15598-15637.
[10]YANG Y K.Research on filtering feature selection algorithm based on mutual information[D].Changchun:Jilin University,2022.
[11]LUM P Y,SINGH G,LEHMAN A,et al.Extracting insights from the shape of complex data using topology[J].Scientific Reports,2013,3(1):1-8.
[12]ZOMORODIAN A,CARLSSON G.Computing persistent ho-mology[C]//Proceedings of the twentieth Annual Symposium on Computational Geometry.2004:347-356..
[13]CARLSSON G.Topology And Data[J].Bulletin of the American Mathematical Society,2009,46(2):255-308.
[14]LOCKWOOD S,KRISHNAMOORTHY B.Topological Fea-tures in Cancer Gene Expression Data[J].Pacific Symposium on Biocomputing Pacific Symposium on Biocomputing,2015,20(7):108-119.
[15]CANG Z X,MU L,WEI G W,et al.Representability of alge-braic topology for biomolecules in machine learning based scoring and virtual screening[J].Plos Computational Biology,2018,14(1):e1005929.
[16]CHI S P,XIA K,SI X L.Persistent-Homology-based Machine Learning and its Applications-A Survey[J].Artificial Intelligence Review,2022,55(7):5169-5213.
[17]OTTER N,PORTER M A,TILLMANN U,et al.A roadmap for the computation of persistent homology[J].Epj Data Science,2017,6(1):1-38.
[18]GHRIST R.Barcodes:The persistent topology of data[J].Bulletin of the American Mathematical Society,2008,45(1):61-75.
[19]GAO B L,ZHOU Z G,YANG W W,et,al.A Feature Selection Method Combining Category-Based and Improved CHI[J].Application Research of Computers,2018,35(6):1660-1662.
[20]WANG Y R,TANG M.Evaluation of side channel leakage basedon Bartlett and multi-class F-test [J].Journal of Communications,2021,42(12):35-43.
[21]WU Y,LIU Y H,YANG W W,et,al.Feature Selection Method Based on Nearest Farthest Neighbor and Mutual Information[J].Research of Computers 2017,34(12):3713-3716.
[22]WANG W Y,LIU C,ZHAO Q,et,al.Direct Verified Encapsulated Feature Selection Method [J].University of Electronic Science and Technology of China Journal,2016,45(4):607-615.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!