Computer Science ›› 2022, Vol. 49 ›› Issue (11): 98-108. doi: 10.11896/jsjkx.210900076

• Database & Big Data & Data Science •

Incremental Feature Selection Algorithm for Dynamic Partially Labeled Hybrid Data

YAN Zhen-chao, SHU Wen-hao, XIE Xin

  1. School of Information Engineering, East China Jiaotong University, Nanchang 330013, China
  • Received: 2021-09-09 Revised: 2021-12-28 Online: 2022-11-15 Published: 2022-11-03
  • Corresponding author: SHU Wen-hao (shuwenhao@126.com)
  • About author: YAN Zhen-chao, born in 1997, postgraduate. His main research interests include granular computing, knowledge discovery, data mining, etc.
    SHU Wen-hao, born in 1985, Ph.D., associate professor, master's supervisor. Her main research interests include data mining, knowledge discovery, etc.
  • Author e-mail: zhenchao_yan@163.com
  • Supported by:
    National Natural Science Foundation of China (61662023, 61762037) and Natural Science Foundation of Jiangxi Province (20202BABL202037).

Abstract: Many real-world data sets are hybrid data composed of symbolic, numerical and missing-valued features. Acquiring decision labels for all of the data costs considerable labor and time, so only part of the data can be labeled, which yields partially labeled data. Meanwhile, data in real-world applications are generated dynamically, i.e., features are added to or deleted from the feature set as requirements change. To address the high dimensionality, partial labeling and dynamic nature of hybrid data, this paper proposes two incremental feature selection algorithms for partially labeled hybrid data. Firstly, information granularity is used to analyze feature significance for partially labeled hybrid data. Secondly, following the idea of incremental learning, incremental updating mechanisms for information granularity are given for the case where the feature set changes dynamically. On this basis, two incremental feature selection algorithms for partially labeled hybrid data are proposed. Finally, comparisons with other algorithms on UCI data sets verify the feasibility and effectiveness of the proposed algorithms.

Key words: Hybrid data, Partially labeled, Incremental learning, Information granularity, Feature selection
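
The abstract does not state the paper's definitions, so the following is only a minimal sketch of the general idea, under assumptions that are not taken from the paper: a tolerance relation in which a missing value matches anything, symbolic values match when equal, and numerical values match within a hypothetical threshold `delta`; and the common knowledge-granularity measure, i.e. the sum of tolerance-class sizes divided by |U|^2. Decision labels are not modeled. All function names are illustrative. The sketch shows why adding a feature can be handled incrementally: the relation of an enlarged feature set is the intersection of the stored relation with the new feature's relation.

```python
import numpy as np

MISSING = None  # assumed marker for a missing value (not from the paper)


def feature_relation(column, numeric, delta=0.1):
    """Boolean n x n tolerance matrix induced by one feature."""
    n = len(column)
    rel = np.ones((n, n), dtype=bool)
    for i in range(n):
        for j in range(n):
            a, b = column[i], column[j]
            if a is MISSING or b is MISSING:
                continue  # assumption: a missing value is compatible with any value
            rel[i, j] = abs(a - b) <= delta if numeric else a == b
    return rel


def granularity(rel):
    """Sum of tolerance-class sizes divided by |U|^2 (coarser relation, larger value)."""
    n = rel.shape[0]
    return rel.sum() / (n * n)


def add_feature(rel, column, numeric, delta=0.1):
    """Incremental step: intersect the stored relation with the new feature's relation."""
    return rel & feature_relation(column, numeric, delta)


# Toy data: one symbolic feature with a missing value, one numerical feature.
color = ["r", "r", "b", MISSING]
size = [0.10, 0.15, 0.90, 0.85]

rel = feature_relation(color, numeric=False)
g_before = granularity(rel)            # 0.75
rel = add_feature(rel, size, numeric=True)
g_after = granularity(rel)             # 0.5
print(g_before, g_after, g_before - g_after)
```

In this sketch the drop in granularity after adding a feature (here 0.25) can be read as a rough indication of how much the new feature refines the data. Deleting a feature cannot be undone by a single intersection, so a practical variant would likely need to keep per-feature relations or recompute; how the paper handles that case is not stated in the abstract.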

CLC Number:

  • TP391