Semi-supervised Feature Selection Algorithm Based on Information Entropy

Abstract

Abstract: In many practical applications, determining the class labels of a dataset is usually expensive, so researchers can label only a small fraction of the data. To address this "small labeled data" problem, this paper proposes an entropy-based rough feature selection algorithm built on rough set theory and the concept of information entropy. By analyzing the information entropy of the labeled data and of the unlabeled data in a given dataset, the information entropy of the whole dataset is redefined. On this basis, an entropy-based feature significance is defined in the semi-supervised sense, and a semi-supervised rough feature selection algorithm is designed that can effectively handle datasets containing only a small amount of labeled data. Experimental results further verify the feasibility and efficiency of the proposed algorithm.

Key words: Feature selection, Information entropy, Semi-supervised, Small labeled data
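The abstract does not give the paper's formulas, so the following Python sketch only illustrates the general idea of an entropy-based semi-supervised feature significance; it is not the authors' definition. It assumes categorical features, uses the conditional entropy of the labels on the labeled rows, uses the partition entropy lost relative to the full feature set on the unlabeled rows, and combines the two weighted by sample counts. The function names (`partition_entropy`, `conditional_entropy`, `semi_entropy`, `select_features`) and the weighting scheme are hypothetical choices made for illustration.

```python
from collections import Counter
from math import log2


def partition_entropy(rows, features):
    """Shannon entropy of the partition that `features` induces on `rows`."""
    n = len(rows)
    if n == 0:
        return 0.0
    blocks = Counter(tuple(r[f] for f in features) for r in rows)
    return -sum(c / n * log2(c / n) for c in blocks.values())


def conditional_entropy(rows, labels, features):
    """H(label | features), estimated on the labeled rows."""
    n = len(rows)
    if n == 0:
        return 0.0
    blocks = Counter(tuple(r[f] for f in features) for r in rows)
    joint = Counter((tuple(r[f] for f in features), y)
                    for r, y in zip(rows, labels))
    return -sum(c / n * log2(c / blocks[key]) for (key, _), c in joint.items())


def semi_entropy(labeled, labels, unlabeled, subset, all_features):
    """Assumed combined measure (lower is better), weighted by sample counts.

    Labeled part: uncertainty about the class given `subset`.
    Unlabeled part: partition information lost by using `subset`
    instead of the full feature set.
    """
    n_l, n_u = len(labeled), len(unlabeled)
    h_l = conditional_entropy(labeled, labels, subset)
    h_u = (partition_entropy(unlabeled, all_features)
           - partition_entropy(unlabeled, subset))
    return (n_l * h_l + n_u * h_u) / (n_l + n_u)


def select_features(labeled, labels, unlabeled, all_features, tol=1e-9):
    """Greedy forward selection: add the feature that lowers the combined
    entropy most, until the subset matches the full feature set."""
    selected = []
    target = semi_entropy(labeled, labels, unlabeled, all_features, all_features)
    current = semi_entropy(labeled, labels, unlabeled, selected, all_features)
    while current - target > tol and len(selected) < len(all_features):
        remaining = [f for f in all_features if f not in selected]
        best = min(remaining, key=lambda f: semi_entropy(
            labeled, labels, unlabeled, selected + [f], all_features))
        selected.append(best)
        current = semi_entropy(labeled, labels, unlabeled, selected, all_features)
    return selected


# Toy data (invented for illustration): feature 0 determines the label.
labeled_rows = [(0, 0, 1), (0, 1, 1), (1, 0, 0), (1, 1, 0)]
row_labels = [0, 0, 1, 1]
unlabeled_rows = [(0, 0, 1), (1, 1, 0)]
print(select_features(labeled_rows, row_labels, unlabeled_rows, [0, 1, 2]))
```

In this sketch the significance of a candidate feature is the drop in `semi_entropy` when it is added to the current subset, which is the quantity the greedy loop maximizes at each step.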

CLC Number: TP181

WANG Feng, LIU Ji-chao, WEI Wei. Semi-supervised Feature Selection Algorithm Based on Information Entropy[J]. Computer Science, 2018, 45(11A): 427-430. https://doi.org/


