Computer Science, 2018, Vol. 45, Issue (6): 228-234. doi: 10.11896/j.issn.1002-137X.2018.06.041
陈福才, 李思豪, 张建朋, 黄瑞阳
CHEN Fu-cai, LI Si-hao, ZHANG Jian-peng, HUANG Rui-yang
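Abstract: Multi-label feature selection is one of the main methods for coping with the curse of dimensionality: it reduces feature dimensionality while improving learning efficiency and classification performance. Existing feature selection algorithms do not take the relationships among labels into account, and the range over which they measure information content is biased. To address these problems, an improved multi-label feature selection algorithm based on label relationships is proposed. First, symmetric uncertainty is introduced to normalize the information content. The normalized mutual information is then used as the correlation measure, and on this basis an importance weight is defined for each label and used to weight the label-related terms in the dependency and redundancy. A feature scoring function is then proposed as the criterion for feature importance, and the highest-scoring features are selected one at a time to form the optimal feature subset. Experimental results show that, compared with other algorithms, the proposed algorithm extracts a more precise low-dimensional feature subset, which not only effectively improves the performance of multi-label learning algorithms for entity information mining but also improves the efficiency of multi-label learning algorithms based on discrete features.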
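The abstract gives the pipeline but not its formulas, so the following is only a minimal sketch of the general idea and not the authors' implementation. It assumes discrete features and labels, uses the standard definition of symmetric uncertainty, SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)), and makes two illustrative choices that are assumptions here: a label's weight is its summed SU with the other labels (normalized to sum to 1), and the score is weighted dependency minus the mean SU with already selected features. The label-conditional redundancy term mentioned in the abstract is omitted for brevity. The names `label_weights` and `select_features` are hypothetical.

```python
import math
from collections import Counter


def entropy(xs):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())


def mutual_information(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y) for two aligned discrete sequences."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def symmetric_uncertainty(xs, ys):
    """Mutual information normalized to [0, 1]: 2 * I(X; Y) / (H(X) + H(Y))."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 0.0  # both variables are constant
    return 2.0 * mutual_information(xs, ys) / (hx + hy)


def label_weights(labels):
    """Assumed weighting: a label's summed SU with all other labels, normalized."""
    q = len(labels)
    raw = [
        sum(symmetric_uncertainty(labels[i], labels[j]) for j in range(q) if j != i)
        for i in range(q)
    ]
    total = sum(raw)
    if total == 0:
        return [1.0 / q] * q  # fall back to uniform weights
    return [r / total for r in raw]


def select_features(features, labels, k):
    """Greedily pick k feature indices by (weighted dependency - mean redundancy)."""
    weights = label_weights(labels)
    selected = []
    remaining = list(range(len(features)))

    def score(f):
        # Dependency: label-weighted SU between the candidate and each label.
        dependency = sum(
            w * symmetric_uncertainty(features[f], y) for w, y in zip(weights, labels)
        )
        # Redundancy (simplified assumption): mean SU with selected features.
        redundancy = 0.0
        if selected:
            redundancy = sum(
                symmetric_uncertainty(features[f], features[s]) for s in selected
            ) / len(selected)
        return dependency - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

Here `features` and `labels` are lists of columns, one sequence of discrete values per feature or label, and `select_features(features, labels, k)` returns the indices of the chosen features in selection order. Normalizing by H(X) + H(Y) keeps every correlation term on the same [0, 1] scale, which is the role the abstract assigns to symmetric uncertainty.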