Computer Science ›› 2018, Vol. 45 ›› Issue (6): 228-234. doi: 10.11896/j.issn.1002-137X.2018.06.041

• Artificial Intelligence •

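Multi-label Feature Selection Algorithm Based on Improved Label Correlation

CHEN Fu-cai, LI Si-hao, ZHANG Jian-peng, HUANG Rui-yang

  1. National Digital Switching System Engineering and Technological R&D Center, Zhengzhou 450002, China
  • Received: 2017-04-25 Online: 2018-06-15 Published: 2018-07-24
  • About the authors: CHEN Fu-cai (born 1974), male, research fellow and master's supervisor; his main research interests are network big data analysis and telecommunication network information security. E-mail: 1242100831@qq.com (corresponding author). LI Si-hao (born 1991), male, master's student; his main research interest is network big data analysis. E-mail: michaelbournelisihao@outlook.com. ZHANG Jian-peng (born 1988), male, PhD candidate; his main research interests are network big data analysis and data stream mining. HUANG Rui-yang (born 1986), male, PhD, assistant research fellow; his main research interest is network big data analysis.
  • Funding:
    This work was supported by the National Key Research and Development Program of China (2016YFB0800101) and the Foundation for Innovative Research Groups of the National Natural Science Foundation of China (61521003).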

Abstract: Multi-label feature selection is one of the main methods for coping with the curse of dimensionality: it reduces the feature dimension while improving learning efficiency and classification performance. However, existing feature selection algorithms do not take the correlation between labels into account, and the range of their information measure is biased across data sets. To address these problems, this paper proposes a multi-label feature selection algorithm based on improved label correlation. The algorithm first introduces symmetrical uncertainty to normalize the information measure, then uses the normalized mutual information as the measure of correlation and defines an importance weight for each label on that basis. These weights are applied to the label-related terms in the dependency and the redundancy. A feature scoring function is then proposed as the criterion of feature importance, and the highest-scoring features are selected one at a time to form the best feature subset. Experimental results show that, compared with other algorithms, the proposed algorithm extracts a more accurate low-dimensional feature subset, which not only effectively improves the performance of multi-label learning algorithms for entity information mining but also improves the efficiency of multi-label learning algorithms based on discrete features.

Key words: Dependency, Feature score, Label correlation, Multi-label feature selection, Redundancy
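
The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's formulas, so everything beyond the standard definition of symmetrical uncertainty, SU(X,Y) = 2·I(X;Y) / (H(X) + H(Y)), is an assumption made for illustration: the label weight (summed SU with the other labels, normalized), the score (weighted dependency minus mean redundancy, in the style of max-dependency/min-redundancy methods), and the function names `label_weights` and `select_features` are hypothetical. In particular, the sketch weights only the dependency term, whereas the paper also weights the label-related terms in the redundancy.

```python
import math
from collections import Counter


def entropy(xs):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two aligned discrete sequences."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def symmetrical_uncertainty(xs, ys):
    """SU(X,Y) = 2*I(X;Y) / (H(X) + H(Y)); normalized to the range [0, 1]."""
    denom = entropy(xs) + entropy(ys)
    return 0.0 if denom == 0 else 2.0 * mutual_information(xs, ys) / denom


def label_weights(labels):
    """ASSUMPTION (not from the paper): weight each label by its summed SU
    with all other labels, normalized to sum to 1. Falls back to uniform
    weights when no pair of labels is correlated."""
    raw = [
        sum(symmetrical_uncertainty(a, b) for j, b in enumerate(labels) if j != i)
        for i, a in enumerate(labels)
    ]
    total = sum(raw)
    if total <= 0:
        return [1.0 / len(labels)] * len(labels)
    return [r / total for r in raw]


def score(f, selected, features, labels, weights):
    """ASSUMPTION: score = label-weighted dependency - mean redundancy."""
    dep = sum(w * symmetrical_uncertainty(features[f], y)
              for w, y in zip(weights, labels))
    if not selected:
        return dep
    red = sum(symmetrical_uncertainty(features[f], features[s])
              for s in selected) / len(selected)
    return dep - red


def select_features(features, labels, k):
    """Greedily pick k feature indices, highest score first."""
    weights = label_weights(labels)
    selected, remaining = [], list(range(len(features)))
    while remaining and len(selected) < k:
        best = max(remaining,
                   key=lambda f: score(f, selected, features, labels, weights))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: each feature and label is a column of discrete values.
y0 = [0, 0, 1, 1, 0, 0, 1, 1]
y1 = [0, 1, 0, 1, 0, 1, 0, 1]
features = [y0[:], [0] * 8, y1[:], y0[:]]  # informative, constant, informative, duplicate
labels = [y0, y1]
print(select_features(features, labels, k=2))
```

On the toy data, the duplicate of the first feature is penalized by the redundancy term and the constant feature carries no information, so the two selected features are the ones that each determine a different label.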

CLC Number:

  • TP391