Computer Science (计算机科学), 2013, Vol. 40, Issue (10): 252-256.

• Artificial Intelligence •

Text Feature Selection Methods Based on Information Gain and Feature Relation Tree

REN Yong-gong, YANG Xue, YANG Rong-jie and HU Zhi-dong

  1. School of Computer and Information Technology, Liaoning Normal University, Dalian 116029
  • Online: 2018-11-16 Published: 2018-11-16
  • Funding:
    This work was supported by the Liaoning Provincial Plan Project (2012232001) and the Natural Science Foundation of Liaoning Province (201202119).

Abstract: The classification performance of the traditional information gain algorithm declines markedly when classes and feature terms are unevenly distributed. To address this shortcoming, a text feature selection algorithm based on an information gain feature relation tree (UDsIG) is proposed. First, features are selected class by class, which reduces the effect of uneven class distribution on feature selection. Second, feature distribution uniformity is used to reduce the interference caused by feature terms being unevenly distributed within a class, and a feature relation tree model is applied to the features within each class, retaining strongly correlated features and deleting weakly correlated and irrelevant ones to lower feature redundancy. Finally, an information gain formula weighted by inter-class dispersion is applied to obtain a better feature subset. Comparative experiments show that the selected features yield better classification performance.

Key words: Feature selection, Feature relation tree, Information gain, Imbalanced dataset, Dispersion
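
For orientation, the traditional information gain score that the abstract takes as its baseline can be sketched as below. This is the standard textbook definition, IG(t) = H(C) - P(t)H(C|t) - P(not t)H(C|not t), and not the UDsIG method itself, whose uniformity, relation tree and dispersion formulas are not given on this page. The function names and the toy corpus are hypothetical and serve only as illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits of a list of class labels (0.0 if empty)."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, term):
    """Standard information gain of a term with respect to the class labels.

    docs   : list of sets of terms, one set per document (assumed representation)
    labels : list of class labels, aligned with docs
    """
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    p_t = len(with_term) / len(labels)
    return (entropy(labels)
            - p_t * entropy(with_term)
            - (1 - p_t) * entropy(without_term))


# Hypothetical toy corpus: two balanced classes, four documents.
docs = [{"goal", "match"}, {"goal", "team"}, {"vote", "party"}, {"vote", "match"}]
labels = ["sport", "sport", "politics", "politics"]

ig_goal = information_gain(docs, labels, "goal")    # term separates the classes
ig_match = information_gain(docs, labels, "match")  # term is uninformative
```

In this toy corpus "goal" occurs only in one class and so scores higher than "match", which occurs equally in both. On imbalanced data the class prior dominates H(C), which is the weakness the abstract says UDsIG is designed to reduce.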

