Computer Science ›› 2012, Vol. 39 ›› Issue (11): 127-130.

• Databases and Data Mining •

Information-gain-based Text Feature Selection Method

Ren Yonggong (任永功), Yang Rongjie (杨荣杰), Yin Mingfei (尹明飞), Ma Mingwei (马名威)

  1. (School of Computer and Information Technology, Liaoning Normal University, Dalian 116029)
  • Online: 2018-11-16  Published: 2018-11-16

Abstract: When classes and features are unevenly distributed, the classification performance of the traditional information gain algorithm drops sharply. To address this shortcoming, a text feature selection method based on information gain (TDpIG) is proposed. First, features are selected class by class, which reduces the influence of dataset imbalance on feature selection. Second, the information gain weight is computed from the probability of feature occurrence, which reduces the interference of low-frequency words. Finally, dispersion is used to analyze each feature's information gain value within every class, filtering out the relatively redundant features among high-frequency words; the selected features are then further refined using the difference in information gain, yielding a uniform and accurate feature subset. Comparative experiments show that the selected features achieve better classification performance.

Key words: Feature selection, Text classification, Information gain, Redundant feature, Imbalanced dataset
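
For orientation, the traditional information gain score that the abstract takes as its baseline can be sketched as follows. This is a minimal illustration of the standard formula IG(t) = H(C) - [P(t) H(C|t) + P(¬t) H(C|¬t)] over binary term presence, not the authors' TDpIG method, whose class-wise selection, probability weighting, and dispersion filtering are not specified in the abstract. The function names (`entropy`, `information_gain`, `rank_terms`) and the toy corpus are hypothetical and chosen only for this example.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    counts = list(counts)
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """Standard IG of a term: H(C) minus the entropy of C conditioned on
    whether the term is present in a document.

    docs   -- list of token sets, one per document (assumed non-empty)
    labels -- list of class labels, aligned with docs
    """
    n = len(docs)
    all_counts = Counter(labels)
    # Class counts over the documents that contain the term.
    with_term = Counter(lab for doc, lab in zip(docs, labels) if term in doc)
    # Counter subtraction keeps only the positive remainders.
    without_term = all_counts - with_term
    n_with = sum(with_term.values())
    n_without = n - n_with
    conditional = (
        (n_with / n) * entropy(with_term.values())
        + (n_without / n) * entropy(without_term.values())
    )
    return entropy(all_counts.values()) - conditional


def rank_terms(docs, labels):
    """Vocabulary sorted by descending IG, ties broken alphabetically."""
    vocab = set().union(*docs)
    return sorted(vocab, key=lambda t: (-information_gain(docs, labels, t), t))


# Hypothetical toy corpus: two balanced classes, four documents.
docs = [
    {"goal", "match"},
    {"goal", "team"},
    {"stock", "market"},
    {"market", "team"},
]
labels = ["sports", "sports", "finance", "finance"]

ranking = rank_terms(docs, labels)
```

On this toy corpus, "goal" and "market" separate the two classes perfectly and rank highest, while "team" appears equally in both classes and carries no information. With skewed class sizes, the score is dominated by the majority class, which is the weakness the abstract attributes to the traditional algorithm.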
