Computer Science (计算机科学) ›› 2013, Vol. 40 ›› Issue (10): 252-256.
任永功,杨雪,杨荣杰,胡志冬
REN Yong-gong,YANG Xue,YANG Rong-jie and HU Zhi-dong
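Abstract: The traditional information gain algorithm suffers a marked drop in classification performance when classes and feature terms are unevenly distributed. To address this shortcoming, a text feature selection algorithm based on an information gain feature association tree (UDsIG) is proposed. First, feature selection is performed on the data set class by class, which reduces the effect of an uneven class distribution on feature selection. Second, a feature distribution uniformity measure is used to reduce the interference caused by feature terms being unevenly distributed within a class, and a feature association tree model is applied to the features within each class: strongly relevant features are retained, while weakly relevant and irrelevant features are removed, which lowers feature redundancy. Finally, an information gain formula weighted by inter-class dispersion is applied in a further calculation to obtain a better feature subset. Comparative experiments show that the selected features yield better classification performance.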
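For context, the classic information gain score that the abstract takes as its baseline can be sketched as follows. This is only the standard textbook formulation, IG(t) = H(C) - [P(t) H(C|t) + P(not t) H(C|not t)], and not the UDsIG algorithm itself, whose distribution uniformity, association tree, and inter-class dispersion weighting are not specified in the abstract. The function names, the set-of-terms document representation, and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels).values()
    return -sum((n / total) * math.log2(n / total) for n in counts)


def information_gain(docs, labels, term):
    """Classic information gain of a term, based on its presence or absence.

    docs   -- list of sets of terms, one set per document (assumed representation)
    labels -- list of class labels aligned with docs
    """
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    total = len(labels)
    # Class entropy remaining after observing whether the term occurs.
    conditional = (
        (len(with_term) / total) * entropy(with_term)
        + (len(without_term) / total) * entropy(without_term)
    )
    return entropy(labels) - conditional


# Hypothetical toy corpus: two balanced classes, four documents.
docs = [
    {"ball", "team"},
    {"ball", "score"},
    {"stock", "market"},
    {"stock", "team"},
]
labels = ["sports", "sports", "finance", "finance"]

ig_ball = information_gain(docs, labels, "ball")  # separates the classes perfectly
ig_team = information_gain(docs, labels, "team")  # occurs equally in both classes
```

In this toy corpus "ball" appears only in one class and so receives the maximum score, while "team" is spread evenly across both classes and receives a score of zero. The baseline score depends only on document-level presence counts, which is one reason it is sensitive to the skewed class and term distributions the abstract describes.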