Computer Science ›› 2016, Vol. 43 ›› Issue (10): 206-210. doi: 10.11896/j.issn.1002-137X.2016.10.039

• Artificial Intelligence •

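TSF Feature Selection Method for Imbalanced Text Sentiment Classification

WANG Jie, LI De-yu and WANG Su-ge

  1. School of Computer and Information Technology, Shanxi University, Taiyuan 030006 (WANG Jie, LI De-yu, WANG Su-ge)
  2. Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan 030006 (LI De-yu, WANG Su-ge)
  • Online: 2018-12-01 Published: 2018-12-01
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61175067,5,61573231,1,U1435212), the National High Technology Research and Development Program of China ("863" Program) (2015AA015407), the Shanxi Province research project for returned overseas scholars (2013-014), and the Shanxi Province Science and Technology Infrastructure Platform Program (2015091001-0102).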

Abstract: In imbalanced datasets, the imbalanced distribution of sample counts is often accompanied by an imbalanced distribution of features: features that appear frequently in majority-class texts rarely appear in the minority class. Based on this characteristic of the feature distribution in imbalanced data, we propose a new two-side Fisher (TSF) feature selection method. By explicitly combining positively and negatively correlated features, TSF alleviates the imbalance at the feature level and gives a better representation of the text. Experiments on book reviews and the COAE2014 microblog imbalanced dataset indicate that TSF is a feasible feature selection method for the imbalance problem.

Key words: Imbalanced data, Text sentiment classification, Positive and negative features, Two-side feature selection
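The abstract does not give the TSF scoring formula, so the following is only a minimal sketch of the general idea of two-side selection, not the authors' method. It assumes a textbook Fisher discriminant ratio, (mean difference)² / (sum of class variances), and assumes a feature counts as "positive" when it is more frequent in the minority class. The names `fisher_scores`, `two_side_select`, the `pos_fraction` parameter, and the toy data are all hypothetical.

```python
from statistics import fmean, pvariance


def fisher_scores(X, y):
    """Signed Fisher-style score per feature (an assumed formula, not the paper's).

    X: list of rows of feature values (e.g. term counts).
    y: list of 0/1 labels, where 1 is taken to be the minority class.
    Magnitude: (mean_1 - mean_0)^2 / (var_1 + var_0).
    Sign: positive if the feature is more frequent in class 1, else negative.
    """
    cls1 = [row for row, label in zip(X, y) if label == 1]
    cls0 = [row for row, label in zip(X, y) if label == 0]
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row in cls1]
        b = [row[j] for row in cls0]
        diff = fmean(a) - fmean(b)
        # Small constant avoids division by zero for constant features.
        ratio = diff * diff / (pvariance(a) + pvariance(b) + 1e-12)
        scores.append(ratio if diff >= 0 else -ratio)
    return scores


def two_side_select(scores, k, pos_fraction=0.5):
    """Pick k features, with an explicit split between the two sides.

    pos_fraction (hypothetical knob) sets how many of the k slots go to
    positively scored features; the remaining slots go to negatively scored ones.
    Returns (positive_indices, negative_indices).
    """
    k_pos = int(k * pos_fraction)
    k_neg = k - k_pos
    positive = sorted((j for j, s in enumerate(scores) if s > 0),
                      key=lambda j: scores[j], reverse=True)
    negative = sorted((j for j, s in enumerate(scores) if s < 0),
                      key=lambda j: scores[j])
    return positive[:k_pos], negative[:k_neg]


# Toy data: 2 minority-class documents (label 1), 4 majority-class documents (label 0).
X = [
    [3, 0, 1, 0],
    [2, 0, 1, 1],
    [0, 2, 1, 0],
    [0, 3, 1, 0],
    [0, 2, 1, 1],
    [1, 3, 1, 0],
]
y = [1, 1, 0, 0, 0, 0]

scores = fisher_scores(X, y)
selected_pos, selected_neg = two_side_select(scores, k=2)
print(selected_pos, selected_neg)
```

In this sketch a one-sided ranking by score magnitude alone would favor feature 1, which is typical of the majority class; reserving slots for each side keeps a minority-indicative feature in the selected set as well.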

