Computer Science, 2018, Vol. 45, Issue (1): 39-46. doi: 10.11896/j.issn.1002-137X.2018.01.006

• CRSSC-CWI-CGrC-3WD 2017 •

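Hybrid Feature Selection Method of Chinese Emotional Characteristics Based on Lasso Algorithm

LI Yan, WEI Zhi-hua and XU Kai

  1. Department of Computer Science and Technology, College of Electronics and Information Engineering, Tongji University, Shanghai 201804; Department of Computer Science and Technology, College of Electronics and Information Engineering, Tongji University, Shanghai 201804; Port and Shipping Big Data Laboratory, Shanghai International Shipping Institute, Shanghai Maritime University, Shanghai 200082
  • Online: 2018-01-15 Published: 2018-11-13
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61573259), the Shanghai Three-Year Action Plan for Further Accelerating the Development of Traditional Chinese Medicine (2014-2016) (ZY3-CCCX-3-6002), and the Fundamental Research Funds for the Central Universities (0800219302, 0800219315).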

Abstract: Sentiment polarity classification is a central problem in Chinese sentiment analysis. Sentiment feature selection is the prerequisite and foundation of machine-learning-based polarity classification; its role is to reduce the dimensionality of the feature set by removing irrelevant or redundant features. This paper proposes a hybrid sentiment feature selection method that combines the Lasso algorithm with filter-based feature selection. First, Lasso penalized regression screens the original feature set to obtain a sentiment classification feature subset with low redundancy. Second, filter methods such as CHI (chi-square statistic), MI (mutual information) and IG (information gain) are applied to this subset to score the dependency between each candidate feature word and the text category, and candidate words with low relevance are removed accordingly. Finally, the proposed method is compared with DF (document frequency), MI, IG and CHI under different numbers of feature words, using an SVM classifier with a Gaussian kernel. Experiments on a microblog short-text corpus show that the proposed algorithm is effective and efficient. Moreover, when the dimension of the feature subset is smaller than the number of samples, the hybrid method improves on the feature selection results of DF, MI, IG and CHI to some extent. A comparison of recognition rate and recall shows that Lasso-MI is more effective than MI and the other filter methods.

Key words: Chinese sentiment analysis, Feature selection, Lasso, Sentiment classification, Machine learning
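
The two-stage idea in the abstract (Lasso screening followed by a filter score) can be illustrated with a small sketch. This is not the authors' implementation; it rests on several assumptions made only for illustration: features are binary term-presence indicators, the Lasso regresses ±1 sentiment labels and is solved by plain coordinate descent, MI is computed as the expected mutual information between term presence and label (the paper's exact definition may differ), and the toy documents and the names `lasso_cd`, `mutual_information` and `lasso_mi_select` are hypothetical. The final SVM classification stage is omitted.

```python
import math

def soft_threshold(z, t):
    """Soft-thresholding operator used in the Lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/(2n))*||y - Xw||^2 + lam*||w||_1, no intercept."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    resid = list(y)  # residual y - Xw; equals y because w starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # term never occurs; its weight stays zero
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * (resid[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, lam) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                w[j] = new_w
    return w

def mutual_information(x_col, y):
    """Expected MI (in nats) between a 0/1 term column and 0/1 labels."""
    n = len(y)
    mi = 0.0
    for xv in (0, 1):
        for yv in (0, 1):
            joint = sum(1 for a, b in zip(x_col, y) if a == xv and b == yv) / n
            px = sum(1 for a in x_col if a == xv) / n
            py = sum(1 for b in y if b == yv) / n
            if joint > 0:
                mi += joint * math.log(joint / (px * py))
    return mi

def lasso_mi_select(X, y, lam, k):
    """Stage 1: keep terms with non-zero Lasso weight. Stage 2: top-k by MI."""
    y_pm = [1.0 if label == 1 else -1.0 for label in y]  # assumed +/-1 coding
    w = lasso_cd(X, y_pm, lam)
    survivors = [j for j, wj in enumerate(w) if abs(wj) > 1e-8]
    scored = [(mutual_information([row[j] for row in X], y), j) for j in survivors]
    scored.sort(reverse=True)  # highest MI first
    return [j for _, j in scored[:k]]

# Hypothetical toy corpus: 8 documents (rows) x 4 candidate terms (columns).
X = [
    [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0],  # positive docs
    [0, 1, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0],  # negative docs
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

selected = lasso_mi_select(X, y, lam=0.1, k=2)
print(selected)
```

In this toy corpus, term 2 occurs equally often in both classes, so the Lasso stage drives its weight to zero; the MI stage then ranks the three surviving terms and keeps the two most class-dependent ones. The selected columns would then be passed to a classifier such as the Gaussian-kernel SVM mentioned in the abstract.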

