Computer Science (计算机科学), 2018, Vol. 45, Issue (1): 39-46. doi: 10.11896/j.issn.1002-137X.2018.01.006
• CRSSC-CWI-CGrC-3WD 2017 •
李燕,卫志华,徐凯
LI Yan, WEI Zhi-hua and XU Kai
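Abstract: Sentiment polarity classification is a central problem in Chinese sentiment analysis. Sentiment feature selection is the prerequisite and foundation of machine-learning-based polarity classification; its role is to reduce the dimensionality of the feature set by removing irrelevant or redundant features. This paper proposes a hybrid sentiment feature selection method that combines the Lasso algorithm with filter-based feature selection. First, Lasso penalized regression screens the original feature set, yielding a sentiment classification feature subset with low redundancy. Next, filter methods such as CHI (chi-square), MI (mutual information), and IG (information gain) are applied to this subset to weight the dependence between each candidate feature word and the text category, and candidate feature words with low relevance are removed accordingly. Finally, the proposed method is compared with DF (document frequency), MI, IG, and CHI on an SVM classifier with a Gaussian kernel, under different numbers of feature words. Experiments on a corpus of short microblog texts show that the proposed algorithm is both effective and efficient. Moreover, when the dimensionality of the feature subset is smaller than the number of samples, the hybrid method improves to some degree on the feature selection performance of DF, MI, IG, and CHI. A comparison of recognition rate and recall shows that the Lasso-MI method is more effective than MI and the other filter methods.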
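The two-stage idea in the abstract (Lasso screening followed by a filter ranking) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the coordinate-descent solver, the tolerance `tol`, the penalty value `alpha`, and the toy term-presence matrix are all assumptions made for illustration, and CHI is used as the filter only because its 2x2 formula is compact.

```python
def soft_threshold(z, g):
    """Soft-thresholding operator used by the Lasso coordinate update."""
    if z > g:
        return z - g
    if z < -g:
        return z + g
    return 0.0


def lasso_cd(X, y, alpha, n_iter=100):
    """Minimise (1/2n)*||y - Xw||^2 + alpha*||w||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    resid = list(y)  # residual y - Xw, kept up to date
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # a constant (centred to zero) column is never selected
            rho = sum(X[i][j] * (resid[i] + w[j] * X[i][j]) for i in range(n)) / n
            new_w = soft_threshold(rho, alpha) / col_sq[j]
            delta = new_w - w[j]
            if delta:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                w[j] = new_w
    return w


def chi_square(col, labels):
    """CHI statistic of a binary term-presence column against binary labels."""
    n = len(col)
    a = sum(1 for x, l in zip(col, labels) if x and l)      # term present, class 1
    b = sum(1 for x, l in zip(col, labels) if x and not l)  # term present, class 0
    c = sum(1 for x, l in zip(col, labels) if not x and l)  # term absent, class 1
    d = n - a - b - c                                       # term absent, class 0
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def hybrid_select(X, labels, alpha, k, tol=1e-8):
    """Stage 1: keep features with non-zero Lasso weight. Stage 2: top-k by CHI."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]  # centre columns
    y_mean = sum(labels) / n
    yc = [l - y_mean for l in labels]                          # centre target
    w = lasso_cd(Xc, yc, alpha)
    kept = [j for j in range(p) if abs(w[j]) > tol]
    scores = {j: chi_square([row[j] for row in X], labels) for j in kept}
    return sorted(kept, key=lambda j: scores[j], reverse=True)[:k]


# Hypothetical toy data: 8 documents, 4 candidate feature words.
# Columns 0 and 1 are duplicates that match the label, column 2 is
# unrelated to the label, and column 3 never occurs.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
X = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
selected = hybrid_select(X, labels, alpha=0.05, k=2)
print(selected)
```

In this toy case the Lasso stage keeps only one of the two duplicated columns, which is the redundancy reduction the abstract attributes to that stage; the CHI stage then ranks whatever survives. The Gaussian-kernel SVM evaluation described in the abstract is not reproduced here.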