Computer Science (计算机科学), 2019, Vol. 46, Issue 6A: 478-481.

• Big Data and Data Mining •

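Method of Short Text Classification Based on Frequent Item Feature Extension

JIN Yi-fan, FU Ying-xun, MA Li

  1. College of Information, North China University of Technology, Beijing 100144, China
  • Online: 2019-06-14 Published: 2019-07-02
  • Corresponding author: JIN Yi-fan (born 1994), male, master's student; main research interests include distributed information processing. E-mail: 1009542253@qq.com
  • About the authors: FU Ying-xun (born 1986), lecturer, Ph.D., CCF member; main research interests include the reliability of cloud and distributed storage. MA Li (born 1968), professor, CCF senior member; main research interests include wireless sensor networks and embedded technology.
  • Supported by:
    the National Natural Science Foundation of China (61702013), the Beijing Excellent Talents Training Funding Project (2016000020124G016), the Science and Technology Plan Project of the Beijing Municipal Education Commission (KM201710009008), and the Scientific Research Start-up Project of North China University of Technology.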

Abstract: Short texts have high-dimensional, sparse feature spaces, so traditional classification methods perform poorly on them. To address this problem, a short text classification method based on frequent item feature extension (STCFIFE) is proposed. First, frequent itemsets are mined from a background corpus with the FP-growth algorithm, and the weights of the extended features are computed by combining them with contextual association features. The new features are then added to the feature space of the original short text, and an SVM (Support Vector Machine) classifier is trained on the extended representation and used for classification. Experimental results show that, compared with the traditional SVM algorithm and the LDA+KNN algorithm, STCFIFE effectively alleviates the feature insufficiency and high-dimensional sparsity of short texts and raises the F1 value by 2% to 10%, improving short text classification performance.

Key words: Feature extension, Feature weight, Frequent item mining, Short text classification, Support vector machine

CLC Number:

  • TP391
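
The feature-extension step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: FP-growth is replaced here by brute-force counting of co-occurring term pairs (2-itemsets only), and the extension weight is assumed to be the rule confidence count(a, b) / count(a), since the abstract does not give the paper's weighting formula. The function names `mine_frequent_pairs` and `extend_features`, the `min_count` threshold, and the toy corpus are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def mine_frequent_pairs(corpus, min_count):
    """Return (frequent_pairs, term_counts) for a tokenized background corpus.

    Brute-force stand-in for FP-growth, limited to 2-itemsets: a pair of
    terms is frequent when it co-occurs in at least min_count documents.
    """
    term_counts = Counter()
    pair_counts = Counter()
    for doc in corpus:
        terms = sorted(set(doc))  # count each term once per document
        term_counts.update(terms)
        pair_counts.update(combinations(terms, 2))
    frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_count}
    return frequent_pairs, term_counts


def extend_features(doc, frequent_pairs, term_counts):
    """Return {term: weight} for a short text plus its extension terms.

    Original terms get weight 1.0. A term that forms a frequent pair with an
    original term is added with the assumed weight count(pair) / count(original);
    if several original terms pull in the same term, the largest weight wins.
    """
    doc_terms = set(doc)
    features = {term: 1.0 for term in doc}
    for (a, b), pair_count in frequent_pairs.items():
        for source, new_term in ((a, b), (b, a)):
            if source in doc_terms and new_term not in doc_terms:
                weight = pair_count / term_counts[source]
                features[new_term] = max(features.get(new_term, 0.0), weight)
    return features


# Toy background corpus (hypothetical data, already tokenized).
background = [
    ["stock", "market", "shares"],
    ["stock", "market", "index"],
    ["stock", "shares", "fund"],
    ["football", "match", "goal"],
    ["football", "goal", "league"],
]
pairs, counts = mine_frequent_pairs(background, min_count=2)
print(extend_features(["stock", "rises"], pairs, counts))
```

The resulting term-to-weight dictionaries could then be vectorized and passed to an SVM, for example with scikit-learn's `DictVectorizer` and `LinearSVC`; that choice of library is also an assumption, as the abstract names only the classifier type.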