Computer Science (计算机科学), 2019, Vol. 46, Issue (6A): 478-481.
JIN Yi-fan (靳一凡), FU Ying-xun (傅颖勋), MA Li (马礼)
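Abstract: Short texts have high-dimensional, sparse feature spaces, so traditional classification methods perform poorly when applied to them. To address this problem, the paper proposes Short Text Classification Based on Frequent Item Feature Extension (STCFIFE). The method first mines the frequent itemsets of a background corpus with the FP-growth algorithm and combines them with contextual association features to compute the weights of the extension features. It then adds the new features to the feature space of the original short text, trains a support vector machine (SVM) classifier on the extended representation, and uses it for classification. Experimental results show that, compared with the traditional SVM algorithm and the LDA+KNN algorithm, STCFIFE effectively alleviates the feature insufficiency and high-dimensional sparsity of short texts, raising the F1 score by 2% to 10% and improving short text classification.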
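The feature-extension step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: frequent itemsets are limited to word pairs and found by brute-force co-occurrence counting (a plain stand-in for FP-growth), and the extension weight is assumed to be the rule confidence count(pair) / count(source word), since the abstract does not give the weighting formula. The names `mine_frequent_pairs`, `extend_features`, `min_count`, and `min_confidence` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def mine_frequent_pairs(corpus, min_count=2):
    """Count word pairs that co-occur in at least `min_count` documents.

    Brute-force stand-in for FP-growth, restricted to 2-itemsets.
    Returns (frequent_pairs, word_counts), both keyed by document frequency.
    """
    pair_counts = Counter()
    word_counts = Counter()
    for doc in corpus:
        words = sorted(set(doc))  # each word counted once per document
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))
    frequent = {p: c for p, c in pair_counts.items() if c >= min_count}
    return frequent, word_counts


def extend_features(tokens, frequent_pairs, word_counts, min_confidence=0.5):
    """Add words that frequently co-occur with the short text's own words.

    Original words keep weight 1.0. An added word gets the highest
    confidence count(pair) / count(source word) among the pairs that
    introduce it (assumed weighting, not taken from the paper).
    """
    present = set(tokens)
    features = {w: 1.0 for w in present}
    for (a, b), count in frequent_pairs.items():
        for src, dst in ((a, b), (b, a)):
            if src in present and dst not in present:
                confidence = count / word_counts[src]
                if confidence >= min_confidence:
                    features[dst] = max(features.get(dst, 0.0), confidence)
    return features


# Toy background corpus of tokenized documents (made-up data).
background = [
    ["stock", "market", "price"],
    ["stock", "market", "fund"],
    ["stock", "price"],
    ["football", "match", "goal"],
    ["football", "match"],
]
pairs, counts = mine_frequent_pairs(background, min_count=2)
print(extend_features(["stock"], pairs, counts))
```

The resulting weighted feature dictionaries could then be vectorized and passed to an SVM, for example with scikit-learn's `DictVectorizer` and `LinearSVC`; that choice of library is likewise an assumption rather than something the abstract specifies.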