Chinese Text Classification Model Based on Improved TF-IDF and ABLCNN

doi:10.11896/jsjkx.210100232

Abstract

Abstract: Text classification is an important task in natural language processing and an active research topic, widely used in information retrieval, sentiment analysis, and other fields. Traditional text classification models extract text features incompletely and express text semantics weakly. To address these problems, a text classification model is proposed that combines an improved TF-IDF algorithm with ABLCNN (Attention Based on Bi-LSTM and CNN), a long short-term memory and convolutional network with an attention mechanism. First, the TF-IDF algorithm is improved by using the intra-class and inter-class distribution of feature items together with their position information, which highlights the importance of feature items; the text is then represented by combining the improved TF-IDF weights with word vectors trained by the Word2vec tool. Next, ABLCNN extracts the text features. ABLCNN combines the advantages of the attention mechanism, the long short-term memory network, and the convolutional neural network, so it extracts the contextual semantic features of the text with emphasis while also capturing local semantic features. Finally, the feature vector is classified by the softmax function. The model is evaluated on the THUCNews dataset and the online_shopping_10_cats dataset. The experimental results show that its accuracy is 97.38% on THUCNews and 91.33% on online_shopping_10_cats, higher than that of the other text classification models compared.

Key words: Attention mechanism, Convolutional neural network, Long short-term memory network, Term frequency-inverse document frequency (TF-IDF), Text classification
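
The text-representation step described in the abstract, weighting Word2vec vectors by TF-IDF, can be illustrated with a minimal sketch. It uses standard TF-IDF only; the abstract does not give the formula for the improved variant (intra-class and inter-class distribution plus position information), so that part is not reproduced here. The function names `tfidf_weights` and `weighted_representation`, the toy documents, and the 2-dimensional word vectors are all hypothetical stand-ins, not the authors' implementation.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: tf-idf} dict per tokenized document (standard TF-IDF)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


def weighted_representation(doc, weights, vectors):
    """Scale each token's word vector by that token's TF-IDF weight."""
    return [[weights[tok] * x for x in vectors[tok]] for tok in doc]


# Hypothetical toy data: pre-tokenized documents and 2-d "word vectors"
# standing in for vectors trained with Word2vec.
docs = [["stock", "rises", "stock"], ["team", "rises"], ["team", "match"]]
vectors = {
    "stock": [1.0, 0.0],
    "rises": [0.5, 0.5],
    "team": [0.0, 1.0],
    "match": [0.2, 0.8],
}

weights = tfidf_weights(docs)
# One weighted vector per token of the first document; a matrix of this
# shape is what a downstream network such as ABLCNN would consume.
matrix = weighted_representation(docs[0], weights[0], vectors)
```

In this toy corpus "stock" appears twice in the first document and in no other, so it receives a larger weight than "rises", which also occurs in a second document. The improved TF-IDF in the paper is intended to sharpen this kind of contrast further using class and position information.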

CLC number: TP391

JING Li (景丽), HE Ting-ting (何婷婷). Chinese Text Classification Model Based on Improved TF-IDF and ABLCNN[J]. Computer Science (计算机科学), 2021, 48(11A): 170-175. https://doi.org/10.11896/jsjkx.210100232
