Computer Science, 2022, Vol. 49, Issue (4): 282-287. doi: 10.11896/jsjkx.210200027

• Artificial Intelligence •

Chinese Short Text Classification Algorithm Based on Hybrid Features of Characters and Words

LIU Shuo, WANG Geng-run, PENG Jian-hua, LI Ke

  1. People's Liberation Army Strategic Support Force Information Engineering University, Zhengzhou 450000, China
  • Received: 2021-02-02 Revised: 2021-05-31 Published: 2022-04-01
  • Corresponding author: WANG Geng-run (wanggengrun@gmail.com)
  • Author contact: 842964176@qq.com
  • About author: LIU Shuo, born in 1996, postgraduate. His main research interests include data analysis, natural language processing and short text classification. WANG Geng-run, born in 1987, Ph.D., assistant researcher. His main research interests include telecommunication network security and data processing.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61803384).

Abstract: With the rapid development of information technology, massive amounts of Chinese short-text data are generated on the Internet. Using Chinese short text classification to mine valuable information from such low-information data is a current research hotspot. Compared with long Chinese texts, short texts have fewer words, more ambiguity and irregular information, which makes their text features difficult to extract and express. To address this, a Chinese short text classification algorithm based on a deep neural network model with hybrid character and word features is proposed. First, the algorithm computes both the character vectors and the word vectors of a Chinese short text and extracts features from each separately. Then, the extracted character-vector and word-vector features are fused. Finally, the classification task is completed by a fully connected layer and a softmax layer. Test results on the public THUCNews news dataset show that the algorithm outperforms the mainstream comparison models TextCNN, BiGRU, BERT and ERNIE_BiGRU on three evaluation metrics (precision, recall and F1 score), indicating good short text classification performance.

Key words: Character vector, Chinese short text classification, Convolutional Neural Network, Pre-training model, Word vector
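
The four steps named in the abstract (character and word vectors, per-branch feature extraction, fusion, fully connected layer plus softmax) can be illustrated with a minimal sketch. This is not the authors' model: the random embedding tables, the max-pooling extractor, fusion by concatenation, the sizes `DIM` and `NUM_CLASSES`, and the pre-segmented word list are all assumptions made for illustration. The abstract does not specify the extractors, and the keywords suggest a convolutional network and a pre-trained model are used instead.

```python
import math
import random

random.seed(0)   # fixed seed so the toy weights are reproducible

DIM = 4          # hypothetical embedding size
NUM_CLASSES = 3  # hypothetical number of categories


def make_embedding(vocab, dim):
    """Random lookup table standing in for trained or pre-trained vectors."""
    return {tok: [random.uniform(-1, 1) for _ in range(dim)] for tok in sorted(vocab)}


def max_pool(vectors, dim):
    """Stand-in feature extractor: per-dimension maximum over the sequence."""
    if not vectors:
        return [0.0] * dim
    return [max(v[i] for v in vectors) for i in range(dim)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(chars, words, char_emb, word_emb, weights, bias):
    # 1. Look up character vectors and word vectors for the same text.
    char_vecs = [char_emb[c] for c in chars if c in char_emb]
    word_vecs = [word_emb[w] for w in words if w in word_emb]
    # 2. Extract one feature vector from each branch separately.
    char_feat = max_pool(char_vecs, DIM)
    word_feat = max_pool(word_vecs, DIM)
    # 3. Fuse the two branches (concatenation is an assumption).
    fused = char_feat + word_feat
    # 4. Fully connected layer followed by softmax.
    logits = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


text = "股市今日大涨"
words = ["股市", "今日", "大涨"]  # assumed output of a word segmenter
char_emb = make_embedding(set(text), DIM)
word_emb = make_embedding(set(words), DIM)
weights = [[random.uniform(-1, 1) for _ in range(2 * DIM)]
           for _ in range(NUM_CLASSES)]
bias = [0.0] * NUM_CLASSES

probs = classify(list(text), words, char_emb, word_emb, weights, bias)
print(probs)  # one probability per class, summing to 1
```

Keeping the two branches separate until the fusion step mirrors the abstract's description: character-level features do not depend on segmentation quality, while word-level features carry coarser semantic units.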

CLC Number:

  • TP391.1