Computer Science, 2023, Vol. 50, Issue (9): 278-286. doi: 10.11896/jsjkx.221200133

• Artificial Intelligence •

Hierarchical Multi-label Text Classification Algorithm Based on Parallel Convolutional Network Information Fusion

YI Liu, GENG Xinyu, BAI Jing

  1. School of Computer Science, Southwest Petroleum University, Chengdu 610000, China
  • Received: 2022-12-23 Revised: 2023-04-07 Online: 2023-09-15 Published: 2023-09-01
  • Corresponding author: GENG Xinyu (gengxy123@126.com)
  • About author: (y1u6rk@163.com)
    YI Liu, born in 1995, postgraduate. His main research interests include natural language processing and text classification.
    GENG Xinyu, born in 1964, professor. His main research interests include data mining and artificial neural networks.
  • Supported by:
    Sichuan Science and Technology Program (2022NSFSC0555).

Abstract: Natural language processing (NLP) is an important research direction in artificial intelligence and machine learning. It aims to use computer technology to analyze, understand, and process natural language. One of its main research areas is to obtain information from textual content and to classify and label that content automatically according to a given label system or standard. Compared with single-label text classification, multi-label text classification is characterized by each data item belonging to multiple labels, which makes it harder to obtain multi-category data features from the text. Hierarchical multi-label text classification is a special case: it assigns the information in a text to different category label systems, and these label systems are linked by interdependent hierarchical relationships. How to use the hierarchical relationships within the label system to classify text into the corresponding labels more accurately therefore becomes the key to solving the problem. To this end, this paper proposes a hierarchical multi-label text classification algorithm based on parallel convolutional network information fusion. First, the algorithm uses the BERT model to produce word embeddings of the text; it then enhances the semantic features of the text with a self-attention mechanism and extracts features from the text data with different convolution kernels. By using a threshold-controlled tree structure to establish relationships between superordinate and subordinate nodes, it makes more effective use of the multi-faceted semantic information in the text for the hierarchical multi-label text classification task. Results on the public Kanshan-Cup dataset and the CI enterprise information dataset show that the algorithm outperforms mainstream comparison models such as TextCNN, TextRNN, and FastText on three evaluation measures, namely macro-precision, macro-recall, and micro-F1, and achieves good hierarchical multi-label text classification performance.
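The three evaluation measures named in the abstract can be made concrete with a small sketch. This is not the authors' evaluation script. It assumes the usual definitions: macro measures average per-label precision and recall over all labels, and micro-F1 pools true positives, false positives, and false negatives across labels. The function name `multilabel_scores` and the toy label names are hypothetical.

```python
from typing import Dict, List, Set


def multilabel_scores(gold: List[Set[str]], pred: List[Set[str]]) -> Dict[str, float]:
    """Macro-precision, macro-recall and micro-F1 for multi-label predictions.

    Assumption: a label with no predictions (or no gold occurrences)
    contributes 0.0 to the corresponding macro average.
    """
    labels = sorted(set().union(*gold, *pred))
    if not labels:
        return {"macro_precision": 0.0, "macro_recall": 0.0, "micro_f1": 0.0}

    tp = {label: 0 for label in labels}
    fp = {label: 0 for label in labels}
    fn = {label: 0 for label in labels}
    for g, p in zip(gold, pred):
        for label in p & g:   # predicted and correct
            tp[label] += 1
        for label in p - g:   # predicted but wrong
            fp[label] += 1
        for label in g - p:   # missed
            fn[label] += 1

    def ratio(num: float, den: float) -> float:
        return num / den if den else 0.0

    macro_p = sum(ratio(tp[l], tp[l] + fp[l]) for l in labels) / len(labels)
    macro_r = sum(ratio(tp[l], tp[l] + fn[l]) for l in labels) / len(labels)

    # Micro-F1 pools the counts over all labels before computing the ratio.
    tp_all, fp_all, fn_all = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro_f1 = ratio(2 * tp_all, 2 * tp_all + fp_all + fn_all)

    return {"macro_precision": macro_p, "macro_recall": macro_r, "micro_f1": micro_f1}


# Toy data (hypothetical): three documents with hierarchical-style label names.
gold = [{"tech", "tech/ai"}, {"sport"}, {"tech"}]
pred = [{"tech", "tech/ai"}, {"tech"}, {"tech"}]
scores = multilabel_scores(gold, pred)
```

Macro averages weight every label equally, so a rare label that is never predicted pulls them down sharply, while micro-F1 is dominated by frequent labels. Reporting both kinds of measure, as the abstract does, gives a view of each.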

Key words: Hierarchical multi-label text classification, Pre-training model, Attention mechanism, Convolutional neural network, Tree structure
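One plausible reading of the threshold-controlled tree structure is a top-down decoding pass in which a child label is kept only if its parent was kept and its own score passes a threshold. The sketch below illustrates that idea only; the abstract does not specify the authors' procedure. The function name `decode_hierarchy`, the `children` mapping, the default threshold of 0.5, and the toy labels and scores are all assumptions.

```python
from collections import deque
from typing import Dict, List


def decode_hierarchy(
    scores: Dict[str, float],
    children: Dict[str, List[str]],
    roots: List[str],
    threshold: float = 0.5,  # assumed value, not taken from the paper
) -> List[str]:
    """Breadth-first, top-down label selection over a label tree.

    A label is kept when its score reaches the threshold; its children are
    examined only in that case, so a rejected parent prunes its subtree.
    """
    kept: List[str] = []
    frontier = deque(roots)
    while frontier:
        label = frontier.popleft()
        if scores.get(label, 0.0) >= threshold:
            kept.append(label)
            frontier.extend(children.get(label, []))
    return kept


# Toy label tree and per-label scores (hypothetical).
children = {
    "tech": ["tech/ai", "tech/hardware"],
    "sport": ["sport/football"],
}
scores = {
    "tech": 0.9,
    "tech/ai": 0.7,
    "tech/hardware": 0.2,
    "sport": 0.3,
    "sport/football": 0.8,
}
selected = decode_hierarchy(scores, children, roots=["tech", "sport"])
```

In this toy run, `sport/football` is dropped despite its high score because its parent `sport` falls below the threshold, which keeps the predicted label set consistent with the hierarchy.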

CLC Number:

  • TP391