Computer Science (计算机科学) ›› 2024, Vol. 51 ›› Issue (6A): 230700064-5. doi: 10.11896/jsjkx.230700064
LI Guo1,2, CHEN Chen1,2, YANG Jing1,3, QUN Nuo1
Abstract: As Tibetan-language information becomes ever more integrated into social life, increasing volumes of Tibetan short-text data are appearing on online platforms. To address the poor performance of traditional classification methods on Tibetan short texts, this paper proposes a Tibetan short-text classification model based on DAN-FastText. The model first trains a FastText network unsupervised on a relatively large Tibetan corpus to obtain a set of pre-trained Tibetan syllable vectors. These pre-trained syllable vectors convert Tibetan short texts into syllable-vector sequences, which are fed into a DAN (Deep Averaging Networks) model; at the output stage, sentence-vector features produced by the FastText network are fused in, and classification is completed by a fully connected layer followed by a softmax layer. On the public TNCC (Tibetan News Classification Corpus) news-title dataset, the proposed model achieves a Macro-F1 of 64.53%, which is 2.81% higher than the best previously reported result, from the TiBERT model, and 6.14% higher than the GCN model, showing that the fused model performs well on Tibetan short-text classification.
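The forward pass described in the abstract (average pre-trained syllable vectors, pass them through a DAN hidden layer, fuse the FastText sentence vector at the output stage, then apply a fully connected layer and softmax) can be sketched as follows. This is a minimal illustration only: the embedding size, hidden size, class count, and random weights are assumptions, not the paper's actual hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 100    # assumed syllable-embedding dimension
HIDDEN = 128     # assumed DAN hidden-layer size
N_CLASSES = 12   # assumed number of TNCC news-title classes

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def dan_fasttext_forward(syllable_vecs, sent_vec, params):
    """Average syllable vectors (deep averaging), apply one hidden layer,
    fuse the FastText sentence vector, then FC + softmax."""
    avg = syllable_vecs.mean(axis=0)                          # deep-averaging step
    h = np.maximum(0.0, params["W1"] @ avg + params["b1"])    # DAN hidden layer (ReLU)
    fused = np.concatenate([h, sent_vec])                     # fuse sentence-vector features
    logits = params["W2"] @ fused + params["b2"]              # fully connected layer
    return softmax(logits)                                    # class probabilities

# Randomly initialised weights stand in for trained parameters.
params = {
    "W1": rng.normal(scale=0.1, size=(HIDDEN, EMB_DIM)),
    "b1": np.zeros(HIDDEN),
    "W2": rng.normal(scale=0.1, size=(N_CLASSES, HIDDEN + EMB_DIM)),
    "b2": np.zeros(N_CLASSES),
}

# A toy "sentence" of 7 syllables; FastText sentence vectors are themselves
# averages of token vectors, which we mimic here.
syllables = rng.normal(size=(7, EMB_DIM))
sentence_vec = syllables.mean(axis=0)
probs = dan_fasttext_forward(syllables, sentence_vec, params)
print(probs.shape)
```

In a real implementation the syllable and sentence vectors would come from a FastText model pre-trained on a large Tibetan corpus, and the DAN and output-layer weights would be learned on the labelled short-text data.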
[1] SALTON G, WONG A, YANG C S. A vector space model for automatic indexing[J]. Communications of the ACM, 1975, 18(11): 613-620.
[2] MIKOLOV T, SUTSKEVER I, CHEN K, et al. Distributed representations of words and phrases and their compositionality[C]//Advances in Neural Information Processing Systems. 2013: 3111-3119.
[3] DUMAIS S T. Latent semantic analysis[J]. Annual Review of Information Science and Technology, 2004, 38: 189-230.
[4] PAPADIMITRIOU C H, RAGHAVAN P, TAMAKI H, et al. Latent semantic indexing: a probabilistic analysis[J]. Journal of Computer and System Sciences, 2000, 61(2): 217-235.
[5] BLEI D M, NG A Y, JORDAN M I. Latent Dirichlet allocation[J]. Journal of Machine Learning Research, 2003, 3: 993-1022.
[6] CAO C P, CUI H C. Microblog topic detection based on LSA and structural property[J]. Application Research of Computers, 2015, 32(9): 2720-2723.
[7] WANG Y B, ZHENG W J, CHENG Y S, et al. Multi-label classification algorithm based on PLSA learning probability distribution semantic information[J]. Journal of Nanjing University (Natural Science), 2021, 57(1): 75-89.
[8] SUN X K, DAI H, ZHOU J H, et al. LTTFAD: log template topic feature-based anomaly detection[J]. Computer Science, 2023, 50(6): 313-321.
[9] YAN X H, GUO J F, LAN Y Y, et al. A biterm topic model for short texts[C]//Proceedings of the 22nd International Conference on World Wide Web (WWW 2013). 2013: 1445-1456.
[10] JIANG X H, SHEN Y H, WANG Y Z, et al. BaKGraSTeC: a background knowledge graph based method for short text classification[C]//2020 IEEE International Conference on Knowledge Graph (ICKG). IEEE, 2020: 360-366.
[11] HE Y, WANG C, ZHANG S, et al. KG-MTT-BERT: knowledge graph enhanced BERT for multi-type medical text classification[J]. arXiv:2210.03970, 2022.
[12] LI B H, XIANG Y X, FENG D I, et al. Short text classification model combining knowledge aware and dual attention[J]. Journal of Software, 2022, 33(10): 3565-3581.
[13] JIANG T, YUAN B, YU H Z. Multi-feature based sentiment analysis of Tibetan microblogs[J]. Journal of Chinese Information Processing, 2017, 31(3): 163-169.
[14] YAN X D, HUANG T. Tibetan sentence sentiment classification based on emotion dictionary[J]. Journal of Chinese Information Processing, 2018, 32(2): 75-80.
[15] ZHU Y L, DEJI K Z, QUN N, et al. Sentiment analysis of Tibetan short texts based on graph neural networks and pre-training models[J]. Journal of Chinese Information Processing, 2023, 37(2): 71-79.
[16] MENG X H, YU H Z. Tibetan text sentiment classification combining syllables and words[J]. Journal of Chinese Information Processing, 2023, 37(2): 80-86.
[17] QUN N, LI X, QIU X, et al. End-to-end neural text classification for Tibetan[C]//Proceedings of the Sixteenth China National Conference on Computational Linguistics. 2017: 1-8.
[18] XU G X, ZHANG Z X, YU S N, et al. Tibetan news text classification based on graph convolutional networks[J]. Data Analysis and Knowledge Discovery, 2022, 7(6): 73-85.
[19] LIU S S, DENG J J, SUN Y, et al. TiBERT: Tibetan pre-trained language model[C]//2022 IEEE International Conference on Systems, Man, and Cybernetics (SMC). IEEE, 2022: 2956-2961.