Computer Science (计算机科学) ›› 2024, Vol. 51 ›› Issue (6A): 230700064-5. doi: 10.11896/jsjkx.230700064
LI Guo1,2, CHEN Chen1,2, YANG Jing1,3, QUN Nuo1
Abstract: As Tibetan-language information becomes ever more integrated into social life, increasing volumes of Tibetan short-text data are appearing on online platforms. To address the poor performance of traditional classification methods on Tibetan short texts, this paper proposes a Tibetan short-text classification model based on DAN-FastText. The model first trains a FastText network unsupervised on a relatively large Tibetan corpus to obtain a set of pre-trained Tibetan syllable vectors. These pre-trained syllable vectors convert Tibetan short texts into syllable-vector sequences, which are fed into a DAN (Deep Averaging Networks) model; at the output stage, sentence-vector features produced by the FastText network are fused in, and classification is completed by a fully connected layer followed by a softmax layer. On the public TNCC (Tibetan News Classification Corpus) news-title dataset, the proposed model achieves a Macro-F1 of 64.53%, which is 2.81% higher than the best previously reported result, from the TiBERT model, and 6.14% higher than the GCN model, showing that the fused model performs well on Tibetan short-text classification.
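The forward pass described in the abstract (average pre-trained syllable vectors, pass them through a DAN hidden layer, fuse the FastText sentence vector at the output stage, then apply a fully connected layer and softmax) can be sketched as follows. This is a minimal illustration only: the embedding size, hidden size, class count, and random weights are assumptions, not the paper's actual hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM = 100    # assumed syllable-embedding dimension
HIDDEN = 128     # assumed DAN hidden-layer size
N_CLASSES = 12   # assumed number of TNCC news-title classes

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def dan_fasttext_forward(syllable_vecs, sent_vec, params):
    """Average syllable vectors (deep averaging), apply one hidden layer,
    fuse the FastText sentence vector, then FC + softmax."""
    avg = syllable_vecs.mean(axis=0)                          # deep-averaging step
    h = np.maximum(0.0, params["W1"] @ avg + params["b1"])    # DAN hidden layer (ReLU)
    fused = np.concatenate([h, sent_vec])                     # fuse sentence-vector features
    logits = params["W2"] @ fused + params["b2"]              # fully connected layer
    return softmax(logits)                                    # class probabilities

# Randomly initialised weights stand in for trained parameters.
params = {
    "W1": rng.normal(scale=0.1, size=(HIDDEN, EMB_DIM)),
    "b1": np.zeros(HIDDEN),
    "W2": rng.normal(scale=0.1, size=(N_CLASSES, HIDDEN + EMB_DIM)),
    "b2": np.zeros(N_CLASSES),
}

# A toy "sentence" of 7 syllables; FastText sentence vectors are themselves
# averages of token vectors, which we mimic here.
syllables = rng.normal(size=(7, EMB_DIM))
sentence_vec = syllables.mean(axis=0)
probs = dan_fasttext_forward(syllables, sentence_vec, params)
print(probs.shape)
```

In a real implementation the syllable and sentence vectors would come from a FastText model pre-trained on a large Tibetan corpus, and the DAN and output-layer weights would be learned on the labelled short-text data.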
[1] SALTON G, WONG A, YANG C S. A vector space model for automatic indexing[J]. Communications of the ACM, 1975, 18(11): 613-620.
[2] MIKOLOV T, SUTSKEVER I, CHEN K, et al. Distributed representations of words and phrases and their compositionality[C]//Advances in Neural Information Processing Systems. 2013: 3111-3119.
[3] DUMAIS S T. Latent semantic analysis[J]. Annual Review of Information Science and Technology, 2004, 38: 189-230.
[4] PAPADIMITRIOU C H, RAGHAVAN P, TAMAKI H, et al. Latent semantic indexing: a probabilistic analysis[J]. Journal of Computer and System Sciences, 2000, 61(2): 217-235.
[5] BLEI D M, NG A Y, JORDAN M I. Latent Dirichlet allocation[J]. Journal of Machine Learning Research, 2003, 3: 993-1022.
[6] CAO C P, CUI H C. Microblog topic detection based on LSA and structural property[J]. Application Research of Computers, 2015, 32(9): 2720-2723.
[7] WANG Y B, ZHENG W J, CHENG Y S, et al. Multi-label classification algorithm based on PLSA learning probability distribution semantic information[J]. Journal of Nanjing University (Natural Science), 2021, 57(1): 75-89.
[8] SUN X K, DAI H, ZHOU J H, et al. LTTFAD: log template topic feature-based anomaly detection[J]. Computer Science, 2023, 50(6): 313-321.
[9] YAN X H, GUO J F, LAN Y Y, et al. A biterm topic model for short texts[C]//Proceedings of the 22nd International Conference on World Wide Web (WWW 2013). 2013: 1445-1456.
[10] JIANG X H, SHEN Y H, WANG Y Z, et al. BaKGraSTeC: a background knowledge graph based method for short text classification[C]//2020 IEEE International Conference on Knowledge Graph (ICKG). IEEE, 2020: 360-366.
[11] HE Y, WANG C, ZHANG S, et al. KG-MTT-BERT: knowledge graph enhanced BERT for multi-type medical text classification[J]. arXiv:2210.03970, 2022.
[12] LI B H, XIANG Y X, FENG D I, et al. Short text classification model combining knowledge aware and dual attention[J]. Journal of Software, 2022, 33(10): 3565-3581.
[13] JIANG T, YUAN B, YU H Z. Multi-feature based sentiment analysis of Tibetan microblogs[J]. Journal of Chinese Information Processing, 2017, 31(3): 163-169.
[14] YAN X D, HUANG T. Tibetan sentence sentiment classification based on emotion dictionary[J]. Journal of Chinese Information Processing, 2018, 32(2): 75-80.
[15] ZHU Y L, DEJI K Z, QUN N, et al. Sentiment analysis of Tibetan short texts based on graph neural networks and pre-training models[J]. Journal of Chinese Information Processing, 2023, 37(2): 71-79.
[16] MENG X H, YU H Z. Tibetan text sentiment classification combining syllables and words[J]. Journal of Chinese Information Processing, 2023, 37(2): 80-86.
[17] QUN N, LI X, QIU X, et al. End-to-end neural text classification for Tibetan[C]//Proceedings of the Sixteenth China National Conference on Computational Linguistics. 2017: 1-8.
[18] XU G X, ZHANG Z X, YU S N, et al. Tibetan news text classification based on graph convolutional networks[J]. Data Analysis and Knowledge Discovery, 2022, 7(6): 73-85.
[19] LIU S S, DENG J J, SUN Y, et al. TiBERT: Tibetan pre-trained language model[C]//2022 IEEE International Conference on Systems, Man, and Cybernetics (SMC). IEEE, 2022: 2956-2961.