Computer Science ›› 2025, Vol. 52 ›› Issue (11A): 250100059-8. doi: 10.11896/jsjkx.250100059
ZHAO Zhuoyang1, QIN Donghong1,4, BAI Fengbo1,4, LIANG Xianye1, XU Chen1, ZHENG Yuehua1, LIANG Yufeng1, LAN Sheng2,4, ZHOU Guoping3
Abstract: Traditional graph convolutional network (GCN) methods can model graph structure effectively under limited-data conditions, but because they rely on sparse one-hot encodings, their ability to capture contextual relations between words is limited. This problem is especially pronounced in low-resource language settings. Zhuang-text topic classification is a case in point: the task faces not only data scarcity but also the challenge of complex linguistic structure. To address these challenges, this paper proposes ZHA_TGCN, a Zhuang topic-classification method suited to low-resource settings. The method uses the Zhuang pre-trained model ZHA_BERT to extract text features, combines them with Zhuang tone features, and feeds the result into a BiGRU to learn deep semantic representations. The learned representation vectors then serve as document-node features for a GCN, which performs label propagation to learn feature representations for both the training data and the unlabeled test data. Finally, a Softmax layer outputs the classification result. Experimental results show that the proposed method achieves 82.12% accuracy, 90.08% precision, 92.46% recall, and an F1 score of 90.18% on the low-resource Zhuang topic-classification task, demonstrating its effectiveness.
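The GCN stage described above can be illustrated with a minimal sketch. This is not the authors' implementation: the adjacency matrix, feature dimensions, and weights below are hypothetical stand-ins, with the node features playing the role of the BiGRU representations and the final Softmax producing topic probabilities, following the standard two-layer GCN propagation rule of Kipf and Welling.

```python
import numpy as np

def normalize_adjacency(A):
    # Symmetric normalization with self-loops: A_hat = D^{-1/2} (A + I) D^{-1/2}
    A_loop = A + np.eye(A.shape[0])
    d = A_loop.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_loop @ D_inv_sqrt

def softmax(z):
    # Row-wise softmax over class scores
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def gcn_forward(A, X, W1, W2):
    # Two-layer GCN: ReLU(A_norm X W1), then A_norm H W2 -> class probabilities.
    # Propagating over A_norm mixes labeled and unlabeled document nodes,
    # which is the label-propagation effect the abstract refers to.
    A_norm = normalize_adjacency(A)
    H = np.maximum(A_norm @ X @ W1, 0.0)
    return softmax(A_norm @ H @ W2)

rng = np.random.default_rng(0)
# Hypothetical toy graph: 4 document nodes on a chain
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 8))    # node features (stand-in for BiGRU outputs)
W1 = rng.normal(size=(8, 16))  # first-layer weights (untrained, for shape only)
W2 = rng.normal(size=(16, 3))  # second-layer weights, 3 hypothetical topic classes
probs = gcn_forward(A, X, W1, W2)
```

Each row of `probs` is a probability distribution over the topic classes for one document node; in the full method these weights would be trained jointly with the ZHA_BERT and BiGRU components rather than sampled at random.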