Incorporating Part of Speech and Tonal Features for Vietnamese Grammatical Error Detection

Computer Science, 2022, Vol. 49, Issue (11): 221-227. doi: 10.11896/jsjkx.210900247

ZHANG Zhou, ZHU Jun-guo, YU Zheng-tao

School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, China

Received: 2021-09-28 Revised: 2022-03-22 Online: 2022-11-15 Published: 2022-11-03
Corresponding author: ZHU Jun-guo (jg.zhu.hit@qq.com)
About author (zhangzhoukust@foxmail.com): ZHANG Zhou, born in 1992, postgraduate. His main research interests include natural language processing and grammatical error correction.
ZHU Jun-guo, born in 1982, Ph.D., lecturer, is a member of the China Computer Federation. His main research interests include natural language processing and machine translation.
Supported by:
National Natural Science Foundation of China (62166022, 61732005, 61866020), Yunnan Provincial Major Science and Technology Special Plan (202002AD080001, 202103AA080015), General Project of Yunnan Provincial Department of Science and Technology (202101AT070077) and People Training Project of Yunnan Province (KKSY201903018).

Abstract

Abstract: When tokenizing Vietnamese text, the BERT (Bidirectional Encoder Representations from Transformers) pre-trained language model strips the tone marks from Vietnamese syllables, so a grammatical error detection model loses part of the semantic information during training. To address this problem, an approach that fuses Vietnamese part-of-speech and tonal features is proposed to restore the semantic information of the input syllables. Because labeled Vietnamese data is scarce, the grammatical error detection task also suffers from insufficient training data. To address this second problem, a data augmentation algorithm is designed that generates a large amount of erroneous text from correct corpora. Experimental results on Vietnamese Wikipedia and news corpora show that the proposed method achieves the highest F_0.5 and F₁ scores on the test set, which indicates that it improves detection performance. Moreover, the performance of both the proposed method and the baseline model improves steadily as the amount of generated data grows, which demonstrates the effectiveness of the proposed data augmentation algorithm.
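The tone loss described in the abstract can be illustrated with a minimal sketch. It assumes that accents are stripped by Unicode NFD decomposition followed by removal of combining marks, which is how the accent-stripping step of uncased BERT tokenizers works. The helper names `strip_accents` and `tone_of` and the tone labels are illustrative assumptions, not taken from the paper. Note that this kind of stripping also removes vowel-quality diacritics (for example the circumflex in `ê`), not only tone marks.

```python
import unicodedata

# Combining marks that encode the five marked Vietnamese tones.
# The label strings are illustrative, not the paper's feature names.
TONE_MARKS = {
    "\u0300": "huyen",  # grave accent
    "\u0301": "sac",    # acute accent
    "\u0303": "nga",    # tilde
    "\u0309": "hoi",    # hook above
    "\u0323": "nang",   # dot below
}


def strip_accents(text):
    """Drop every combining mark, as an accent-stripping tokenizer would."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def tone_of(syllable):
    """Return the tone label of one syllable ('ngang' when no tone mark is present)."""
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            return TONE_MARKS[ch]
    return "ngang"


syllables = ["ma", "má", "mà", "mả", "mã", "mạ"]
stripped = [strip_accents(s) for s in syllables]
tones = [tone_of(s) for s in syllables]
```

All six syllables collapse to `ma` after stripping, while `tone_of` still yields a six-way label that could be embedded and combined with the token representation. How the paper actually encodes and fuses this feature is not stated in the abstract.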

Key words: Pre-trained language model, Vietnamese grammatical error detection, Feature fusion, Data augmentation
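The data augmentation idea named in the abstract and key words, generating erroneous text from correct text, can be sketched as follows. The abstract does not describe the paper's algorithm, so everything here is an assumption made for illustration: the function name `corrupt`, the two error types (adjacent swap and vocabulary substitution), the `error_rate` parameter, and the per-token labels (1 for an injected error, 0 otherwise).

```python
import random


def corrupt(tokens, vocab, rng, error_rate=0.15):
    """Return (noisy_tokens, labels); label 1 marks a token changed by an injected error."""
    noisy = list(tokens)
    labels = [0] * len(noisy)
    i = 0
    while i < len(noisy):
        if rng.random() < error_rate:
            can_swap = i + 1 < len(noisy) and noisy[i] != noisy[i + 1]
            if can_swap and rng.random() < 0.5:
                # Word-order error: swap two adjacent, distinct syllables.
                noisy[i], noisy[i + 1] = noisy[i + 1], noisy[i]
                labels[i] = labels[i + 1] = 1
                i += 2
                continue
            # Substitution error: pick a different syllable from the vocabulary.
            candidates = [w for w in vocab if w != noisy[i]]
            if candidates:
                noisy[i] = rng.choice(candidates)
                labels[i] = 1
        i += 1
    return noisy, labels


# Hypothetical example sentence and vocabulary.
sentence = "tôi đang học tiếng Việt".split()
vocab = ["tôi", "đang", "học", "tiếng", "Việt", "và", "là"]
noisy, labels = corrupt(sentence, vocab, random.Random(0), error_rate=0.3)
```

Because every labeled position differs from the original token and every unlabeled position is unchanged, the output can serve directly as token-level training data for a sequence-labeling detector. Running the function repeatedly with different seeds yields many erroneous variants of each correct sentence.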

CLC number: TP391
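The abstract reports F_0.5 and F₁ scores. As a reminder of how the two relate, the sketch below computes the standard F_β measure from error-detection counts; with β = 0.5, precision is weighted more heavily than recall. The function name `f_beta` and the counts in the example are hypothetical and are not results from the paper.

```python
def f_beta(tp, fp, fn, beta):
    """Standard F-beta score from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: precision = 0.8, recall = 0.5.
f1 = f_beta(tp=8, fp=2, fn=8, beta=1.0)
f05 = f_beta(tp=8, fp=2, fn=8, beta=0.5)
```

With these counts precision exceeds recall, so the F_0.5 value comes out higher than the F₁ value, which shows why a precision-oriented detector is favored by F_0.5.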

ZHANG Zhou, ZHU Jun-guo, YU Zheng-tao. Incorporating Part of Speech and Tonal Features for Vietnamese Grammatical Error Detection[J]. Computer Science, 2022, 49(11): 221-227. https://doi.org/10.11896/jsjkx.210900247
