Computer Science (计算机科学), 2022, Vol. 49, Issue 11: 221-227. doi: 10.11896/jsjkx.210900247
ZHANG Zhou (张洲), ZHU Jun-guo (朱俊国), YU Zheng-tao (余正涛)
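Abstract: When the BERT (Bidirectional Encoder Representations from Transformers) pre-trained language model tokenizes Vietnamese, it strips the tone marks from Vietnamese syllables, so a grammatical error detection model loses part of the semantic information during training. To address this problem, a method is proposed that fuses Vietnamese part-of-speech and tone features to restore the semantic information of the input syllables. In addition, because annotated Vietnamese corpora are scarce, the grammatical error detection task suffers from insufficient training data. To address this second problem, a data augmentation algorithm is designed that generates large amounts of erroneous text from correct corpora. Experiments on Vietnamese Wikipedia and news corpora show that the proposed method achieves the highest F0.5 and F1 scores on the test set, indicating that it improves detection performance. Moreover, as the amount of generated data grows, the performance of both the proposed method and the baseline models improves steadily, which demonstrates the effectiveness of the proposed data augmentation algorithm.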
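The tone loss described in the abstract can be illustrated with a minimal Python sketch. It assumes that accent stripping works the way the original BERT basic tokenizer does when accent stripping is enabled: decompose with Unicode NFD and drop every combining mark. The function names `strip_accents`, `tone_of`, and `tone_features` are hypothetical, and the tone labels are only one possible encoding of a tone feature, not necessarily the one used in the paper.

```python
import unicodedata

# Combining marks for the five marked Vietnamese tones.
# The sixth tone ("ngang") carries no mark.
TONE_MARKS = {
    "\u0300": "huyen",  # grave accent
    "\u0301": "sac",    # acute accent
    "\u0303": "nga",    # tilde
    "\u0309": "hoi",    # hook above
    "\u0323": "nang",   # dot below
}


def strip_accents(text):
    """Drop all combining marks after NFD normalization.

    Assumption: this mirrors BERT-style accent stripping. It removes tone
    marks and also vowel-quality marks (circumflex, breve, horn).
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def tone_of(syllable):
    """Return a tone label for one syllable, or 'ngang' if it has no tone mark."""
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            return TONE_MARKS[ch]
    return "ngang"


def tone_features(sentence):
    """Pair each whitespace-separated syllable's stripped form with its tone label."""
    return [(strip_accents(s), tone_of(s)) for s in sentence.split()]


if __name__ == "__main__":
    # Distinct syllables collapse to the same string once marks are removed.
    print(strip_accents("má"), strip_accents("mà"))
    # The tone label keeps the information that stripping discards.
    print(tone_features("tiếng Việt"))
```

Keeping the tone as a separate label alongside the stripped syllable is what allows it to be fused back in as an additional input feature; how the paper embeds and combines that feature with the part-of-speech feature is not specified in the abstract.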