Computer Science (计算机科学) ›› 2022, Vol. 49 ›› Issue (1): 73-79. doi: 10.11896/jsjkx.210900036

• Frontier Technologies in Multilingual Computing

Improving Low-resource Dependency Parsing Using Multi-strategy Data Augmentation

XIAN Yan-tuan (线岩团), GAO Fan-ya (高凡雅), XIANG Yan (相艳), YU Zheng-tao (余正涛), WANG Jian (王剑)

  1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
    Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, China
  • Received: 2021-09-03 Revised: 2021-10-13 Online: 2022-01-15 Published: 2022-01-18
  • Corresponding author: XIANG Yan (50691012@qq.com)
  • About author: xianyt@kust.edu.cn
    XIAN Yan-tuan, born in 1981, Ph.D., associate professor, is a member of China Computer Federation. His main research interests include information retrieval and natural language processing.
    XIANG Yan, born in 1979, Ph.D., associate professor, is a member of China Computer Federation. Her main research interests include text mining and sentiment analysis.
  • Supported by:
    National Natural Science Foundation of China (61732005, 61972186), Yunnan Provincial Major Science and Technology Special Plan Projects (202002AD080001, 202103AA080015) and Yunnan High and New Technology Industry Project (201606).

Abstract: Dependency parsing aims to identify the syntactic dependencies between words in a sentence. It can provide syntactic features for tasks such as information extraction, automatic question answering and machine translation, and thereby improve model performance. The size of the training data has a significant impact on the performance of a dependency parsing model: a lack of training data causes serious unknown-word and overfitting problems. This paper proposes several data augmentation strategies for low-resource dependency parsing. Synonym substitution expands the training data and alleviates the unknown-word problem, while multiple Mixup-based augmentation strategies alleviate overfitting and improve the generalization ability of the model. Experimental results on the Universal Dependencies treebanks (UD treebanks) show that the proposed methods improve the performance of Thai, Vietnamese and English dependency parsing when only a small training corpus is available.

Key words: Dependency parsing, Low-resource language, Mixup data augmentation, Multi-strategy, Synonym substitution
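
The two families of augmentation named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which Mixup variants are used or where in the parser they are applied, so the code below only shows the standard Mixup interpolation (a weight drawn from Beta(alpha, alpha) and shared by inputs and labels) together with a simple dictionary-based synonym substitution. The function names, the `alpha` and `prob` defaults, and the toy synonym table are all assumptions made for illustration.

```python
import random


def mixup(x_a, x_b, y_a, y_b, alpha=0.2, rng=random):
    """Standard Mixup: interpolate two feature vectors and their label
    vectors with a single weight lam ~ Beta(alpha, alpha).

    `alpha` is a hypothetical default, not a value taken from the paper.
    """
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam


def replace_synonyms(tokens, synonyms, prob=0.3, rng=random):
    """Replace each token that has an entry in `synonyms` with a randomly
    chosen synonym, with probability `prob`.

    Token count and order are unchanged, so (as an assumption of this
    sketch) the original head indices and relation labels can be reused
    for the augmented sentence.
    """
    out = []
    for tok in tokens:
        candidates = synonyms.get(tok)
        if candidates and rng.random() < prob:
            out.append(rng.choice(candidates))
        else:
            out.append(tok)
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy vectors standing in for word representations and one-hot labels.
    x, y, lam = mixup([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], rng=rng)
    print(lam, x, y)
    # Toy synonym table; a real system would need a language-specific resource.
    print(replace_synonyms(["the", "big", "dog"], {"big": ["large", "huge"]}, rng=rng))
```

Because the substitution is one-to-one, the augmented sentence keeps the same length as the original. This is a design choice of the sketch that lets the tree annotation be carried over unchanged.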

CLC Number: 

  • TP391