Automatic Learning Method of Domain Semantic Grammar Based on Fault-tolerant Earley Parsing Algorithm

MA Yi-fan1, MA Tao-tao2, FANG Fang3, WANG Shi2, TANG Su-qin4, CAO Cun-gen2   

    1 School of Computer Science and Information Engineering, Guangxi Normal University, Guilin, Guangxi 541000, China
    2 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
    3 Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100190, China
    4 Department of Educational Technology, Faculty of Education, Guangxi Normal University, Guilin, Guangxi 541000, China
  • Received: 2021-01-28 Revised: 2021-05-19 Online: 2021-11-15 Published: 2021-11-10
  • About author: MA Yi-fan, born in 1996, postgraduate. Her main research interests include natural language processing.
    CAO Cun-gen, born in 1964, Ph.D., professor, Ph.D. supervisor, is a member of China Computer Federation. His main research interests include large-scale knowledge processing.
  • Supported by:
    Key Research and Development Projects of the Ministry of Science and Technology (2017YFC1700302), Beijing NOVA Program (Cross-discipline, Z191100001119014), National Key Research and Development Program of China (2017YFB1002300) and National Natural Science Foundation of China (61967002).

Abstract: Fine-grained domain text analysis is an important prerequisite for high-quality domain knowledge acquisition. It usually relies on a large number of semantic grammar production rules of some form, but writing these rules by hand is time-consuming and labor-intensive. This paper proposes an automatic learning method for semantic grammars based on a fault-tolerant Earley parsing algorithm. Starting from a seed grammar, the method automatically generates new semantic grammars (including word classes, i.e. lexicons, and grammar production rules) to reduce labor costs. It uses an optimized fault-tolerant Earley parser to parse the input sentences, generates candidate semantic grammars from the resulting parse trees, and then filters or corrects the candidates to obtain the final semantic grammars. In experiments on traditional Chinese medicine (TCM) medical records covering five different diseases, the precision of learning new lexicons reaches 63.88%, and the precision of learning new grammar production rules reaches 81.78%.

Key words: Fault-tolerant Earley parser, Filtering algorithm, Grammar learning, Semantic correction, Semantic grammar
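
To make the idea of fault-tolerant parsing concrete, the following is a minimal Python sketch, not the optimized parser described in the paper. It assumes a deliberately simple error model in which the only tolerated fault is skipping an input token, and it assumes the grammar has no empty productions. The function name `tolerant_parse`, the `max_skips` parameter, and the toy grammar are all hypothetical and introduced only for illustration. Tokens that must be skipped to complete a parse are returned as raw material from which candidate lexicon entries could later be proposed, filtered, or corrected.

```python
from typing import Dict, List, Optional, Tuple

Grammar = Dict[str, List[Tuple[str, ...]]]


def tolerant_parse(grammar: Grammar, start: str, tokens: List[str],
                   max_skips: int = 1) -> Optional[List[str]]:
    """Earley recognition that may skip up to `max_skips` tokens.

    Returns the skipped tokens of a parse with the fewest skips,
    or None if no parse exists within the skip budget.
    Assumption: the grammar contains no empty productions.
    """
    n = len(tokens)
    # A state is (lhs, rhs, dot, origin, skipped_token_indices).
    chart = [set() for _ in range(n + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, rhs, 0, 0, ()))

    for i in range(n + 1):
        agenda = list(chart[i])
        while agenda:
            lhs, rhs, dot, origin, skipped = agenda.pop()
            new_states = []
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in grammar:
                    # Predict: expand the nonterminal after the dot.
                    for prod in grammar[sym]:
                        new_states.append((sym, prod, 0, i, ()))
                elif i < n and tokens[i] == sym:
                    # Scan: the terminal matches the current token.
                    chart[i + 1].add((lhs, rhs, dot + 1, origin, skipped))
            else:
                # Complete: advance every parent waiting for `lhs`.
                for l2, r2, d2, o2, s2 in list(chart[origin]):
                    if (d2 < len(r2) and r2[d2] == lhs
                            and len(s2) + len(skipped) <= max_skips):
                        new_states.append((l2, r2, d2 + 1, o2, s2 + skipped))
            for state in new_states:
                if state not in chart[i]:
                    chart[i].add(state)
                    agenda.append(state)
            # Fault tolerance: skip token i and carry the state forward.
            if i < n and len(skipped) < max_skips:
                chart[i + 1].add((lhs, rhs, dot, origin, skipped + (i,)))

    finals = [s[4] for s in chart[n]
              if s[0] == start and s[2] == len(s[1]) and s[3] == 0]
    if not finals:
        return None
    best = min(finals, key=lambda sk: (len(sk), sk))
    return [tokens[j] for j in best]


# Hypothetical seed grammar: symbols that are not keys are terminals.
SEED: Grammar = {
    "S": [("PATIENT", "has", "SYMPTOM")],
    "PATIENT": [("patient",)],
    "SYMPTOM": [("cough",), ("fever",)],
}

unknown = tolerant_parse(SEED, "S", ["patient", "has", "severe", "cough"])
```

Here the seed grammar cannot account for "severe", so the parser skips it and reports it. In a fuller pipeline, the position of the skipped token in the parse tree (between "has" and SYMPTOM) would suggest where a new word class or production rule might be needed; that candidate-generation and filtering logic is outside this sketch.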


