Computer Science (计算机科学), 2019, Vol. 46, Issue (3): 234-241. doi: 10.11896/j.issn.1002-137X.2019.03.035

• Artificial Intelligence •

English Automated Essay Scoring Methods Based on Discourse Structure

ZHOU Ming1,3, JIA Yan-ming2, ZHOU Cai-lan1, XU Ning1,3

  1. School of Computer Science and Technology, Wuhan University of Technology, Wuhan 430070, China
  2. Research Center for Artificial Intelligence and Big Data, Global Wisdom Inc, Beijing 100085, China
  3. Hubei Key Laboratory of Transportation Internet of Things, Wuhan University of Technology, Wuhan 430070, China
  • Received: 2018-01-24 Revised: 2018-05-13 Online: 2019-03-15 Published: 2019-03-22
  • Corresponding author: ZHOU Cai-lan (born 1964), female, M.S., associate professor. Her main research interests are deep learning and image processing. E-mail: 383277764@qq.com
  • About the authors: ZHOU Ming (born 1993), male, master's student. His main research interests are natural language processing and discourse analysis. E-mail: zhoum1118@163.com; JIA Yan-ming (born 1980), male, Ph.D., senior engineer. His main research interests are machine learning and algorithm optimization; XU Ning (born 1968), male, Ph.D., professor, doctoral supervisor, CCF senior member. His main research interests are computer-aided design systems for very-large-scale integrated circuits, computer architecture, data mining, and algorithm optimization.

Abstract: Automated essay scoring (AES) refers to systems that evaluate and score essays using techniques from statistics, natural language processing, linguistics, and related fields. Discourse structure analysis is an important research direction in natural language processing and an important component of AES systems. AES systems developed abroad are widely used, but their treatment of discourse structure scoring remains insufficient, and they are not tailored to English essays written by Chinese students. Domestic research on English AES is still in its infancy and has overlooked the importance of discourse structure in essay scoring. To address these problems, this paper proposes an AES method for English essays based on discourse structure. The method first extracts lexical, syntactic, and structural features at the word, sentence, and paragraph levels. It then classifies discourse elements using support vector machines, random forests, and extreme gradient boosting, and finally builds a linear regression model to score the discourse structure of an essay. Experimental results show that the random-forest-based discourse element identification model (Discourse Element Identification based Random Forest, DEI-RF) reaches an accuracy of 94.13%, and that the linear-regression-based discourse structure scoring model (Discourse Structures Scoring based Linear Regression, DSS-LR) reaches mean squared errors of 0.02, 0.11, and 0.08 on introduction, argumentation, and concession paragraphs, respectively.

Key words: Automated essay scoring, Discourse element, Discourse structure analysis, Linear regression, Natural language processing, Random forest

CLC Number: TP391.1
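
The abstract reports two kinds of figures: classification accuracy for discourse element identification and mean squared error for discourse structure scoring with linear regression. The following is a minimal Python sketch of how such figures can be computed, not the authors' implementation. The labels, the single feature `connective_counts`, the scores, and the helper `fit_line` are all hypothetical toy values and names chosen for illustration; the paper's actual feature set and data are not given on this page.

```python
from typing import Sequence, Tuple


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of paragraphs whose predicted discourse element equals the gold label."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def mean_squared_error(predicted: Sequence[float], gold: Sequence[float]) -> float:
    """Average squared difference between predicted and reference scores."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - g) ** 2 for p, g in zip(predicted, gold)) / len(gold)


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares with one feature; returns (slope, intercept)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("feature values must not all be identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical gold and predicted discourse elements for four paragraphs.
gold_labels = ["Introduction", "Argumentation", "Argumentation", "Concession"]
pred_labels = ["Introduction", "Argumentation", "Concession", "Concession"]
label_accuracy = accuracy(pred_labels, gold_labels)

# Hypothetical single feature (connective count) and human-assigned scores.
connective_counts = [1.0, 2.0, 3.0, 4.0]
human_scores = [2.0, 3.0, 5.0, 6.0]
slope, intercept = fit_line(connective_counts, human_scores)
predicted_scores = [slope * c + intercept for c in connective_counts]
score_mse = mean_squared_error(predicted_scores, human_scores)

print(f"accuracy={label_accuracy:.2f}, MSE={score_mse:.2f}")
```

In this sketch the error is measured on the same data used for fitting, which is only for brevity; a real evaluation would report accuracy and mean squared error on held-out essays.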