Computer Science, 2026, Vol. 53, Issue (5): 90-98. doi: 10.11896/jsjkx.250600183

• Intelligent Educational Technology •

Innovative Automated Scoring Based on Large Language Models

WANG Shenghui, LI Teng

  1. School of Artificial Intelligence, Anhui University, Hefei 230601, China
  • Received: 2025-06-26 Revised: 2025-08-28 Online: 2026-05-08
  • Corresponding author: LI Teng (liteng@ahu.edu.cn)
  • About author: (WA23201005@stu.ahu.edu.cn) WANG Shenghui, born in 2000, postgraduate. His main research interests include the application of large language models and computer vision.
    LI Teng, born in 1980, Ph.D., professor, Ph.D. supervisor. His main research interests include computer vision and pattern recognition.

Abstract: Innovative automated scoring (IAS) is important in education. Traditional scoring is highly subjective, inefficient, and inconsistent in its standards, and the rapid development of large language models offers a new way to address these problems. This study builds a high-quality dataset, WAIS, and proposes a semantic-driven hierarchical topic extraction algorithm. The algorithm proceeds in four stages (semantic chunking, basic topic extraction, optimization analysis, and topic fusion) and improves the model's extraction of topics from student answers, enabling automatic topic extraction. This gives automated scoring a more accurate basis and establishes an interpretable cognitive framework for subsequent scoring. Three prompting strategies, Zero-shot, Few-shot, and Chain-of-Thought (CoT), are compared and evaluated on several pre-trained models. The results show that CoT significantly outperforms the other strategies, with the DeepSeek-R1 model reaching 68% accuracy. The fine-tuned small-parameter model Qwen1.5-7B reaches 83% accuracy, and on the innovativeness scoring task it even slightly outperforms large-parameter models that use prompting alone. These results indicate that using large language models for innovative automated scoring is feasible and has broad prospects for development.

Key words: Large language models, Innovation, Automated scoring, Prompt engineering, Supervised fine-tuning
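The three prompting strategies named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' prompt text: the rubric wording, the 1-to-5 scale, the function names, and the idea of passing extracted topics into the CoT prompt are all assumptions made for illustration only.

```python
from typing import List, Tuple

# Hypothetical rubric; the abstract does not state the paper's actual scoring scale.
RUBRIC = "Rate the innovativeness of the student's answer as an integer from 1 (low) to 5 (high)."


def zero_shot_prompt(question: str, answer: str) -> str:
    """Zero-shot: task instruction only, with no worked examples."""
    return f"{RUBRIC}\nQuestion: {question}\nAnswer: {answer}\nScore:"


def few_shot_prompt(question: str, answer: str, examples: List[Tuple[str, int]]) -> str:
    """Few-shot: prepend already-scored answers as demonstrations."""
    demos = "\n".join(f"Answer: {a}\nScore: {s}" for a, s in examples)
    return f"{RUBRIC}\nQuestion: {question}\n{demos}\nAnswer: {answer}\nScore:"


def cot_prompt(question: str, answer: str, topics: List[str]) -> str:
    """Chain-of-Thought: ask for step-by-step reasoning before the score.

    Listing extracted topics here is an assumption about how topic
    extraction might feed the scoring prompt.
    """
    topic_lines = "\n".join(f"- {t}" for t in topics)
    return (
        f"{RUBRIC}\nQuestion: {question}\nAnswer: {answer}\n"
        f"Extracted topics:\n{topic_lines}\n"
        "Think step by step: assess how novel each topic is, "
        "then give the final line as 'Score: <integer>'."
    )
```

Each function returns a plain string, so the same prompts could be sent to any of the evaluated models and the outputs compared under identical wording.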

CLC Number:

  • TP391