Computer Science ›› 2026, Vol. 53 ›› Issue (5): 90-98. doi: 10.11896/jsjkx.250600183
WANG Shenghui, LI Teng
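Abstract: Innovativeness automatic scoring (IAS) is important in education. Traditional scoring is highly subjective, inefficient, and inconsistent in its standards, and the rapid development of large language models offers a new way to address these problems. To this end, a high-quality dataset, WAIS, is constructed, and a semantics-driven hierarchical topic extraction algorithm is proposed. The algorithm proceeds in four stages: semantic chunking, basic topic extraction, optimization analysis, and topic fusion. It improves how well the model extracts topics from student answers, automates topic extraction, gives automatic scoring a more accurate basis, and establishes an interpretable cognitive framework for subsequent scoring. Three prompting strategies, Zero-shot, Few-shot, and Chain-of-Thought (CoT), are compared across several pre-trained models. The results show that CoT significantly outperforms the other two strategies, with the DeepSeek-R1 model reaching an accuracy of 68%, while the fine-tuned small-parameter model Qwen1.5-7B reaches an accuracy of 83%, slightly better on the innovativeness scoring task than large-parameter models that use prompts alone. The study shows that automatic innovativeness scoring with large language models is feasible and has broad prospects.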
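The three prompting strategies compared in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual prompts, so the task wording, the 1 to 5 scale, and the function names `zero_shot_prompt`, `few_shot_prompt`, and `cot_prompt` below are all assumptions made for illustration only.

```python
from typing import List, Tuple

# Hypothetical task wording and 1-5 scale; the paper's real rubric is not
# stated in the abstract.
TASK = "Rate the innovativeness of the student's answer on a scale from 1 to 5."


def zero_shot_prompt(answer: str) -> str:
    """Zero-shot: the task instruction plus the answer, with no examples."""
    return f"{TASK}\n\nAnswer: {answer}\nScore:"


def few_shot_prompt(answer: str, examples: List[Tuple[str, int]]) -> str:
    """Few-shot: prepend already-scored examples before the answer to rate."""
    shots = "\n\n".join(
        f"Answer: {text}\nScore: {score}" for text, score in examples
    )
    return f"{TASK}\n\n{shots}\n\nAnswer: {answer}\nScore:"


def cot_prompt(answer: str) -> str:
    """Chain-of-Thought: ask for explicit reasoning before the final score."""
    return (
        f"{TASK}\n\nAnswer: {answer}\n"
        "First list the main topics of the answer, then explain step by step "
        "how novel each one is, and finish with a line 'Score: <number>'."
    )
```

Each function returns a plain string that could be sent to any of the evaluated models. The CoT variant asks for the answer's topics first, which loosely mirrors the role the abstract assigns to topic extraction as a basis for scoring.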