Papers
arxiv:2603.06424

From Prompting to Preference Optimization: A Comparative Study of LLM-based Automated Essay Scoring

Published on Mar 6
Authors:
,
,
,
,
,
,
,

Abstract

A comprehensive comparison of large language model-based approaches for automated essay scoring in English as a second language reveals significant trade-offs between accuracy, computational cost, and robustness, with a hybrid method achieving state-of-the-art performance.

AI-generated summary

Large language models (LLMs) have recently reshaped Automated Essay Scoring (AES), yet prior studies typically examine individual techniques in isolation, limiting understanding of their relative merits for English as a Second Language (L2) writing. To bridge this gap, we presents a comprehensive comparison of major LLM-based AES paradigms on IELTS Writing Task~2. On this unified benchmark, we evaluate four approaches: (i) encoder-based classification fine-tuning, (ii) zero- and few-shot prompting, (iii) instruction tuning and Retrieval-Augmented Generation (RAG), and (iv) Supervised Fine-Tuning combined with Direct Preference Optimization (DPO) and RAG. Our results reveal clear accuracy-cost-robustness trade-offs across methods, the best configuration, integrating k-SFT and RAG, achieves the strongest overall results with F1-Score 93%. This study offers the first unified empirical comparison of modern LLM-based AES strategies for English L2, promising potential in auto-grading writing tasks. Code is public at https://github.com/MinhNguyenDS/LLM_AES-EnL2

Community

Sign up or log in to comment

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2603.06424 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2603.06424 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2603.06424 in a Space README.md to link it from this page.

Collections including this paper 1