THE VALIDITY OF AUTOMATED ESSAY SCORING USING NLP COMPARED TO HUMAN RATERS IN THE CONTEXT OF LANGUAGE CERTIFICATION EXAMS

Keywords: Automated Essay Scoring, Natural Language Processing, Writing Assessment



The integration of Automated Essay Scoring (AES) using Natural Language Processing (NLP) in educational settings has raised questions about its validity, particularly in high-stakes language certification exams. While AES offers the advantages of scalability and efficiency, its ability to replicate human judgment, especially on complex aspects of writing such as creativity and argumentation, remains a subject of debate. This study compares the validity of AES systems with that of human raters in assessing essays within the context of language certification exams. The primary objective is to evaluate the accuracy, reliability, and alignment of machine-generated scores with those provided by human raters across various writing criteria. A mixed-methods approach was employed, combining quantitative analysis of essay scores with qualitative insights from expert raters. The results indicate a high correlation between AES and human scores for grammar, coherence, and relevance (r = 0.88–0.91), but moderate discrepancies in assessing creativity and argumentation (r = 0.72). The findings suggest that while AES is effective for assessing technical aspects of writing, human raters remain essential for evaluating subjective elements. The study concludes that a hybrid approach combining AES with human evaluation may offer a more balanced, reliable, and comprehensive scoring system for language certification exams.
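
As a minimal illustration of how agreement between AES and human scores of the kind reported above can be quantified, the sketch below computes a Pearson correlation on per-essay scores for a single criterion. The score arrays, variable names, and rating scale are hypothetical assumptions for illustration only, not data or code from the study.

# Minimal sketch: Pearson correlation between human and AES scores for one criterion.
# The score values below are hypothetical placeholders, not data from the study.
import numpy as np
from scipy.stats import pearsonr

# One score per essay on the exam's rating scale (e.g., grammar), assumed 1-5 here.
human_scores = np.array([4, 5, 3, 4, 2, 5, 4, 3, 5, 4])
aes_scores = np.array([4, 5, 3, 3, 2, 5, 4, 4, 5, 4])

# Pearson r is the agreement statistic reported in the abstract
# (r = 0.88-0.91 for grammar, coherence, and relevance; r = 0.72 for creativity and argumentation).
r, p_value = pearsonr(human_scores, aes_scores)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")

In practice, such a correlation would be computed separately for each scoring criterion and for each pair of human and machine ratings, which is consistent with the criterion-level results reported here.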