THE VALIDITY OF AUTOMATED ESSAY SCORING USING NLP COMPARED TO HUMAN RATERS IN THE CONTEXT OF LANGUAGE CERTIFICATION EXAMS
The integration of Automated Essay Scoring (AES) using Natural Language Processing (NLP) in educational settings has raised questions about its validity, particularly in high-stakes language certification exams. While AES offers scalability and efficiency, its ability to replicate human judgment, especially for complex aspects of writing such as creativity and argumentation, remains a subject of debate. This study compares the validity of AES systems and human raters in assessing essays within the context of language certification exams. The primary objective is to evaluate the accuracy and reliability of machine-generated scores and their alignment with human ratings across various writing criteria. A mixed-methods approach was employed, combining quantitative analysis of essay scores with qualitative insights from expert raters. The results indicate a high correlation between AES and human scores for grammar, coherence, and relevance (r = 0.88–0.91), but moderate discrepancies in the assessment of creativity and argumentation (r = 0.72). The findings suggest that while AES is effective for assessing technical aspects of writing, human raters remain essential for evaluating subjective elements. The study concludes that a hybrid approach combining AES with human evaluation may offer a more balanced, reliable, and comprehensive scoring system for language certification exams.
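The score-alignment analysis described above rests on the Pearson correlation between machine-generated and human rubric scores. A minimal sketch of that computation is shown below; the score lists are hypothetical illustrative values, not the study's dataset:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical rubric scores (scale 1-10) assigned to the same eight essays
# by a human rater and an AES system:
human = [6, 7, 5, 8, 9, 4, 7, 6]
aes   = [6, 7, 6, 8, 9, 5, 7, 5]

print(round(pearson_r(human, aes), 2))
```

In practice such correlations are typically computed per rubric criterion (grammar, coherence, creativity, and so on), which is how criterion-level differences like those reported above (r ≈ 0.88–0.91 versus r ≈ 0.72) become visible.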
Copyright (c) 2025 Wirdatul Khasanah, Hale Yilmaz, Benjamin White

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
