THE VALIDITY OF AUTOMATED ESSAY SCORING USING NLP COMPARED TO HUMAN RATERS IN THE CONTEXT OF LANGUAGE CERTIFICATION EXAMS
The integration of Automated Essay Scoring (AES) using Natural Language Processing (NLP) in educational settings has raised questions about its validity, particularly in high-stakes language certification exams. While AES offers scalability and efficiency, its ability to replicate human judgment, especially for complex aspects of writing such as creativity and argumentation, remains a subject of debate. This study compares the validity of AES systems and human raters in assessing essays within the context of language certification exams. The primary objective is to evaluate the accuracy and reliability of machine-generated scores and their alignment with human ratings across various writing criteria. A mixed-methods approach was employed, combining quantitative analysis of essay scores with qualitative insights from expert raters. The results indicate a high correlation between AES and human scores for grammar, coherence, and relevance (r = 0.88–0.91), but moderate discrepancies in the assessment of creativity and argumentation (r = 0.72). The findings suggest that while AES is effective for assessing technical aspects of writing, human raters remain essential for evaluating subjective elements. The study concludes that a hybrid approach combining AES with human evaluation may offer a more balanced, reliable, and comprehensive scoring system for language certification exams.
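The score-alignment analysis described above rests on the Pearson correlation between machine-generated and human rubric scores. A minimal sketch of that computation is shown below; the score lists are hypothetical illustrative values, not the study's dataset:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical rubric scores (scale 1-10) assigned to the same eight essays
# by a human rater and an AES system:
human = [6, 7, 5, 8, 9, 4, 7, 6]
aes   = [6, 7, 6, 8, 9, 5, 7, 5]

print(round(pearson_r(human, aes), 2))
```

In practice such correlations are typically computed per rubric criterion (grammar, coherence, creativity, and so on), which is how criterion-level differences like those reported above (r ≈ 0.88–0.91 versus r ≈ 0.72) become visible.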
Copyright (c) 2025 Wirdatul Khasanah, Hale Yilmaz, Benjamin White

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
