Development of an Indonesian-English Parallel Corpus for Translation and Comparative Linguistics Research
Parallel corpora play a crucial role in translation studies, computational linguistics, and comparative linguistics. While large parallel corpora exist for major language pairs, resources for Indonesian-English translation research remain limited. This study develops a comprehensive Indonesian-English parallel corpus designed to support translation research and cross-linguistic comparison between the two languages, and to serve as a foundational resource for further work on machine translation, linguistic patterns, and cross-linguistic influence. The research adopts a corpus-driven methodology: the corpus is compiled from diverse sources, including literary texts, news articles, academic papers, and everyday discourse, to ensure broad representation of language use. The corpus is annotated for both syntax and semantics, with a focus on aligning sentence structures and identifying key linguistic features in both languages. Analysis of the corpus reveals significant differences and similarities in sentence structure, word order, and translation equivalence between Indonesian and English. The findings highlight the corpus's potential to support a wide range of linguistic and translation research: it offers a valuable resource for improving machine translation systems and provides insight into the challenges of translating between Indonesian and English.
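As a purely illustrative sketch (not drawn from the paper itself), a sentence-aligned parallel corpus entry of the kind described above might be represented as follows. All class names, field names, and example sentences here are assumptions for illustration; the length-ratio check is a common, simple heuristic for spotting possible misalignments, not the authors' actual method.

```python
from dataclasses import dataclass, field

@dataclass
class AlignedPair:
    """One sentence-aligned Indonesian-English corpus entry (hypothetical schema)."""
    indonesian: str
    english: str
    source: str  # e.g. "news", "literary", "academic", "everyday"
    tags: dict = field(default_factory=dict)  # optional syntactic/semantic annotations

    def length_ratio(self) -> float:
        """Character-length ratio; values far from 1.0 can signal misalignment."""
        return len(self.indonesian) / max(len(self.english), 1)

# Tiny illustrative corpus (example sentences are invented).
corpus = [
    AlignedPair("Saya membaca buku itu.", "I read that book.", "everyday"),
    AlignedPair("Cuaca hari ini sangat cerah.", "The weather today is very bright.", "everyday"),
]

# Flag pairs whose length ratio deviates strongly from 1.0 for manual review.
suspect = [p for p in corpus if not 0.5 <= p.length_ratio() <= 2.0]
```

This kind of lightweight quality filter is often run before deeper syntactic and semantic annotation, since badly aligned pairs would otherwise contaminate downstream comparisons.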
Copyright (c) 2025 Mariana Pakaja, Ming Pong, Rit Som

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.