Development of an Indonesian-English Parallel Corpus for Translation and Comparative Linguistics Research
Parallel corpora play a crucial role in translation studies, computational linguistics, and comparative linguistics. While large parallel corpora exist for major language pairs, resources for Indonesian-English translation research remain limited. This study develops a comprehensive Indonesian-English parallel corpus designed to support translation research and cross-linguistic comparison between the two languages, and to serve as a foundational resource for further work on machine translation, linguistic patterns, and cross-linguistic influence. The research adopts a corpus-driven methodology: the corpus is compiled from diverse sources, including literary texts, news articles, academic papers, and everyday discourse, to ensure broad representation of language use. The corpus is annotated for both syntax and semantics, with a focus on aligning sentence structures and identifying key linguistic features in both languages. Analysis of the corpus reveals significant differences and similarities in sentence structure, word order, and translation equivalence between Indonesian and English. The findings highlight the corpus's potential to support a wide range of linguistic and translation research: it offers a valuable resource for improving machine translation systems and provides insight into the challenges of translating between Indonesian and English.
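As a purely illustrative sketch (not drawn from the paper itself), a sentence-aligned parallel corpus entry of the kind described above might be represented as follows. All class names, field names, and example sentences here are assumptions for illustration; the length-ratio check is a common, simple heuristic for spotting possible misalignments, not the authors' actual method.

```python
from dataclasses import dataclass, field

@dataclass
class AlignedPair:
    """One sentence-aligned Indonesian-English corpus entry (hypothetical schema)."""
    indonesian: str
    english: str
    source: str  # e.g. "news", "literary", "academic", "everyday"
    tags: dict = field(default_factory=dict)  # optional syntactic/semantic annotations

    def length_ratio(self) -> float:
        """Character-length ratio; values far from 1.0 can signal misalignment."""
        return len(self.indonesian) / max(len(self.english), 1)

# Tiny illustrative corpus (example sentences are invented).
corpus = [
    AlignedPair("Saya membaca buku itu.", "I read that book.", "everyday"),
    AlignedPair("Cuaca hari ini sangat cerah.", "The weather today is very bright.", "everyday"),
]

# Flag pairs whose length ratio deviates strongly from 1.0 for manual review.
suspect = [p for p in corpus if not 0.5 <= p.length_ratio() <= 2.0]
```

This kind of lightweight quality filter is often run before deeper syntactic and semantic annotation, since badly aligned pairs would otherwise contaminate downstream comparisons.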
Copyright (c) 2025 Mariana Pakaja, Ming Pong, Rit Som

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.