Improving the Accuracy of English-Indonesian Machine Translation by Maximizing the Quality and Quantity of the Parallel Corpus

Authors

  • Herry Sujaini, Universitas Tanjungpura

DOI:

https://doi.org/10.25126/jtiik.2020732076

Abstract

Parallel corpora play a central role in statistical machine translation (SMT). Parallel corpora collected from various sources are often of poor quality, yet a large corpus is a primary requirement for good translation results. This study examines how the size and quality of the parallel corpus affect SMT. It uses the bilingual evaluation understudy (BLEU) metric to classify parallel sentence pairs as high or low quality. Applied to a parallel corpus of 1.5 million English-Indonesian sentence pairs, the method yields 900,000 high-quality pairs. Several SMT systems were trained with Moses on raw and filtered high-quality parallel corpora of various sizes, and their performance was evaluated. The experiments show that corpus size is a major factor in translation performance. In addition, better translation performance can be achieved with a smaller, high-quality corpus obtained through quality filtering. For English-Indonesian SMT, the results show that using the 60% of sentence pairs with good translation quality can improve translation quality by 7.31%.
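The abstract does not spell out how BLEU is used to label sentence pairs. One plausible reading, sketched below in Python, is to translate each source sentence with a baseline system and keep the pair only if that output scores high enough against the corpus's own target side. The `translate` callable, the threshold of 0.3, and the add-one smoothing on higher-order n-grams are all assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every n-gram of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, reference, max_n=4):
    """Sentence-level BLEU; orders above 1 use add-one smoothing (an assumption)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp or not ref:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clipped matches: an n-gram is credited at most as often as it occurs in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if n == 1:
            if matches == 0:
                return 0.0  # no shared words at all
            precision = matches / total
        else:
            precision = (matches + 1) / (total + 1)
        log_precision += math.log(precision) / max_n
    # Brevity penalty for hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_precision)


def filter_corpus(pairs, translate, threshold=0.3):
    """Keep (source, target) pairs whose target agrees with translate(source).

    `translate` is a hypothetical stand-in for a baseline MT system;
    `threshold` is an illustrative value, not one reported in the paper.
    """
    return [(src, tgt) for src, tgt in pairs
            if sentence_bleu(translate(src), tgt) >= threshold]


# Toy baseline "system": a lookup table standing in for a trained model.
toy_mt = {"i eat rice": "saya makan nasi", "good morning": "selamat pagi"}
pairs = [
    ("i eat rice", "saya makan nasi"),              # well aligned
    ("good morning", "harga saham naik hari ini"),  # misaligned target
]
kept = filter_corpus(pairs, toy_mt.__getitem__)
print(kept)
```

In this toy run only the well-aligned pair survives. On the corpus described above, a filter of this kind would be tuned so that roughly 900,000 of the 1.5 million pairs remain, which is consistent with the 60% figure in the abstract.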


Published

22-05-2020

Issue

Section

Computer Science

How to Cite

Peningkatan Akurasi Mesin Penerjemah Bahasa Inggris - Indonesia dengan Memaksimalkan Kualitas dan Kuantitas Korpus Paralel. (2020). Jurnal Teknologi Informasi Dan Ilmu Komputer, 7(3), 471-476. https://doi.org/10.25126/jtiik.2020732076