Architecture of an Indonesian-Language Automated Conversation System with Normalization of Informal Language into Standard Language

Muhammad Fathur Rahman Khairul; Rizal Setya Perdana

doi:10.25126/jtiik.2024117984

Authors

Muhammad Fathur Rahman Khairul, Universitas Brawijaya, Malang
Rizal Setya Perdana, Universitas Brawijaya, Malang

Keywords:

NLP, fine-tuning, GPT-2, normalization, machine translation, deep learning

Abstract

Communication is essential to daily life. People communicate in their own ways, shaped by their backgrounds and by how close the speakers are to one another. As a result, informal language evolves quickly and often coins new words that replace their formal counterparts. This is a problem from a natural language processing (NLP) perspective: NLP generally works only with formal language and cannot interpret the meaning of informal sentences. The authors therefore propose an approach that lets machines understand informal language by normalizing it into standard language. The approach fine-tunes a pre-trained Indonesian GPT-2 model on a parallel corpus so that it learns the meaning of informal text and can translate it into standard form. In experiments, the approach reached 91% accuracy and translated informal language well. This performance was obtained with the Adam optimizer at a learning rate of 1e-4, a batch size of 16, and a dropout rate of 0.5.
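As a rough illustration of the setup described in the abstract, the sketch below shows one way a parallel informal/standard pair could be turned into a single training sequence for a causal language model such as GPT-2, together with the reported hyperparameters. The separator string, the use of GPT-2's `<|endoftext|>` marker, the helper names, and the example sentence pairs are all assumptions made for this sketch; the abstract does not specify the prompt format or show the corpus.

```python
from dataclasses import dataclass

# Assumed markers: the abstract does not state how pairs are formatted.
SEP = " => "            # hypothetical separator between informal and standard text
EOS = "<|endoftext|>"   # GPT-2's usual end-of-text token, assumed here


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters reported in the abstract."""
    optimizer: str = "adam"
    learning_rate: float = 1e-4
    batch_size: int = 16
    dropout: float = 0.5


def to_training_text(informal: str, formal: str) -> str:
    """Join one parallel pair into a single sequence for causal LM fine-tuning."""
    return f"{informal.strip()}{SEP}{formal.strip()}{EOS}"


def to_prompt(informal: str) -> str:
    """Build the inference prompt; the model would continue it with the standard form."""
    return f"{informal.strip()}{SEP}"


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Illustrative pairs written for this sketch, not taken from the paper's corpus.
pairs = [
    ("gw gak tau", "saya tidak tahu"),
    ("km lg ngapain", "kamu sedang apa"),
]
config = FineTuneConfig()
texts = [to_training_text(informal, formal) for informal, formal in pairs]
```

Under this framing, normalization becomes ordinary next-token prediction: at inference time the model receives the output of `to_prompt` and generates tokens until it emits the end marker.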
