BIJAKAWEB: A Web-Based Platform for Hate Speech Detection in Indonesian-Language News Comments

Authors

  • Moh. Firdaus, Universitas Prasetya Mulya, Tangerang Regency
  • Permata Nur Miftahur Rizki, Universitas Prasetya Mulya, Tangerang Regency

DOI:

https://doi.org/10.25126/jtiik.1148719

Keywords:

IndoBERT, Hate Speech, Django, Web Scraping, News Portal

Abstract

Indonesia has more than 221 million internet users, most of whom go online to keep up with the latest news. Detik, Kompas, and CNNIndonesia are among the most popular online news portals for many Indonesians. The comment sections on these portals let readers respond to articles, but they are often poorly controlled, which gives rise to hate speech. Moderation features such as "Report" exist, but this manual approach is often slow and ineffective. This study develops a system that automatically detects hate speech in online news comments. More than 15,000 comments were scraped from news portals using Python libraries and then manually labeled as either "Hate" or "Non-Hate", yielding 11,478 labeled comments in two balanced classes. The labeled dataset was used to fine-tune the IndoBERT model for 14 epochs; the best accuracy, 95.91%, was reached at epoch 14. The best-performing model was deployed on a web platform named BijakaWeb (Web Bijak Dalam Berkomentar, roughly "a web for commenting wisely"), built with the full-stack Django framework. The contributions of this work are a new dataset for related research, a new fine-tuned IndoBERT model publicly available on HuggingFace, and the BijakaWeb platform, which scrapes comments and predicts hate speech in real time.

The authors hope this work will help news portals moderate online comments and combat hate speech, and that it provides a model other online news platforms can use and adapt to prevent the spread of hate speech on the internet.
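The scrape-then-predict flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `moderate`, `Prediction`, and `stub_classifier`, the 0.5 threshold, and the whitespace cleanup step are all assumptions made for illustration. The classifier is passed in as a plain function that returns a hate probability, so the stub used here could be swapped for the fine-tuned IndoBERT model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# The two classes named in the abstract.
LABELS = ("Non-Hate", "Hate")


@dataclass
class Prediction:
    comment: str   # cleaned comment text
    label: str     # "Hate" or "Non-Hate"
    score: float   # classifier's estimated probability of hate speech


def moderate(comments: Iterable[str],
             classify: Callable[[str], float],
             threshold: float = 0.5) -> List[Prediction]:
    """Label each comment with a classifier that returns P(hate) in [0, 1].

    The threshold of 0.5 is an assumed default, not a value from the paper.
    """
    results: List[Prediction] = []
    for text in comments:
        # Collapse runs of whitespace that scraping often leaves behind.
        cleaned = " ".join(text.split())
        if not cleaned:
            continue  # skip comments that are empty after cleanup
        p_hate = classify(cleaned)
        label = LABELS[1] if p_hate >= threshold else LABELS[0]
        results.append(Prediction(cleaned, label, p_hate))
    return results


def stub_classifier(text: str) -> float:
    """Hypothetical stand-in for the fine-tuned model (illustration only)."""
    blocklist = {"badword"}  # placeholder token, not a real lexicon
    words = set(text.lower().split())
    return 0.9 if words & blocklist else 0.1


if __name__ == "__main__":
    scraped = ["  berita   bagus ", "", "kamu BADWORD sekali"]
    for pred in moderate(scraped, stub_classifier):
        print(f"{pred.label:8s} {pred.score:.2f}  {pred.comment}")
```

In a real deployment, `classify` would presumably wrap the published fine-tuned model, for example through a text-classification pipeline from the Hugging Face Transformers library. The model identifier is not given in the abstract, so it is left out here.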

 

Published

26-08-2024

Issue

Vol. 11 No. 4 (2024)

Section

Computer Science

How to Cite

BIJAKAWEB: Platform Berbasis Web Untuk Deteksi Hate Speech Pada Komentar Berita Bahasa Indonesia [BIJAKAWEB: A Web-Based Platform for Hate Speech Detection in Indonesian-Language News Comments]. (2024). Jurnal Teknologi Informasi Dan Ilmu Komputer, 11(4), 939-948. https://doi.org/10.25126/jtiik.1148719