BIJAKAWEB: A Web-Based Platform for Hate Speech Detection in Indonesian-Language News Comments
DOI:
https://doi.org/10.25126/jtiik.1148719
Keywords:
IndoBERT, Hate Speech, Django, Web Scraping, News Portal
Abstract
The number of internet users in Indonesia has surpassed 221 million, with the majority of the population using the internet to stay up to date with the latest news. Detik, Kompas, and CNNIndonesia are among the most popular online news portals in the country. The comment sections on these portals let readers respond to news articles, but they are often unmoderated, which allows hate speech to spread. Moderation features such as "Report" are available, but this manual approach is often slow and ineffective. This study aims to develop an automatic detection system for hate speech in online news comments. More than 15,000 comments were scraped from news portals using Python libraries and then manually labeled as either "Hate" or "Non-Hate", yielding 11,478 labeled comments divided into two balanced classes. The labeled dataset was then used to fine-tune the IndoBERT model for 14 epochs, with the best accuracy of 95.91% reached at the 14th epoch. The best-performing model was deployed on a web platform named BijakaWeb (Web Bijak Dalam Berkomentar), built with the full-stack Django framework. The contributions of this research are a new dataset for related research, a fine-tuned IndoBERT model publicly available on HuggingFace, and the BijakaWeb platform, which performs scraping and hate speech prediction in real time.
It is hoped that this research can help news portals moderate online news comments and combat hate speech, and that it provides a model other online news platforms can use and adapt to prevent the spread of hate speech on the internet.
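The prediction step described in the abstract (classify each scraped comment as "Hate" or "Non-Hate" and surface the hateful ones) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `Prediction` type, the 0.5 threshold, and the toy classifier are all hypothetical, and the abstract does not give the identifier of the published HuggingFace model, so `model_id` must be supplied by the reader. The label strings returned by a real checkpoint depend on its configuration and may need mapping to "Hate" and "Non-Hate".

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

# The two categories named in the abstract.
HATE = "Hate"
NON_HATE = "Non-Hate"

# A classifier maps a comment to (label, confidence score).
Classifier = Callable[[str], Tuple[str, float]]


@dataclass
class Prediction:
    comment: str
    label: str
    score: float


def load_indobert_classifier(model_id: str) -> Classifier:
    """Wrap a Hugging Face text-classification pipeline as a plain callable.

    Assumption: `model_id` names a fine-tuned IndoBERT checkpoint whose
    labels are "Hate" / "Non-Hate". Requires the `transformers` package.
    """
    from transformers import pipeline  # lazy import: optional dependency

    clf = pipeline("text-classification", model=model_id)

    def classify(text: str) -> Tuple[str, float]:
        result = clf(text, truncation=True)[0]
        return result["label"], float(result["score"])

    return classify


def flag_hate_comments(
    comments: Iterable[str],
    classify: Classifier,
    threshold: float = 0.5,  # hypothetical cut-off, not from the paper
) -> List[Prediction]:
    """Return the comments the classifier labels as hate speech."""
    flagged = []
    for comment in comments:
        text = comment.strip()
        if not text:
            continue  # skip empty comments left over from scraping
        label, score = classify(text)
        if label == HATE and score >= threshold:
            flagged.append(Prediction(text, label, score))
    return flagged


def toy_classifier(text: str) -> Tuple[str, float]:
    """Stand-in for the model so the sketch runs offline; NOT IndoBERT."""
    if "TOY_SLUR" in text:
        return HATE, 0.99
    return NON_HATE, 0.99
```

In a Django view, `flag_hate_comments` would be called with the comments scraped for a given article and a classifier created once at start-up through `load_indobert_classifier`. Passing the classifier in as a callable keeps the moderation logic testable without downloading the model.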
License
Copyright (c) 2024 Jurnal Teknologi Informasi dan Ilmu Komputer
This article is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0).
Authors who publish in this journal agree to the following terms:
- Authors retain copyright and grant the journal the right of first publication, with the work simultaneously licensed under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license, which allows others to share the work with an acknowledgement of its authorship and initial publication in this journal.
- Authors may enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (for example, posting it to an institutional repository or publishing it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (for example, in institutional repositories or on their own website) before and during the submission process, as this can lead to productive exchanges as well as earlier and greater citation of the published work. (See The Effect of Open Access.)