Text-Based Indonesian-Language Spam Detection Using the BERT Model
DOI: https://doi.org/10.25126/jtiik.1168121
Keywords: spam, spam detection, natural language processing, BERT, text mining, text classification
Abstract
Spam in SMS and email degrades the experience of users of these technologies. Broadly, spam is the act of sending unwanted or unsolicited messages to a large number of people, and it now appears in many forms, including web and multimedia content. This study evaluates BERT-based models, specifically IndoBERT and MultilingualBERT, for detecting and classifying Indonesian-language spam in SMS and email messages. The selected models are trained to distinguish spam from non-spam messages. In experiments on SMS and email datasets, IndoBERT reached an accuracy of 98% and MultilingualBERT an accuracy of 95%. These results indicate that BERT-based models are effective at detecting spam messages in Indonesian.
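The accuracy figures in the abstract summarize how often a classifier's spam/non-spam predictions match the true labels. The page does not show the authors' evaluation code, so the following is only a minimal sketch of how accuracy and the related precision, recall, and F1 scores could be computed for a binary spam classifier. The function name `evaluate`, the label strings "spam" and "ham", and the toy labels are all hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class BinaryMetrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def evaluate(y_true: Sequence[str], y_pred: Sequence[str],
             positive: str = "spam") -> BinaryMetrics:
    """Compute binary classification metrics, treating `positive` as the spam class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one example is required")

    # Count the four cells of the confusion matrix.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return BinaryMetrics(accuracy, precision, recall, f1)


if __name__ == "__main__":
    # Hypothetical toy labels, for illustration only (not the paper's data).
    true_labels = ["spam", "ham", "spam", "ham"]
    predicted = ["spam", "ham", "ham", "ham"]
    print(evaluate(true_labels, predicted))
```

Accuracy alone can look high on an imbalanced dataset where most messages are non-spam, which is why precision and recall for the spam class are usually reported alongside it.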
License
Copyright (c) 2024 Jurnal Teknologi Informasi dan Ilmu Komputer

This article is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0).
Authors who publish in this journal agree to the following terms:
- Authors retain copyright and grant the journal the right of first publication, with the work simultaneously licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0), which allows others to share the work with an acknowledgement of its authorship and its initial publication in this journal.
- Authors may enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (for example, posting it to an institutional repository or publishing it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (for example, in institutional repositories or on their own website) before and during the submission process, as this can lead to productive exchanges as well as earlier and greater citation of the published work. (See The Effect of Open Access.)