Automated Detection of Standard and Nonstandard Words in Twitter Data Based on the KBBI

Authors

  • M. Irfan Raif, Universitas Maritim Raja Ali Haji, Tanjung Pinang
  • Nuraisa Novia Hidayati, Badan Riset dan Inovasi Nasional, Jakarta
  • Tekad Matulatan, Universitas Maritim Raja Ali Haji, Tanjung Pinang

DOI:

https://doi.org/10.25126/jtiik.20241127404

Abstract

This study develops a system that automatically distinguishes standard from nonstandard words in Twitter data, using the Kamus Besar Bahasa Indonesia (KBBI) as the reference. Because Twitter users frequently write nonstandard words, normalizing those words is an important step in pre-processing and analyzing tweets for social media text classification. The system helps researchers identify slang and other nonstandard words easily, and it makes tweets that reflect current language trends easier to understand. The approach comprises data collection, preprocessing, identification of nonstandard language, removal of affixed words, slang identification, and a lexicon-based method that uses an opinion dictionary. This approach supports sentiment analysis in text mining and is intended to make sentiment classification on Twitter data more accurate. Experimental results show that these preprocessing steps improve the accuracy of polarity determination, reaching an accuracy of 66.66% and an F1-score of 61.40% with InSet.
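The steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the three small dictionaries below are hypothetical stand-ins for the KBBI word list, a slang dictionary, and the InSet lexicon, and the function names are chosen only for this example. The sketch tokenizes a tweet, labels each token as standard ("baku"), slang, or nonstandard ("tidak baku") by dictionary lookup, replaces slang with its standard form, and sums lexicon weights to assign a polarity.

```python
import re

# Hypothetical stand-ins: the real KBBI word list, slang dictionary, and
# InSet lexicon are far larger and are not reproduced here.
KBBI_WORDS = {"saya", "suka", "film", "ini", "bagus", "sekali", "buruk"}
SLANG_MAP = {"bgt": "sekali", "sy": "saya"}
SENTIMENT_WEIGHTS = {"suka": 3, "bagus": 4, "buruk": -4}  # illustrative weights


def tokenize(tweet):
    """Lowercase the tweet, drop URLs, mentions, and hashtags, keep letter runs."""
    text = re.sub(r"https?://\S+|[@#]\w+", " ", tweet.lower())
    return re.findall(r"[a-z]+", text)


def classify_tokens(tokens):
    """Label each token 'baku', 'slang', or 'tidak baku' by dictionary lookup."""
    labels = []
    for tok in tokens:
        if tok in KBBI_WORDS:
            labels.append((tok, "baku"))
        elif tok in SLANG_MAP:
            labels.append((tok, "slang"))
        else:
            labels.append((tok, "tidak baku"))
    return labels


def normalize(tokens):
    """Replace known slang tokens with their standard form; keep the rest."""
    return [SLANG_MAP.get(tok, tok) for tok in tokens]


def polarity(tokens):
    """Sum lexicon weights and map the total to a polarity label."""
    score = sum(SENTIMENT_WEIGHTS.get(tok, 0) for tok in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


tokens = tokenize("Film ini bagus bgt, sy suka! https://t.co/x @teman #review")
labels = classify_tokens(tokens)
normalized = normalize(tokens)
label = polarity(normalized)
```

In this sketch, plain summation ignores negation and word order, and affix handling is omitted entirely, so it should be read as an outline of the lookup-and-score idea rather than a complete pipeline.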



Published

25-04-2024

Section

Computer Science

How to Cite

Otomatisasi Pendeteksi Kata Baku dan Tidak Baku pada Data Twitter Berbasis KBBI. (2024). Jurnal Teknologi Informasi Dan Ilmu Komputer, 11(2), 337-348. https://doi.org/10.25126/jtiik.20241127404