Performance Evaluation of Apache Spark MLlib for Fake News Classification in Indonesian

Antonius Angga Kurniawan; Metty Mustikasari

doi:10.25126/jtiik.2022923538

Authors

Antonius Angga Kurniawan, Universitas Gunadarma, Depok
Metty Mustikasari, Universitas Gunadarma, Depok

Abstract

Machine learning is used to analyze, classify, or predict data. Machine learning tasks call for tools and an execution environment that perform well enough to deliver both good accuracy and short running times. Apache Spark MLlib is a machine learning library with excellent capability and speed, because it processes data in memory. This study uses Apache Spark MLlib to classify Indonesian-language fake news on a dataset of 1,786 records collected from TurnBackHoax.id, a site that publishes hoaxes alongside fact checks. The classification algorithms applied were Naïve Bayes, Gradient-Boosted Tree, SVM, and Logistic Regression. These four were chosen because some have a proven track record in classification, while others are used less often but also classify well. The data processing stages were preprocessing, feature extraction, and application of the algorithms. Evaluation was based on accuracy, test error, F1-score, confusion matrix, and running time. The results show that Apache Spark MLlib performed quickly and well: the fastest running time, 6.46 seconds, was obtained with Logistic Regression. Accuracy was also fairly good, with an average test error of only 0.180 across the four algorithms, and the average F1-score was 0.818. The confusion matrices were likewise favorable, since correct predictions far outnumbered incorrect ones.
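To make the evaluation metrics named in the abstract concrete, the following is a minimal pure-Python sketch of how a confusion matrix, accuracy, test error, and F1-score relate to one another for a two-class task. It is not the authors' code and does not use Spark. The labels are made up for illustration (1 = hoax, 0 = fact), the function names `confusion_matrix` and `evaluate` are hypothetical, test error is assumed to mean the fraction of misclassified examples (1 minus accuracy), and the F1-score here is computed for the positive class only, which may differ from the averaging used in the study.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) counts for a binary classification task."""
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts["tp"], counts["fp"], counts["fn"], counts["tn"]


def evaluate(y_true, y_pred, positive=1):
    """Compute accuracy, test error, precision, recall, and F1-score."""
    tp, fp, fn, tn = confusion_matrix(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Assumption: test error is the misclassification rate (1 - accuracy).
    test_error = (fp + fn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall (positive class only).
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": accuracy,
        "test_error": test_error,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels for illustration only: 1 = hoax, 0 = fact.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

metrics = evaluate(y_true, y_pred)
print(confusion_matrix(y_true, y_pred))
print(metrics)
```

Under this reading, an average test error of 0.180 would correspond to an average accuracy of about 0.820, which sits close to the reported average F1-score of 0.818.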
