A Strategy for Handling Class Imbalance in a Neural Network-Based Classification Model of Kartu Indonesia Pintar Kuliah Recipients Using a Combination of SMOTE and ENN

Authors

  • Zaqiatud Darojah, Politeknik Elektronika Negeri Surabaya
  • Ronny Susetyoko, Politeknik Elektronika Negeri Surabaya
  • Nana Ramadijanti, Politeknik Elektronika Negeri Surabaya

DOI:

https://doi.org/10.25126/jtiik.20231026480

Abstract (translated from Indonesian)

The limited government quota for the Kartu Indonesia Pintar Kuliah (KIP Kuliah) program requires higher education institutions to select carefully the prospective students who are entitled to receive it. Building a classification model of KIP Kuliah recipients from historical data is one way to help institutions target their selection accurately. This study aims to build a classification model of KIP Kuliah recipients using a Neural Network (NN). A data-processing-level strategy is used to address the class imbalance between KIP Kuliah recipients, the minority class, and non-recipients, the majority class. The technique combines the Synthetic Minority Oversampling Technique (SMOTE) for oversampling, the Edited Nearest Neighbor rule (ENN) for undersampling, and undersampling by direct deletion of selected samples. In the combined scheme, the majority class is first grouped into several sub-classes (clusters) using the k-means algorithm. SMOTE and ENN are applied together, with a certain sampling ratio, to a dataset made up of the minority class and the majority sub-class that is its nearest neighbor. The sample deletion method is applied to the majority sub-classes that lie at a very large distance from the minority class. The aim of the proposed scheme is to minimize the generation of false samples in the minority class and the deletion of informative samples from the majority class. Simulation results show that combining undersampling and oversampling under the proposed scheme significantly improves the performance of the NN classification model.
The best classification model achieves an accuracy of 93.45%, a TPR of 90.00%, a TNR of 93.67%, a G-Mean of 91.51%, and an nMCC of 81.25%.
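
The reported figures are all functions of the binary confusion matrix. As a rough illustration, the sketch below computes them with the usual textbook definitions, treating KIP Kuliah recipients as the positive (minority) class. The normalization nMCC = (MCC + 1) / 2 is the common convention and is assumed here, since this page does not define it. The function name `binary_metrics` and the counts are hypothetical and are not the paper's data.

```python
import math


def _ratio(num, den):
    """Safe division that returns 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def binary_metrics(tp, fn, tn, fp):
    """Compute accuracy, TPR, TNR, G-Mean, MCC and nMCC from confusion-matrix counts."""
    accuracy = _ratio(tp + tn, tp + fn + tn + fp)
    tpr = _ratio(tp, tp + fn)   # sensitivity: recall on the positive (minority) class
    tnr = _ratio(tn, tn + fp)   # specificity: recall on the negative (majority) class
    g_mean = math.sqrt(tpr * tnr)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = _ratio(tp * tn - fp * fn, denom)
    nmcc = (mcc + 1) / 2        # assumed convention: rescale MCC from [-1, 1] to [0, 1]
    return {"accuracy": accuracy, "tpr": tpr, "tnr": tnr,
            "g_mean": g_mean, "mcc": mcc, "nmcc": nmcc}


# Hypothetical counts for illustration only (20 positives, 300 negatives).
m = binary_metrics(tp=18, fn=2, tn=270, fp=30)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With imbalanced data, accuracy alone can look high even when the minority class is poorly recognized, which is why TPR, TNR, G-Mean, and MCC-based scores are reported alongside it.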

 

Abstract 

 

The limited quota for the Kartu Indonesia Pintar Kuliah (KIP Kuliah) program requires universities to select carefully the students who are entitled to receive it. This study aims to build a classification model for KIP Kuliah recipients using a Neural Network (NN) that universities can use when selecting prospective recipients. To address the class imbalance in the KIP Kuliah recipient data, we propose a hybrid sampling technique that combines the Synthetic Minority Over-sampling Technique (SMOTE), the Edited Nearest Neighbor (ENN) rule, and a selected-sample deletion method under a new scheme. First, the majority class is clustered into several sub-classes using the k-means algorithm. SMOTE and ENN are then applied together, with a certain sampling ratio, to a dataset consisting of the minority class and the majority sub-class that is its nearest neighbor. Next, the selected-sample deletion method is applied to the majority sub-classes that lie very far from the minority class. Finally, the resampling results are combined into a single training dataset for the neural network. The objective of the proposed scheme is to minimize both the generation of ‘false samples’ in the minority class and the removal of informative samples from the majority class. The results show that the proposed scheme significantly improves the performance of the NN classification model. The best classification model achieves an accuracy of 93.45%, a TPR of 90.00%, a TNR of 93.67%, a G-Mean of 91.51%, and a normalized MCC (nMCC) of 81.25%.
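
The resampling scheme described above can be sketched in a few dozen lines. This is only one possible reading, not the authors' implementation. It makes several assumptions that the abstract does not state: the nearest majority sub-class is chosen by centroid distance, every other sub-class is treated as distant and thinned by random deletion, SMOTE tops the minority class up to the size of the nearest sub-class, and ENN cleans both classes. All function names and the `keep_far` parameter are hypothetical.

```python
import math
import random


def centroid(points):
    """Component-wise mean of a non-empty list of feature vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns the non-empty clusters as lists of points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        centers = [centroid(c) if c else centers[i] for i, c in enumerate(clusters)]
    return [c for c in clusters if c]


def smote(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points by interpolating between a minority point
    and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        p = rng.choice(minority)
        neighbours = sorted((q for q in minority if q is not p),
                            key=lambda q: math.dist(p, q))[:k]
        q = rng.choice(neighbours)
        t = rng.random()
        synthetic.append([a + t * (b - a) for a, b in zip(p, q)])
    return synthetic


def enn(samples, k=3):
    """Edited Nearest Neighbours: keep a sample only if most of its k nearest
    neighbours share its label."""
    kept = []
    for i, (x, y) in enumerate(samples):
        others = [s for j, s in enumerate(samples) if j != i]
        nearest = sorted(others, key=lambda s: math.dist(x, s[0]))[:k]
        agree = sum(1 for _, label in nearest if label == y)
        if agree * 2 > len(nearest):
            kept.append((x, y))
    return kept


def hybrid_resample(minority, majority, n_clusters=3, keep_far=0.5, seed=0):
    """Cluster the majority class, apply SMOTE + ENN near the minority class,
    thin the distant sub-classes, and merge everything into one training set."""
    rng = random.Random(seed)
    clusters = kmeans(majority, n_clusters, seed=seed)
    target = centroid(minority)
    clusters.sort(key=lambda c: math.dist(centroid(c), target))
    near, far = clusters[0], clusters[1:]
    n_new = max(0, len(near) - len(minority))            # assumed sampling ratio: 1:1
    labelled = [(x, 1) for x in minority + smote(minority, n_new, rng=rng)]
    labelled += [(x, 0) for x in near]
    result = enn(labelled)
    for cluster in far:                                   # assumed: random deletion
        n_keep = max(1, int(len(cluster) * keep_far))
        result += [(x, 0) for x in rng.sample(cluster, n_keep)]
    return result


# Toy two-dimensional data for illustration only.
data_rng = random.Random(1)
minority = [[data_rng.gauss(0, 0.5), data_rng.gauss(0, 0.5)] for _ in range(10)]
majority = [[data_rng.gauss(cx, 0.5), data_rng.gauss(cy, 0.5)]
            for cx, cy in [(2, 2), (8, 8), (-8, 8)] for _ in range(40)]
resampled = hybrid_resample(minority, majority)
n_min = sum(1 for _, y in resampled if y == 1)
print("minority:", n_min, "majority:", len(resampled) - n_min)
```

In practice, library implementations such as the `SMOTE`, `EditedNearestNeighbours`, and `SMOTEENN` classes in imbalanced-learn would normally replace the hand-written helpers; they are spelled out here only to keep the sketch self-contained.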

Author Biographies

  • Zaqiatud Darojah, Politeknik Elektronika Negeri Surabaya

    Applied Bachelor's Program in Mechatronics Engineering

    Department of Mechanical and Energy Engineering

  • Ronny Susetyoko, Politeknik Elektronika Negeri Surabaya

    Applied Bachelor's Program in Applied Data Science

    Department of Informatics and Computer Engineering

  • Nana Ramadijanti, Politeknik Elektronika Negeri Surabaya

    Applied Bachelor's Program in Informatics Engineering

    Department of Informatics and Computer Engineering

Published

14-04-2023

Issue

Section

Computer Science

How to Cite

Strategi Penanganan Imbalance Class Pada Model Klasifikasi Penerima Kartu Indonesia Pintar Kuliah Berbasis Neural Network Menggunakan Kombinasi SMOTE dan ENN. (2023). Jurnal Teknologi Informasi Dan Ilmu Komputer, 10(2), 457-466. https://doi.org/10.25126/jtiik.20231026480