Classification Model with Logistic Regression and Recursive Feature Elimination on Imbalanced Data
DOI:
https://doi.org/10.25126/jtiik.1148198
Abstract
Logistic regression (LR) is a popular classification method that is widely used across research fields and can yield good results in both classification and prediction problems. However, a large number of features increases the computational burden and can reduce classification performance. Three datasets were used in this research: Bank Marketing, Glass, and Musk II. The datasets are sourced from the UCI Repository and have differing characteristics. They pose two challenges: class imbalance and a large number of features. The research has two main stages: preprocessing and classification. In the preprocessing stage, features are selected through recursive feature elimination (RFE) and the data are balanced using the SMOTE technique. The classification stage applies logistic regression. Ridge regression (L2 regularization) is applied to prevent overfitting during the validation stage of the LR model. Model performance is evaluated using confusion matrices and ROC curves. The results show that feature selection and class balancing have a positive impact. In the Receiver Operating Characteristic (ROC) analysis, the LR+RFE+SMOTE model achieved an area under the curve of 93%. This result is better than those of four other classification models, namely Naïve Bayes, Decision Tree, K-NN, and Random Forest.
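The class-balancing step named in the abstract can be illustrated with a small sketch. SMOTE (Synthetic Minority Over-sampling Technique) creates new minority-class samples by interpolating between an existing minority sample and one of its nearest minority neighbours. The code below is a simplified, standard-library-only illustration and not the authors' implementation: the function name `smote`, its parameters, and the toy data are assumptions made for this example, and it picks base samples at random instead of generating a fixed number per sample.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    Simplified sketch: each synthetic point lies on the line segment between
    a randomly chosen minority sample and one of its k nearest minority
    neighbours (Euclidean distance).
    """
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbours than exist
    if k < 1:
        raise ValueError("SMOTE needs at least two minority samples")
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy imbalanced data (hypothetical values): 8 majority vs 3 minority samples.
majority = [[float(i), float(i % 3)] for i in range(8)]
minority = [[10.0, 10.0], [11.0, 10.5], [10.5, 11.5]]

# Generate enough synthetic points to match the majority class size.
new_points = smote(minority, n_new=len(majority) - len(minority), k=2)
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Because every synthetic point is a convex combination of two real minority samples, it stays inside the region the minority class already occupies. In practice, oversampling is usually applied to the training split only, so that evaluation data contain no synthetic samples.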
Copyright (c) 2024 Jurnal Teknologi Informasi dan Ilmu Komputer
This article is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0).