Classification Model with Logistic Regression and Recursive Feature Elimination on Imbalanced Data

Authors

  • Sutarman, Universitas Sumatera Utara, Medan
  • Rimbun Siringoringo, Universitas Sumatera Utara, Medan
  • Dedy Arisandi, Universitas Sumatera Utara, Medan
  • Edi Kurniawan, Universitas Sumatera Utara, Medan
  • Erna Budhiarti Nababan, Universitas Sumatera Utara, Medan

DOI:

https://doi.org/10.25126/jtiik.1148198

Abstract

Logistic regression (LR) is a popular classification method that is widely used across many studies and can give good results on both classification and prediction problems. A large number of features, however, increases the computational burden and degrades classification performance. This study uses three datasets from the UCI Repository with differing characteristics: Bank Marketing, Glass, and Musk II. These datasets pose two challenges: class imbalance and a large number of features. The study has two main stages, preprocessing and classification. The preprocessing stage applies feature selection through recursive feature elimination (RFE) and balances the data using the SMOTE technique. The classification stage applies logistic regression, with ridge regression (L2 regularization) used to avoid overfitting during validation of the LR model. Model performance is evaluated using confusion matrices and ROC curves. The results show that feature selection and class balancing have a positive effect: the LR+RFE+SMOTE model achieves an area under the ROC curve of 93%. This result is better than that of four other classification models, namely Naïve Bayes, Decision Tree, K-NN, and Random Forest.
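The stages named in the abstract (feature selection by recursive feature elimination, class balancing with SMOTE, L2-regularized logistic regression, and evaluation by confusion matrix and ROC area) can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation. The toy data, the hyperparameters (`lam`, `lr`, `epochs`, `k`, `n_keep`), the helper names, and the ordering of RFE before SMOTE are all assumptions made for illustration. Ranking features by weight magnitude also assumes the features are on comparable scales.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(X, w, b):
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b) for row in X]

def fit_logreg_l2(X, y, lam=0.01, lr=0.1, epochs=300):
    # Batch gradient descent on mean log-loss + (lam / 2) * ||w||^2; bias is not penalized.
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        errs = [p - t for p, t in zip(predict_proba(X, w, b), y)]
        grad_w = [sum(e * row[j] for e, row in zip(errs, X)) / n for j in range(d)]
        w = [wj - lr * (gj + lam * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * sum(errs) / n
    return w, b

def select(X, cols):
    return [[row[j] for j in cols] for row in X]

def rfe(X, y, n_keep):
    # Refit, drop the feature with the smallest |weight|, repeat until n_keep remain.
    kept = list(range(len(X[0])))
    while len(kept) > n_keep:
        w, _ = fit_logreg_l2(select(X, kept), y)
        kept.pop(min(range(len(kept)), key=lambda j: abs(w[j])))
    return kept

def smote(X, y, k=5, rng=None):
    # Oversample class 1 (assumed to be the minority) until both classes are equal in size.
    rng = rng or random.Random(0)
    minority = [row for row, t in zip(X, y) if t == 1]
    X_out, y_out = list(X), list(y)
    for _ in range(max(len(y) - 2 * len(minority), 0)):
        base = rng.choice(minority)
        others = [m for m in minority if m is not base]
        neighbour = rng.choice(sorted(others, key=lambda m: math.dist(base, m))[:k])
        gap = rng.random()  # position on the segment between base and neighbour
        X_out.append([a + gap * (c - a) for a, c in zip(base, neighbour)])
        y_out.append(1)
    return X_out, y_out

def confusion_matrix(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    return tuple(pairs.count(p) for p in [(0, 0), (0, 1), (1, 0), (1, 1)])  # tn, fp, fn, tp

def roc_auc(y_true, scores):
    # Probability that a random positive outranks a random negative (ties count half).
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: two informative features plus three noise features.
rng = random.Random(42)

def make_rows(n, centre):
    return [[rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)]
            + [rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(n)]

majority, minority = make_rows(120, -1.0), make_rows(20, 1.0)
X_train, y_train = majority[:90] + minority[:15], [0] * 90 + [1] * 15
X_test, y_test = majority[90:] + minority[15:], [0] * 30 + [1] * 5

kept = rfe(X_train, y_train, n_keep=2)                          # feature selection
X_bal, y_bal = smote(select(X_train, kept), y_train, rng=rng)   # balance training data only
w, b = fit_logreg_l2(X_bal, y_bal)                              # L2-regularized LR
scores = predict_proba(select(X_test, kept), w, b)
preds = [int(s >= 0.5) for s in scores]

print("kept feature indices:", kept)
print("confusion (tn, fp, fn, tp):", confusion_matrix(y_test, preds))
print("ROC AUC:", round(roc_auc(y_test, scores), 3))
```

Oversampling is applied to the training split only, so the held-out data keeps its original class ratio. In practice one would more likely reach for library implementations such as scikit-learn's `RFE` and `LogisticRegression` and imbalanced-learn's `SMOTE` rather than hand-written versions.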

Published

26-08-2024

Section

Computer Science

How to Cite

Model Klasifikasi dengan Logistic Regression dan Recursive Feature Elimination pada Data Tidak Seimbang. (2024). Jurnal Teknologi Informasi Dan Ilmu Komputer, 11(4), 735-742. https://doi.org/10.25126/jtiik.1148198