Optimizing Imbalanced Data in Drug-Target Interaction Using Sampling and Ensemble Support Vector Machine

Authors

  • Nabila Sekar Ramadhanti, Institut Pertanian Bogor
  • Wisnu Ananta Kusuma, Institut Pertanian Bogor
  • Annisa Annisa, Institut Pertanian Bogor

DOI:

https://doi.org/10.25126/jtiik.2020762857

Abstract

Imbalanced data is one of the problems that arise in prediction and classification tasks. This research focuses on handling imbalanced data in drug-target (compound-protein) interaction prediction. Compound-protein interaction databases contain many target proteins and drug compounds whose interactions have not yet been validated experimentally. Because these interactions are unknown, the proportion of pairs with known interactions to pairs without known interactions is imbalanced. Highly imbalanced interaction data can bias prediction results. There are many ways to handle imbalanced data; this research implements a method that combines Biased Support Vector Machine (BSVM), oversampling, and undersampling with an ensemble of Support Vector Machines (SVM). These methods have already been applied to imbalanced data of other kinds, such as image data. This research explores the effect of the sampling used in the combined method on compound-protein interaction data. The method was tested on the Nuclear Receptor, G-Protein Coupled Receptor (GPCR), and Ion Channel datasets, which have imbalance ratios of 14.6%, 32.36%, and 28.2%, respectively. Evaluation on these three datasets gave area under the curve (AUC) values of 63.4%, 71.4%, and 61.3% and F-measure values of 54%, 60.7%, and 39%, respectively. The accuracy of the proposed method is still fairly good, although its accuracy and precision are lower than those of an SVM trained without any treatment of the imbalance. The higher accuracy of the plain SVM is biased, however, because its AUC and F-measure turn out to be lower. This shows that the proposed method can reduce the bias on the evaluated imbalanced data and increase AUC and F-measure by about 5% to 20%.
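The claim that a higher accuracy can be biased while AUC and F-measure are lower can be illustrated with a small worked example. The sketch below is not the authors' code and does not use their datasets: the 10-to-90 class split, the two hypothetical classifiers, and all function names are assumptions made only for illustration, using the standard definitions of accuracy, F-measure, and rank-based AUC.

```python
from typing import List, Tuple


def confusion(y_true: List[int], y_pred: List[int]) -> Tuple[int, int, int, int]:
    """Count (tp, fp, fn, tn) for binary labels; 1 marks the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Fraction of all samples that are labeled correctly."""
    tp, _, _, tn = confusion(y_true, y_pred)
    return (tp + tn) / len(y_true)


def f_measure(y_true: List[int], y_pred: List[int]) -> float:
    """F1 = 2TP / (2TP + FP + FN); returns 0.0 when the denominator is zero."""
    tp, fp, fn, _ = confusion(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def auc(y_true: List[int], scores: List[float]) -> float:
    """Rank-based AUC: chance that a random positive outscores a random
    negative, with ties counted as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 10 known interactions (1) and 90 non-interactions (0).
y_true = [1] * 10 + [0] * 90

# Classifier A ignores the minority class and always predicts "no interaction".
pred_a = [0] * 100

# Classifier B finds 8 of the 10 positives but raises 20 false alarms.
pred_b = [1] * 8 + [0] * 2 + [1] * 20 + [0] * 70

for name, pred in (("A (majority only)", pred_a), ("B (minority-aware)", pred_b)):
    # Hard 0/1 predictions are reused as scores for the AUC computation.
    scores = [float(p) for p in pred]
    print(
        name,
        "accuracy=%.3f" % accuracy(y_true, pred),
        "F-measure=%.3f" % f_measure(y_true, pred),
        "AUC=%.3f" % auc(y_true, scores),
    )
```

On this toy data, classifier A reaches 0.90 accuracy with an F-measure of 0 and an AUC of 0.5, while classifier B drops to 0.78 accuracy but reaches an F-measure of about 0.421 and an AUC of about 0.789. This is the same pattern the abstract describes: accuracy alone rewards ignoring the minority class, so AUC and F-measure are the more informative scores on imbalanced data.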


Published

02-12-2020

Issue

Vol. 7 No. 6 (2020)

Section

Computer Science

How to Cite

Optimasi Data Tidak Seimbang pada Interaksi Drug Target dengan Sampling dan Ensemble Support Vector Machine. (2020). Jurnal Teknologi Informasi Dan Ilmu Komputer, 7(6), 1221-1230. https://doi.org/10.25126/jtiik.2020762857