Detection of Sensitive Information in Text Documents in the Financial Services Sector Using a CNN Model Based on TF-IDF and TF-RF

Authors

  • Steven Adriandi Vodegel, Universitas Budi Luhur, Jakarta
  • Septian Charles Vodegel, Universitas Budi Luhur, Jakarta
  • Imelda Vodegel, Universitas Budi Luhur, Jakarta

DOI:

https://doi.org/10.25126/jtiik.2025125

Keywords:

sensitive information detection, machine learning, convolutional neural networks, term frequency-inverse document frequency, term frequency-relevance frequency

Abstract

This research develops a machine learning model to detect sensitive information in text documents in the financial services industry. It addresses three problems: the potential misuse of information by resigning employees, the limitations of traditional detection methods, and the need for an effective machine learning model. The scope covers a Convolutional Neural Network (CNN) model combined with two term-weighting methods, Term Frequency-Inverse Document Frequency (TF-IDF) and Term Frequency-Relevance Frequency (TF-RF). The study takes a quantitative, experimental approach in five stages: data collection, preprocessing, application of the weighting methods, model training and evaluation, and result validation. The data consist of text documents from financial services companies, such as financial reports and customer data. Preprocessing removes noise and irrelevant information, after which the weighting methods assign weights to important words. The CNN model is then trained to detect patterns that indicate sensitive information. The results show that TF-IDF outperformed TF-RF in detecting sensitive information, with a highest accuracy of 93.26%. The CNN model was able to recognize complex patterns and detect sensitive information with high accuracy. Evaluation using accuracy, precision, recall, and F1-score indicates that the model is reliable and applicable in real-world settings. This research contributes to information security and to the use of term weighting to improve the performance of machine learning models.
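The two weighting schemes compared in the abstract differ mainly in whether they use class labels. The sketch below is illustrative only and is not the authors' implementation: it assumes the textbook forms idf = log(N / df) and rf = log2(2 + a / max(1, c)), where a and c are the numbers of sensitive and non-sensitive documents containing the term. The toy corpus, the label convention, and all function names are hypothetical; the abstract does not state which variants of the formulas were used.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (tokens, label); 1 = sensitive, 0 = not sensitive.
DOCS = [
    (["account", "number", "customer", "balance"], 1),
    (["customer", "balance", "statement"], 1),
    (["office", "lunch", "menu"], 0),
    (["customer", "newsletter", "menu"], 0),
]


def document_frequencies(docs):
    """Per term, count the sensitive (a) and non-sensitive (c) documents containing it."""
    pos, neg = Counter(), Counter()
    for tokens, label in docs:
        target = pos if label == 1 else neg
        target.update(set(tokens))  # set(): count each document at most once
    return pos, neg


def idf(term, n_docs, pos, neg):
    """Assumed textbook IDF, log(N / df). It ignores the class labels."""
    df = pos[term] + neg[term]
    return math.log(n_docs / df) if df else 0.0


def rf(term, pos, neg):
    """Assumed relevance frequency, log2(2 + a / max(1, c)). It uses the class labels."""
    return math.log2(2 + pos[term] / max(1, neg[term]))


def weight_document(tokens, docs, scheme):
    """Return {term: weight} for one document under 'tfidf' or 'tfrf'."""
    pos, neg = document_frequencies(docs)
    tf = Counter(tokens)  # raw term counts within this document
    if scheme == "tfidf":
        return {t: tf[t] * idf(t, len(docs), pos, neg) for t in tf}
    if scheme == "tfrf":
        return {t: tf[t] * rf(t, pos, neg) for t in tf}
    raise ValueError(f"unknown scheme: {scheme}")


tfidf_weights = weight_document(DOCS[0][0], DOCS, "tfidf")
tfrf_weights = weight_document(DOCS[0][0], DOCS, "tfrf")
for term in DOCS[0][0]:
    print(f"{term:10s} tfidf={tfidf_weights[term]:.3f} tfrf={tfrf_weights[term]:.3f}")
```

In this toy corpus, "customer" appears in three of the four documents, so TF-IDF gives it a low weight, while TF-RF gives it a comparatively high weight because two of those three documents are sensitive. In a pipeline like the one described, such weights would typically be fitted on the training split only; how the weighted features are arranged as CNN input is not stated in the abstract.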

Published

31-10-2025

Issue

Section

Computer Science

How to Cite

Deteksi Informasi Sensitif dalam Dokumen Teks di Sektor Jasa Keuangan dengan Model CNN Berbasis TF-IDF dan TF-RF. (2025). Jurnal Teknologi Informasi Dan Ilmu Komputer, 12(5), 997-1006. https://doi.org/10.25126/jtiik.2025125