Applying Text Augmentation to Address Imbalanced Data in Indonesian-Language Text Classification

Iftitah Athiyyah Rahma; Lya Hulliyyatus Suadaa

doi:10.25126/jtiik.2023107325

Authors

Iftitah Athiyyah Rahma, Politeknik Statistika STIS, East Jakarta
Lya Hulliyyatus Suadaa, Politeknik Statistika STIS, East Jakarta

Abstract

Text classification is one of the fundamental tasks in natural language processing (NLP). In practice, however, the data and resources available for text classification are limited. One common problem with labeled data is imbalance, the condition in which one label has more data than the others. Imbalanced data degrades the performance and accuracy of the model because the model focuses on the majority label. Minority-label data therefore tends to be misclassified, even though in some cases the ability to predict the minority label matters more. To address this, this study takes an oversampling approach, adding data to balance the dataset. Oversampling applied to text data is known as text augmentation. Two text augmentation techniques, synonym replacement and back translation, are applied under several imbalance conditions and augmentation scenarios on two datasets. The experimental results show that augmentation can increase the F1 score of the minority label. The effect of augmentation is more pronounced on the small dataset and under severe imbalance. Back translation gives better results than synonym replacement. The results also show that the amount of augmented data in a scenario affects the gain in F1 score. Adding more augmented data does not necessarily yield better results, because there are indications of overfitting to the training data. Finally, abnormal or nonstandard words in the informal text dataset affect the augmentation process, so the synthetic text obtained is not as good as that for the formal text dataset.
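
The oversampling idea described in the abstract can be illustrated with a minimal Python sketch of synonym replacement. It is not the authors' implementation: the synonym dictionary `SYNONYMS`, the helper names `synonym_replace` and `oversample_minority`, the toy sentences, and the labels are all hypothetical, chosen only to show how synthetic minority-label texts might be generated until a target count is reached.

```python
import random
from collections import Counter

# Hypothetical toy synonym dictionary (an assumption, not from the paper);
# a real setup would draw on an Indonesian thesaurus or similar resource.
SYNONYMS = {
    "bagus": ["baik", "apik"],
    "buruk": ["jelek"],
    "cepat": ["lekas", "kilat"],
}


def synonym_replace(text, rng, n_replace=1):
    """Return a copy of text with up to n_replace words swapped for a synonym."""
    tokens = text.split()
    # Positions of words that have an entry in the synonym dictionary.
    candidates = [i for i, tok in enumerate(tokens) if tok.lower() in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n_replace]:
        tokens[i] = rng.choice(SYNONYMS[tokens[i].lower()])
    # If no word has a known synonym, the text comes back unchanged.
    return " ".join(tokens)


def oversample_minority(texts, labels, minority_label, target_count, seed=0):
    """Append synthetic minority texts until the label has target_count rows."""
    rng = random.Random(seed)
    pool = [t for t, y in zip(texts, labels) if y == minority_label]
    if not pool:
        raise ValueError("no examples with the minority label")
    out_texts, out_labels = list(texts), list(labels)
    needed = max(target_count - len(pool), 0)
    for k in range(needed):
        source = pool[k % len(pool)]  # cycle through the original minority texts
        out_texts.append(synonym_replace(source, rng))
        out_labels.append(minority_label)
    return out_texts, out_labels


# Toy imbalanced dataset: "neu" is the majority label, "neg" the minority.
texts = ["layanan bagus dan cepat", "produk buruk", "harga biasa saja",
         "pengiriman biasa", "kemasan biasa"]
labels = ["pos", "neg", "neu", "neu", "neu"]

aug_texts, aug_labels = oversample_minority(texts, labels, "neg", target_count=3)
print(Counter(aug_labels))
```

Back translation would follow the same pattern, with the replacement step swapped for a round trip through a translation model; that step is left out here because it depends on an external model or service. The `target_count` parameter stands in for the augmentation scenarios mentioned in the abstract, where generating more synthetic data is not guaranteed to help.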

Author Biographies

Iftitah Athiyyah Rahma, Politeknik Statistika STIS, East Jakarta

Student, Statistical Computing Study Program, Politeknik Statistika STIS

Lya Hulliyyatus Suadaa, Politeknik Statistika STIS, East Jakarta

Lecturer (Lektor), Politeknik Statistika STIS

