Biomedical Entity Recognition in Indonesian Online Health Consultation Text Based on the Transformers Architecture

Authors

  • Abid Famasya Abdillah Institut Teknologi Sepuluh Nopember, Surabaya
  • Diana Purwitasari Institut Teknologi Sepuluh Nopember, Surabaya http://orcid.org/0000-0001-7000-7628
  • Safitri Juanita Institut Teknologi Sepuluh Nopember, Surabaya, Universitas Budi Luhur, Jakarta, Indonesia
  • Mauridhi Hery Purnomo Institut Teknologi Sepuluh Nopember, Surabaya

DOI:

https://doi.org/10.25126/jtiik.20231016337

Abstract

Biomedical entity recognition is an important stage of information extraction in the health domain. Recent work relies largely on deep learning-based extraction models known as Biomedical NER (BioNER). Because these models need large amounts of training data, many studies train BioNER on social media data. However, social media covers so many topics that it is a poorly representative source for BioNER training: it carries considerable bias and contains relatively little biomedical content. This study therefore proposes a BioNER model trained on data from an online health consultation site (konsultasi kesehatan online, KKO), so that it represents medical data better than comparable studies. The main contribution is a BioNER model that can be used for biomedical information extraction in Indonesian. The model is built on the state-of-the-art Transformers architecture and achieves an F1 score of 0.7691, outperforming an LSTM model by 0.03 points. Simulation on real data also indicates that the model recognizes biomedical entities in general cases despite being trained on limited data. In addition, because it is based on XLM-R, the model has multilingual recognition capability, so its potential applications are not limited to Indonesian biomedical entities. To support further research, the model is publicly available at https://huggingface.co/abid/indonesia-bioner.
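The abstract reports an F1 score but does not say how it was computed. As an illustration only, the sketch below shows one common convention for NER evaluation: entity-level, exact-match, micro-averaged F1 over BIO-tagged sequences. The BIO scheme, the label names (`DISEASE`, `SYMPTOM`), and the helper names `extract_spans` and `entity_f1` are assumptions made for this example and are not taken from the paper.

```python
from typing import List, Set, Tuple

# (entity type, start index, end index exclusive); hypothetical representation
Span = Tuple[str, int, int]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO tag sequence.

    An I- tag that does not continue a span of the same type is ignored.
    """
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 where a prediction counts only on exact span and type match."""
    tp = n_gold = n_pred = 0
    for gold_tags, pred_tags in zip(gold, pred):
        g, p = extract_spans(gold_tags), extract_spans(pred_tags)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    if tp == 0:
        return 0.0
    precision = tp / n_pred
    recall = tp / n_gold
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: one of two gold entities is recovered exactly.
gold = [["B-DISEASE", "I-DISEASE", "O", "B-SYMPTOM"]]
pred = [["B-DISEASE", "I-DISEASE", "O", "O"]]
score = entity_f1(gold, pred)  # precision 1.0, recall 0.5
```

Under this convention a partially overlapping prediction earns no credit, so scores are typically lower than with token-level or relaxed matching; the reported 0.7691 may have been obtained under a different criterion.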




Published

28-02-2023

Issue

Section

Computer Science

How to Cite

Pengenalan Entitas Biomedis dalam Teks Konsultasi Kesehatan Online Berbahasa Indonesia Berbasis Arsitektur Transformers. (2023). Jurnal Teknologi Informasi Dan Ilmu Komputer, 10(1), 131-140. https://doi.org/10.25126/jtiik.20231016337