A Combination of Rule-Based and N-Gram Stemming Methods for Balinese Language Stemming

Authors

Made Agus Putra Subali, Chastine Fatichah

Abstract

Stemming is the process of extracting the base word (stem) from an affixed word; it aims to increase recall by reducing the variants of an affixed word to a single base form. Previous research on stemming for the Balinese language used a rule-based method, but it stripped only prefixes and suffixes and left other affix types, such as infixes, confixes, simulfixes, and affix combinations, unhandled. Rule-based stemming has been applied to many different languages. In a simple domain, a rule-based method is easy to verify and validate; in a highly complex domain, however, it has a weakness: if the system cannot match a word to any rule, it produces no result. To overcome this weakness, we use n-gram stemming: the affixed word and a base word are converted to n-grams, and the similarity between the n-grams of the affixed word and those of the base word is measured with the Dice coefficient. If the similarity meets a defined threshold, the base word compared against the affixed word is returned. In this study, we develop a stemmer that strips all affix variations in the Balinese language by combining the rule-based approach with n-gram stemming. In tests on ten queries, the proposed method achieved an average stemming accuracy of 96.67%, compared with 75% for the previous method; in tests on five queries, n-gram stemming recognized several affixed words not covered by the rules. In future work, we will take the semantics of each word into account and validate the stemmer in a text mining application.
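The n-gram stemming step described in the abstract can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the bigram size (n = 2), the threshold of 0.6, and the toy word list are all assumptions chosen for the example, since the abstract does not state the values used in the study. The sketch computes the Dice coefficient 2|X ∩ Y| / (|X| + |Y|) over the sets X and Y of character n-grams of the two words, and returns every base word whose score meets the threshold.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of a word (bigrams by default)."""
    word = word.lower()
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def dice(a, b, n=2):
    """Dice coefficient between the n-gram sets of two words, in [0, 1]."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    if not x or not y:
        # A word shorter than n has no n-grams; treat it as no match.
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def ngram_stem(word, base_words, threshold=0.6, n=2):
    """Return (base_word, score) pairs that meet the threshold, best first.

    The threshold value is an assumption for this sketch.
    """
    scored = [(base, dice(word, base, n)) for base in base_words]
    matches = [(base, score) for base, score in scored if score >= threshold]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy dictionary of base words, for illustration only.
BASE_WORDS = ["jalan", "jagur", "beli"]

print(ngram_stem("majalan", BASE_WORDS))  # [('jalan', 0.8)]
```

For "majalan" and "jalan", the bigram sets are {ma, aj, ja, al, la, an} and {ja, al, la, an}; they share four bigrams, so the score is 2 × 4 / (6 + 4) = 0.8. In a combined stemmer of the kind the abstract describes, a lookup like this would presumably serve as the fallback when no affix rule matches the input word.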


Keywords


Balinese Language Stemmer, Rule-Based Stemming, N-Gram Stemming, Dice Coefficient






DOI: http://dx.doi.org/10.25126/jtiik.2019621105