Optimization of Suffix Tree Clustering with WordNet and Named Entity Recognition for Document Clustering

Authors

Satrio Hadi Wijoyo, Admaja Dwi Herlambang, Fahrur Rozi, Septiyan Andika Isanta

Abstract

The growing number of text documents in the digital world increases the volume of available information and makes information retrieval more difficult. Document clustering is an important area of text mining that can be used to streamline text management and text summarization. However, several problems arise when clustering text documents, especially news documents, such as ambiguity in the content, overlapping clusters, and the unique structure of news documents. This research proposes a new method for document clustering: optimization of Suffix Tree Clustering (STC) with WordNet and Named Entity Recognition (NER). The method has several steps. The first step is document preprocessing, which extracts named entities and detects synonyms based on WordNet. The second step is term weighting with tfidf and nerfidf. The last step is document clustering using Suffix Tree Clustering. In testing, the method obtained an average precision of 79.83%, recall of 77.25%, and F-measure of 78.30%.

Keywords: Document Clustering, Named Entity Recognition, Suffix Tree Clustering, WordNet
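
As a rough illustration of the clustering step named in the abstract, the following is a minimal sketch of the general Suffix Tree Clustering idea, not the authors' implementation. It assumes the standard formulation of STC: phrases shared by at least two documents form base clusters, and base clusters are merged when their document sets overlap by more than half in both directions. For brevity it enumerates word n-grams instead of building an actual suffix tree, and it omits the WordNet, NER, and term-weighting steps. The function names, the `max_len` limit, and the 0.5 threshold are assumptions chosen for this example.

```python
from collections import defaultdict
from itertools import combinations


def base_clusters(docs, max_len=3, min_docs=2):
    """Map each word phrase (up to max_len words) to the documents containing it.

    A full STC implementation reads shared phrases off a generalized suffix
    tree; enumerating word n-grams is a simplification used here.
    """
    phrase_docs = defaultdict(set)
    for doc_id, text in enumerate(docs):
        words = text.lower().split()
        for start in range(len(words)):
            last = min(start + max_len, len(words))
            for end in range(start + 1, last + 1):
                phrase_docs[tuple(words[start:end])].add(doc_id)
    # Keep only phrases shared by at least min_docs documents.
    return {p: d for p, d in phrase_docs.items() if len(d) >= min_docs}


def merge_clusters(clusters, threshold=0.5):
    """Merge base clusters whose document sets overlap strongly both ways."""
    phrases = list(clusters)
    parent = {p: p for p in phrases}

    def find(p):
        # Union-find lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in combinations(phrases, 2):
        common = len(clusters[a] & clusters[b])
        if (common / len(clusters[a]) > threshold
                and common / len(clusters[b]) > threshold):
            parent[find(a)] = find(b)

    # Each connected group of base clusters becomes one final cluster.
    merged = defaultdict(set)
    for p in phrases:
        merged[find(p)] |= clusters[p]
    return sorted(sorted(d) for d in merged.values())


# Hypothetical toy headlines, not data from the paper.
docs = [
    "jakarta flood warning",
    "flood warning issued",
    "stock market rally",
    "market rally continues",
]
print(merge_clusters(base_clusters(docs)))  # prints [[0, 1], [2, 3]]
```

Because a document may contain phrases from several unrelated base clusters, this scheme allows a document to appear in more than one final cluster, which is relevant to the overlapping-cluster problem the abstract mentions.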


DOI: http://dx.doi.org/10.25126/jtiik.201744400