Classification of Indonesian-Language Documents Using LSI-Based Indexing

Authors

Achmad Ridok, Indriati

Abstract

Text document classification aims to determine the category of a document based on its similarity to a set of previously labeled documents. However, most existing classification methods rely on keywords, or on words considered important, and assume that each word represents a unique concept. In practice, several words that share the same meaning or semantics should be represented by a single unique term. In this research, an approach based on LSI (Latent Semantic Indexing) is applied to KNN to classify Indonesian-language documents. Terms from the training and test documents are weighted using tf-idf and represented in the term-document matrices A and B, respectively. Matrix A is then decomposed using SVD to obtain the matrices U and V, reduced to rank k. Both matrices U and V are used to reduce B as the representation of the test documents. In terms of classification results, the best system performance was obtained by LSI-based KNN without stemming at a threshold of 2. In terms of running time, however, the best performance was achieved by LSI-based KNN with stemming at a threshold of 5. LSI-based KNN performed significantly better than conventional KNN in both results and running time.
Keywords: KNN, LSI, K-Rank, SVD, document classification
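
The pipeline summarized in the abstract (tf-idf weighting, rank-k SVD of A, reduction of B, then KNN) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy corpus, the idf variant, the folding-in formula (S_k^-1 U_k^T d), cosine similarity, and majority voting are all assumptions chosen for the example. Stemming and the threshold setting mentioned in the abstract are not modeled.

```python
import numpy as np
from collections import Counter


def tfidf_matrices(train_docs, test_docs):
    """Build term-document tf-idf matrices A (train) and B (test).

    The vocabulary and idf weights come from the training documents only.
    """
    vocab = sorted({t for d in train_docs for t in d})
    index = {t: i for i, t in enumerate(vocab)}
    df = np.zeros(len(vocab))
    for d in train_docs:
        for t in set(d):
            df[index[t]] += 1
    # Assumed idf variant: log(N / df) + 1, so no term gets a zero weight.
    idf = np.log(len(train_docs) / df) + 1.0

    def weigh(docs):
        M = np.zeros((len(vocab), len(docs)))
        for j, d in enumerate(docs):
            for t, count in Counter(d).items():
                if t in index:  # terms unseen in training are ignored
                    M[index[t], j] = count * idf[index[t]]
        return M

    return weigh(train_docs), weigh(test_docs)


def lsi_fit(A, k):
    """Rank-k truncated SVD of A: returns U_k, singular values s_k, V_k."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :].T


def fold_in(B, U_k, s_k):
    """Project documents (columns of B) into the k-dimensional latent space.

    Assumed folding-in rule: d_hat = S_k^-1 U_k^T d, one row per document.
    """
    return (B.T @ U_k) / s_k


def knn_predict(train_vecs, labels, test_vecs, n_neighbors=3):
    """Majority vote among the n_neighbors most cosine-similar training docs."""
    def normalize(X):
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        return X / np.where(norms == 0, 1.0, norms)

    sims = normalize(test_vecs) @ normalize(train_vecs).T
    preds = []
    for row in sims:
        nearest = np.argsort(-row)[:n_neighbors]
        votes = Counter(labels[i] for i in nearest)
        preds.append(votes.most_common(1)[0][0])
    return preds


# Hypothetical toy corpus (already tokenized); not data from the paper.
train_docs = [
    ["harga", "saham", "naik"],
    ["saham", "bank", "turun"],
    ["tim", "sepak", "bola", "menang"],
    ["pemain", "bola", "cedera"],
]
labels = ["ekonomi", "ekonomi", "olahraga", "olahraga"]
test_docs = [["saham", "bank", "naik"], ["pemain", "tim", "bola"]]

A, B = tfidf_matrices(train_docs, test_docs)
U_k, s_k, V_k = lsi_fit(A, k=2)
test_vecs = fold_in(B, U_k, s_k)  # reduced representation of B
predictions = knn_predict(V_k, labels, test_vecs, n_neighbors=3)
print(predictions)
```

In this sketch the rows of V_k serve as the reduced training documents, and the test documents are mapped into the same space through U_k and the singular values, so neighbors are compared in k dimensions rather than over the full vocabulary.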


DOI: http://dx.doi.org/10.25126/jtiik.201522136