Semantic Clustering and Representative Sentence Selection for Multi-Document Summarization

Authors

Pasnur, Putu Praba Santika, Gus Nanang Syaifuddin

Abstract

Coverage and saliency are the main problems in multi-document summarization. A good summary should cover as many as possible of the important (salient) concepts found in the source documents. This research aims to develop a new multi-document summarization method based on semantic clustering and the selection of a representative sentence for each cluster. The proposed method builds on the principles of Latent Semantic Indexing (LSI) and Similarity Based Histogram Clustering (SHC) to cluster sentences semantically, and combines the Sentence Information Density (SID) and Sentence Cluster Keyword (SCK) features to select each cluster's representative sentence. Tests were performed on the Document Understanding Conference (DUC) 2004 Task 2 dataset, and the results were measured with Recall-Oriented Understudy for Gisting Evaluation (ROUGE). The proposed method achieved an average ROUGE-1 score of 0.395 and an average ROUGE-2 score of 0.106.

Keywords: multi-document summarization, latent semantic indexing, similarity based histogram clustering, sentence information density, sentence cluster keyword
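To make the reported ROUGE-1 and ROUGE-2 figures easier to read, here is a minimal sketch of ROUGE-N recall: the clipped count of n-grams shared between a candidate summary and a reference summary, divided by the number of n-grams in the reference. It is an illustration only, not the evaluation setup used in the paper. The function names, the whitespace tokenization, and the toy sentences are assumptions made for this example, and the official ROUGE package offers further options (such as stemming and multiple references) that are omitted here.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Clipped n-gram overlap divided by the number of reference n-grams.

    Assumption: lowercased whitespace tokenization and a single reference.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Each reference n-gram is matched at most as often as it occurs
    # in the candidate (clipping).
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


# Toy sentences, invented for illustration (not taken from DUC 2004).
reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"

r1 = rouge_n_recall(candidate, reference, n=1)  # 5 of 6 unigrams matched
r2 = rouge_n_recall(candidate, reference, n=2)  # 3 of 5 bigrams matched
print(f"ROUGE-1 recall: {r1:.3f}")
print(f"ROUGE-2 recall: {r2:.3f}")
```

Under this definition, an average ROUGE-1 of 0.395 would mean that roughly 40% of the reference unigrams also appear in the generated summaries. ROUGE-2 is lower because matching bigrams is a stricter requirement.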


DOI: http://dx.doi.org/10.25126/jtiik.201412117