Performance Comparison of Blocking Algorithms in the Indexing Process for Duplicate Detection

M. Miftakul Amin; Yevi Dwitayanti

doi:10.25126/jtiik.1148080

Authors

M. Miftakul Amin, Politeknik Negeri Sriwijaya, Palembang
Yevi Dwitayanti, Politeknik Negeri Sriwijaya, Palembang

DOI:

https://doi.org/10.25126/jtiik.1148080

Abstract

Integrating data from heterogeneous data sources requires good data quality, and one characteristic of good data quality is the absence of duplicate records. Duplicate detection compares the records in a dataset with one another, forming candidate record pairs. Blocking is an indexing technique that reduces the number of record pairs that must be compared during duplicate detection. This study compares several blocking algorithms in order to recommend the one best suited to the task. Six algorithms are investigated: Soundex, NYSIIS, Metaphone, Double Metaphone, Jaro-Winkler Similarity, and Cosine Similarity. The dataset is a restaurant dataset of 112 records, some of which are suspected duplicates. The results show that NYSIIS gave the best record blocking result, at 97 records, while Soundex and Cosine Similarity gave the best candidate-pair result, at 8 candidate record pairs. In terms of execution time, Soundex and NYSIIS were the fastest, each taking 0.04 seconds.
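To make the idea of blocking concrete, the following is a minimal Python sketch of phonetic blocking with Soundex. It is not the authors' implementation: the Soundex routine follows the common American Soundex rules, the helper names (`soundex`, `block_records`, `candidate_pairs`, `total_pairs`) are hypothetical, and the sample restaurant names are made up for illustration rather than taken from the dataset used in the study.

```python
from collections import defaultdict
from itertools import combinations

# Letter-to-digit table of American Soundex; vowels, H, W and Y have no code.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code of `name` ("" if it has no letters)."""
    letters = [c for c in name.upper() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters that share a code
        code = _CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code  # a vowel resets `prev`, so a repeated code counts again
    return (encoded + "000")[:4]


def block_records(records, key_func):
    """Group record ids by the blocking key computed from each name."""
    blocks = defaultdict(list)
    for rec_id, name in records:
        blocks[key_func(name)].append(rec_id)
    return blocks


def candidate_pairs(blocks):
    """Pair up records only within the same block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def total_pairs(n: int) -> int:
    """Number of comparisons when every record is compared with every other."""
    return n * (n - 1) // 2


# Hypothetical sample records (id, restaurant name); not from the paper.
SAMPLE_RECORDS = [
    (1, "Arnie Morton's"),
    (2, "Arnie Mortons of Chicago"),
    (3, "Campanile"),
    (4, "Campanille"),
    (5, "Spago"),
]

if __name__ == "__main__":
    blocks = block_records(SAMPLE_RECORDS, soundex)
    print(dict(blocks))
    print(sorted(candidate_pairs(blocks)))
    print(total_pairs(len(SAMPLE_RECORDS)))
```

On these five sample records, exhaustive comparison needs 10 pairs, while Soundex blocking leaves 2 candidate pairs. For a dataset of 112 records, exhaustive comparison would need 112 × 111 / 2 = 6216 pairs, which is the cost that blocking aims to cut down. Swapping `soundex` for another key function (for example an NYSIIS or Metaphone encoder) changes the blocks without changing the rest of the sketch.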
