Indonesian-Language Speaker Identification System Using X-Vector Embeddings

Authors

  • Alim Misbullah, Universitas Syiah Kuala, Banda Aceh
  • Muhammad Saifullah Sani, Universitas Syiah Kuala, Banda Aceh
  • Husaini, Universitas Syiah Kuala, Banda Aceh
  • Laina Farsiah, Universitas Syiah Kuala, Banda Aceh
  • Zahnur, Universitas Syiah Kuala, Banda Aceh
  • Kikye Martiwi Sukiakhy, Universitas Syiah Kuala, Banda Aceh

DOI:

https://doi.org/10.25126/jtiik.20241127866

Keywords:

speaker identification, time delay neural network, x-vectors, mel frequency cepstral coefficients, equal error rate

Abstract

Speaker embeddings are vectors that have proven effective at representing speaker characteristics, yielding high accuracy in speaker recognition. This research focuses on applying x-vectors as speaker embeddings in an Indonesian-language speaker identification system. The model is trained on the VoxCeleb dataset and evaluated on the INF19 dataset, which was collected from the voices of students of the Informatics Department, Universitas Syiah Kuala, class of 2019. To build the model, features are extracted as Mel-Frequency Cepstral Coefficients (MFCC), Voice Activity Detection (VAD) is computed, data augmentation is applied, features are normalized with Cepstral Mean and Variance Normalization (CMVN), and filtering is performed. The testing process, by contrast, requires only MFCC features and VAD. Four models are constructed by combining two MFCC configurations with two Deep Neural Network (DNN) architectures based on the Time Delay Neural Network (TDNN). The best of the four models is selected based on the lowest Equal Error Rate (EER) and the shortest x-vector extraction time. The EER of the best model is 3.51% on the VoxCeleb1 test set, 1.3% on inf19_test_td, and 1.4% on inf19_test_tid. With the best model, x-vector extraction takes 6 hours 42 minutes 39 seconds for the training data, 2 minutes 24 seconds for the VoxCeleb1 test set, 18 seconds for inf19_enroll, 25 seconds for inf19_test_td, and 9 seconds for inf19_test_tid. The second DNN architecture combined with the second MFCC configuration yields a smaller model, better accuracy, especially on the Indonesian-language speaker dataset, and a shorter x-vector extraction time.
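To make the training-side front end described above concrete (MFCC extraction, voice activity detection, and CMVN), here is a minimal Python sketch using librosa and NumPy. The function name, frame sizes, number of coefficients, and energy threshold are illustrative assumptions rather than the configurations compared in the paper, and the augmentation and filtering steps are omitted.

```python
import librosa
import numpy as np

def extract_features(wav_path, n_mfcc=23, sr=16000):
    """MFCC + energy-based VAD + CMVN; an illustrative sketch, not the paper's exact setup."""
    y, sr = librosa.load(wav_path, sr=sr)
    n_fft, hop = int(0.025 * sr), int(0.010 * sr)  # 25 ms window, 10 ms hop (assumed values)

    # Frame-level MFCCs, shape (num_frames, n_mfcc).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop).T

    # Simple energy-based VAD: keep frames whose RMS energy exceeds a fraction
    # of the utterance mean (a crude stand-in for a real VAD front end).
    rms = librosa.feature.rms(y=y, frame_length=n_fft, hop_length=hop)[0]
    voiced = rms > 0.3 * rms.mean()
    mfcc = mfcc[voiced]

    # CMVN: per-utterance mean and variance normalization of each coefficient.
    mfcc = (mfcc - mfcc.mean(axis=0)) / (mfcc.std(axis=0) + 1e-8)
    return mfcc
```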

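The x-vector extractor itself is a TDNN trained as a speaker classifier; after training, the fixed-size embedding is read from an affine layer that follows statistics pooling. The PyTorch sketch below mirrors the commonly published x-vector layer layout; the layer sizes, temporal contexts, and speaker count are assumptions, not necessarily either of the two DNN architectures evaluated here.

```python
import torch
import torch.nn as nn

class XVectorTDNN(nn.Module):
    """Minimal x-vector-style TDNN: dilated 1-D convolutions over frame-level
    features, statistics pooling, then a fixed-size speaker embedding."""

    def __init__(self, feat_dim=23, embed_dim=512, num_speakers=1000):
        super().__init__()
        # Frame-level TDNN layers; kernel size and dilation set the temporal context.
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(), nn.BatchNorm1d(512),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(), nn.BatchNorm1d(512),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(), nn.BatchNorm1d(512),
            nn.Conv1d(512, 512, kernel_size=1), nn.ReLU(), nn.BatchNorm1d(512),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(), nn.BatchNorm1d(1500),
        )
        self.embedding = nn.Linear(2 * 1500, embed_dim)  # the x-vector is read off here
        self.classifier = nn.Sequential(nn.ReLU(), nn.BatchNorm1d(embed_dim),
                                        nn.Linear(embed_dim, num_speakers))

    def forward(self, feats):
        # feats: (batch, num_frames, feat_dim); Conv1d expects (batch, channels, time).
        x = self.frame_layers(feats.transpose(1, 2))
        # Statistics pooling: concatenate per-channel mean and std over time.
        stats = torch.cat([x.mean(dim=2), x.std(dim=2)], dim=1)
        xvector = self.embedding(stats)
        return xvector, self.classifier(xvector)
```

Only the `xvector` output is used at enrollment and test time; the classifier head is discarded once training is finished.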
 
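Identification then amounts to scoring each test x-vector against the enrolled speakers' x-vectors (inf19_enroll versus inf19_test_td and inf19_test_tid). The backend scorer is not specified in the abstract; the sketch below uses plain cosine similarity as a simple stand-in, with PLDA being a common alternative.

```python
import numpy as np

def identify(test_xvec, enrolled):
    """Return the enrolled speaker whose x-vector is most similar to `test_xvec`.

    `enrolled` maps speaker IDs to their (averaged) enrollment x-vectors.
    Cosine scoring is an assumption here, not necessarily the paper's backend.
    """
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10))

    scores = {spk: cosine(test_xvec, xvec) for spk, xvec in enrolled.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```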

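Accuracy is reported as the Equal Error Rate, the operating point where the false-acceptance and false-rejection rates coincide. A common way to estimate it from trial scores and same-speaker/different-speaker labels, assuming scikit-learn is available, is:

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(scores, labels):
    """EER from trial scores: labels are 1 for same-speaker trials, 0 otherwise."""
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1.0 - tpr                        # false-rejection (miss) rate
    idx = np.nanargmin(np.abs(fnr - fpr))  # point where FAR and FRR are closest
    return 0.5 * (fpr[idx] + fnr[idx])     # e.g. 0.0351 corresponds to 3.51%
```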



Published

25-04-2024

Issue

Section

Computer Science

How to Cite

Sistem Identifikasi Pembicara Berbahasa Indonesia Menggunakan X-Vector Embedding. (2024). Jurnal Teknologi Informasi Dan Ilmu Komputer, 11(2), 369-376. https://doi.org/10.25126/jtiik.20241127866