Enhancing the CodeXGLUE Dataset with a Unified Abstract Syntax Tree (AST) Representation for Cross-Language Code Analysis
DOI:
https://doi.org/10.25126/jtiik.2025125
Keywords:
abstract syntax tree, code summarization, machine learning, syntactic analysis, code generation
Abstract
Popular source code datasets such as CodeXGLUE do not yet provide a standardized syntactic representation for research across programming languages, which complicates syntax-aware analysis. This study enriches CodeXGLUE with such a representation. We present CodeXGLUE-AST, a unified Abstract Syntax Tree (AST) dataset covering six programming languages: Go, Java, JavaScript, Python, Ruby, and PHP. ASTs are extracted with Tree-sitter and stored in a structured JSON format. To maintain consistency across languages, node types are then classified and mapped to unify the AST structure representation. The dataset is evaluated through an analysis of AST structural completeness, a measurement of code reconstruction accuracy using BLEU scores, and Data Flow Graph (DFG) extraction tests aimed at preserving dependencies between variables. In addition, experiments on the code summarization task with the CodeT5 model show higher BLEU, METEOR, ROUGE, and ROUGE-L scores in almost all experiments when the unified AST is used. The unified AST representation is expected to support the development of more reliable, syntax-aware multilingual ML models for tasks such as code classification, code summarization, and program reconstruction.
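The node-type mapping step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the JSON layout (a "type" field plus a "children" list), the unified category names, and the mapping table are all assumptions made for illustration, and the sketch assumes only named Tree-sitter nodes were serialized. The language-specific node type names follow the Tree-sitter grammars for each language.

```python
import json

# Hypothetical mapping from language-specific Tree-sitter node types to
# unified categories. The category names are illustrative assumptions,
# not the mapping table used in the CodeXGLUE-AST dataset.
UNIFIED_TYPES = {
    "function_definition": "FunctionDef",   # Python, PHP
    "function_declaration": "FunctionDef",  # Go, JavaScript
    "method_declaration": "FunctionDef",    # Java
    "method": "FunctionDef",                # Ruby
    "return_statement": "Return",           # most grammars
    "return": "Return",                     # Ruby
}


def unify(node):
    """Return a copy of a JSON-style AST node with unified type labels.

    Node types without an entry in UNIFIED_TYPES are left unchanged.
    """
    unified = dict(node)
    unified["type"] = UNIFIED_TYPES.get(node["type"], node["type"])
    unified["children"] = [unify(child) for child in node.get("children", [])]
    return unified


# Two toy ASTs in the assumed JSON layout: a function that returns a value.
python_ast = json.loads(
    '{"type": "function_definition",'
    ' "children": [{"type": "return_statement", "children": []}]}'
)
ruby_ast = json.loads(
    '{"type": "method", "children": [{"type": "return", "children": []}]}'
)

unified_python = unify(python_ast)
unified_ruby = unify(ruby_ast)
```

After mapping, the two toy trees carry identical type labels, which is the property a cross-language model would rely on when consuming the unified representation.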
License
Copyright (c) 2025 Jurnal Teknologi Informasi dan Ilmu Komputer

This article is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0).

Authors who publish in this journal agree to the following terms:
- Authors retain copyright and grant the journal the right of first publication, with the work simultaneously licensed under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors may enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (for example, posting it to an institutional repository or publishing it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (for example, in institutional repositories or on their website) before and during the submission process, as this can lead to productive exchanges as well as earlier and greater citation of the published work. (See The Effect of Open Access.)










