Enhancing the CodeXGLUE Dataset with a Unified Abstract Syntax Tree (AST) Representation for Cross-Language Code Analysis

Mardi Siswo Utomo; Ema Utami; Kusrini Kusrini; Arief Setyanto

doi:10.25126/jtiik.2025125

Authors

Mardi Siswo Utomo, Universitas Stikubank Semarang, Semarang. ORCID: https://orcid.org/0009-0008-8972-908X
Ema Utami, Universitas Amikom Yogyakarta, Yogyakarta. ORCID: https://orcid.org/0000-0002-8237-8693
Kusrini Kusrini, Universitas Amikom Yogyakarta, Yogyakarta. ORCID: https://orcid.org/0000-0001-9573-3909
Arief Setyanto, Universitas Amikom Yogyakarta, Yogyakarta. ORCID: https://orcid.org/0000-0003-0721-3941


Keywords:

abstract syntax tree, code summarization, machine learning, syntactic analysis, code generation

Abstract

Popular source code datasets such as CodeXGLUE do not yet provide a standardized syntactic representation for research across programming languages, which complicates syntax-aware analysis. This study enriches CodeXGLUE with such a representation. We present CodeXGLUE-AST, a dataset of unified Abstract Syntax Trees (ASTs) for six programming languages: Go, Java, JavaScript, Python, Ruby, and PHP. ASTs are extracted with Tree-sitter and stored in a structured JSON format. To keep the representation consistent across languages, node types are then classified and mapped onto a unified AST structure. The dataset is evaluated through an analysis of AST structural completeness, a measurement of code reconstruction accuracy using BLEU scores, and Data Flow Graph (DFG) extraction testing to confirm that dependencies between variables are maintained. In addition, experiments on code summarization with the CodeT5 model show higher BLEU, METEOR, ROUGE, and ROUGE-L scores in almost all experiments when the unified AST is used. The unified AST representation is expected to support the development of more reliable, syntax-aware multilingual machine learning models for tasks such as code classification, code summarization, and program reconstruction.
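The node-type mapping step described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: Python's standard `ast` module stands in for Tree-sitter so that the example runs without extra dependencies, and the `UNIFIED_TYPES` table, the helper functions `unify`, `to_unified_tree`, and `collect_types`, and the JSON field names (`type`, `raw_type`, `children`) are all assumptions made for illustration.

```python
import ast
import json

# Hypothetical mapping from parser-specific node types to unified categories.
# Keys mix Python `ast` class names with Tree-sitter-style names to show how
# different grammars could collapse onto one label; the paper's actual mapping
# table is not reproduced here.
UNIFIED_TYPES = {
    "Module": "program",
    "FunctionDef": "function_definition",           # Python `ast`
    "function_declaration": "function_definition",  # e.g. Go, JavaScript
    "method_declaration": "function_definition",    # e.g. Java
    "arguments": "parameters",
    "arg": "parameter",
    "Return": "return_statement",
    "BinOp": "binary_expression",
    "Name": "identifier",
}


def unify(node_type: str) -> str:
    """Map a parser-specific node type to a unified label (assumed scheme)."""
    return UNIFIED_TYPES.get(node_type, "other")


def to_unified_tree(node: ast.AST) -> dict:
    """Convert an AST node into a JSON-serializable dict with unified types."""
    raw_type = type(node).__name__
    return {
        "type": unify(raw_type),
        "raw_type": raw_type,
        "children": [to_unified_tree(c) for c in ast.iter_child_nodes(node)],
    }


def collect_types(tree: dict):
    """Yield the unified type of every node in the tree, depth first."""
    yield tree["type"]
    for child in tree["children"]:
        yield from collect_types(child)


source = "def add(a, b):\n    return a + b\n"
tree = to_unified_tree(ast.parse(source))
as_json = json.dumps(tree, indent=2)  # structured JSON, one record per snippet
```

Keeping `raw_type` next to the unified `type` is a design choice of this sketch: it lets a consumer fall back to the language-specific label when the unified category is too coarse.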


