Abstract
Recent advancements in large language models (LLMs) have showcased impressive code generation capabilities, primarily evaluated through language-to-code benchmarks. However, these benchmarks may not fully capture a model's code understanding abilities. We introduce CodeJudge-Eval (CJ-Eval), a novel benchmark designed to assess LLMs' code understanding abilities from the perspective of code judging rather than code generation. CJ-Eval challenges models to determine the correctness of provided code solutions, encompassing various error types and compilation issues. By leveraging a diverse set of problems and a fine-grained judging system, CJ-Eval addresses the limitations of traditional benchmarks, including the potential memorization of solutions. Evaluation of 12 well-known LLMs on CJ-Eval reveals that even state-of-the-art models struggle, highlighting the benchmark's ability to probe deeper into models' code understanding abilities. Our benchmark is available at https://github.com/CodeLLM-Research/CodeJudge-Eval.
Original language | English |
---|---|
Title of host publication | Proceedings of the 31st International Conference on Computational Linguistics, COLING 2025 |
Editors | Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert |
Publisher | Association for Computational Linguistics (ACL) |
Pages | 73-95 |
Number of pages | 23 |
ISBN (Electronic) | 9798891761964 |
Publication status | Published - Jan 2025 |
Event | 31st International Conference on Computational Linguistics, COLING 2025, Abu Dhabi, United Arab Emirates. Duration: 19 Jan 2025 → 24 Jan 2025. Conference proceedings: https://aclanthology.org/volumes/2025.coling-main/ |
Publication series
Name | Proceedings - International Conference on Computational Linguistics, COLING |
---|---|
Volume | Part F206484-1 |
ISSN (Print) | 2951-2093 |
Conference
Conference | 31st International Conference on Computational Linguistics, COLING 2025 |
---|---|
Country/Territory | United Arab Emirates |
City | Abu Dhabi |
Period | 19 Jan 2025 → 24 Jan 2025 |