Abstract
Data synthesis has become a crucial research area in large language models (LLMs), especially for generating high-quality instruction fine-tuning data to enhance downstream performance. In code generation, a key application of LLMs, manual annotation of code instruction data is costly. Recent methods, such as Code Evol-Instruct and OSS-Instruct, leverage LLMs to synthesize large-scale code instruction data, significantly improving LLM coding capabilities. However, these approaches face limitations due to unidirectional synthesis and randomness-driven generation, which restrict data quality and diversity. To overcome these challenges, we introduce Tree-of-Evolution (ToE), a novel framework that models the code instruction synthesis process with a tree structure, exploring multiple evolutionary paths to alleviate the constraints of unidirectional generation. Additionally, we propose optimization-driven evolution, which refines each generation step based on the quality of the previous iteration. Experimental results across five widely used coding benchmarks (HumanEval, MBPP, EvalPlus, LiveCodeBench, and BigCodeBench) demonstrate that base models fine-tuned on just 75k samples synthesized by our method achieve comparable or superior performance to the state-of-the-art open-weight Code LLM, Qwen2.5-Coder-Instruct, which was fine-tuned on millions of samples.
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics |
| Editors | Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar |
| Place of Publication | Vienna |
| Publisher | Association for Computational Linguistics (ACL) |
| Pages | 297–316 |
| Number of pages | 20 |
| Volume | 1 |
| ISBN (Electronic) | 9798891762510 |
| Publication status | Published - Jul 2025 |
| Event | 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025, Austria Center Vienna, Vienna, Austria. Duration: 27 Jul 2025 → 1 Aug 2025. Conference website: https://2025.aclweb.org/. Conference program: https://docs.google.com/spreadsheets/d/1O-n3HPvv8vY0L_kjyP5AtRTcWWjqLk2deCYtrMgCGw4/edit?usp=drive_link. Conference proceedings: https://aclanthology.org/events/acl-2025/ |
Publication series
| Name | Proceedings of Annual Meeting of the Association for Computational Linguistics |
|---|---|
| Publisher | Association for Computational Linguistics |
Conference
| Conference | 63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025 |
|---|---|
| Country/Territory | Austria |
| City | Vienna |
| Period | 27/07/25 → 1/08/25 |