Abstract
Evolution is a promising way for Large Language Models (LLMs) to tackle open-ended problems such as molecular optimization. Existing training-free evolution methods rely on context engineering, which cannot reliably yield the desired solutions. Reinforcement Learning with Verifiable Rewards (RLVR) is a learning-centric alternative, but it prioritizes final solutions over the multi-turn process of evolution and therefore fails to deliver stable improvement. To address this, we propose Learning to Evolve (LtE), which learns a policy for iterative refinement by turning per-turn evaluator scores into turn-wise and trajectory-wise credit assignments. LtE uses (i) a turn-level advantage based on each turn's score improvement over the initial solution and (ii) a trajectory-level advantage that accumulates these improvements over the entire trajectory. The two advantages are combined for credit assignment across turns and across trajectories, aligning learning with progress across evolution turns. In experiments on molecular optimization tasks, LtE produces higher-quality solutions under the same budgets as training-free and RLVR methods, and it enables test-time scaling.
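The credit-assignment idea in the abstract can be sketched concretely: given per-turn evaluator scores, the turn-level advantage measures each turn's improvement over the initial solution, and the trajectory-level advantage accumulates those improvements. The function names, the blending weight `alpha`, and the exact combination rule below are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of turn-wise and trajectory-wise credit assignment,
# assuming a list of evaluator scores [s_0, s_1, ..., s_T] where s_0
# scores the initial solution and s_t scores the solution after turn t.

def turn_advantages(scores):
    """Turn-level advantage: each turn's score improvement over s_0."""
    return [s - scores[0] for s in scores[1:]]

def trajectory_advantage(scores):
    """Trajectory-level advantage: accumulated per-turn improvements."""
    return sum(turn_advantages(scores))

def combined_advantages(scores, alpha=0.5):
    """Blend turn-wise and trajectory-wise credit per turn.

    `alpha` is a hypothetical mixing weight; the paper combines the two
    signals but does not prescribe this particular form.
    """
    traj = trajectory_advantage(scores)
    return [alpha * a + (1 - alpha) * traj for a in turn_advantages(scores)]

# Example: initial solution scored 0.2, then three refinement turns.
scores = [0.2, 0.3, 0.25, 0.5]
per_turn = turn_advantages(scores)       # improvements over the start
total = trajectory_advantage(scores)     # sum of those improvements
```

Note that a turn can receive negative turn-level credit (turn 2 above regresses to 0.25) while still sharing in positive trajectory-level credit, which is what ties learning to overall progress rather than to the final solution alone.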
| Original language | English |
|---|---|
| Title of host publication | ICLR 2026 Workshop on AI with Recursive Self-Improvement |
| Publisher | International Conference on Learning Representations, ICLR |
| Pages | 1-18 |
| Number of pages | 18 |
| Publication status | Published - 26 Apr 2026 |
| Event | ICLR 2026 Workshop on AI with Recursive Self-Improvement, Rio de Janeiro, Brazil, 26 Apr 2026 → 26 Apr 2026 (https://openreview.net/group?id=ICLR.cc/2026/Workshop/RSI) |
Publication series
| Name | International Conference on Learning Representations Workshop |
|---|---|
Workshop
| Workshop | ICLR 2026 Workshop on AI with Recursive Self-Improvement |
|---|---|
| Country/Territory | Brazil |
| City | Rio de Janeiro |
| Period | 26/04/26 → 26/04/26 |
| Internet address | https://openreview.net/group?id=ICLR.cc/2026/Workshop/RSI |
Title: Learning to Evolve: Scaling Open-Ended Discovery with Relative-Progress RL