
Reference-guided Policy Optimization for Molecular Optimization via LLM Reasoning

  • Xuan Li
  • Zhanke Zhou
  • Zongze Li
  • Jiangchao Yao
  • Yu Rong
  • Lu Zhang
  • Bo Han*

*Corresponding author for this work

Research output: Chapter in book/report/conference proceeding > Conference proceeding > peer-review

Abstract

Large language models (LLMs) benefit substantially from supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) in reasoning tasks. However, these recipes perform poorly in instruction-based molecular optimization, where each data point typically provides only a single optimized reference molecule and no step-by-step optimization trajectory. We reveal that answer-only SFT on the reference molecules collapses reasoning, and RLVR provides sparse feedback under similarity constraints due to the model’s lack of effective exploration, which slows learning and limits optimization. To encourage the exploration of new molecules while balancing the exploitation of the reference molecules, we introduce Reference-guided Policy Optimization (RePO), an optimization approach that learns from reference molecules without requiring trajectory data. At each update, RePO samples candidate molecules with their intermediate reasoning trajectories from the model and trains the model using verifiable rewards that measure property satisfaction under similarity constraints in an RL manner. Meanwhile, it applies reference guidance by keeping the policy’s intermediate reasoning trajectory as context and training only the answer in a supervised manner. Together, the RL term promotes exploration, while the guidance term mitigates reward sparsity and stabilizes training by grounding outputs to references when many valid molecular edits exist. Across molecular optimization benchmarks, RePO consistently outperforms SFT and RLVR baselines (e.g., GRPO), achieving improvements on the optimization metric (Success Rate × Similarity), improving balance across competing objectives, and generalizing better to unseen instruction styles. Our code is publicly available at https://github.com/tmlr-group/RePo.
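
The way the two training signals in the abstract combine can be illustrated with a small numerical sketch. This is not the authors' implementation (see the linked repository for that). It assumes a binary verifiable reward, GRPO-style group-normalized advantages for the RL term, and a mean negative log-likelihood over the reference answer tokens for the guidance term. The function names, the similarity threshold of 0.4, and `guidance_weight` are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def verifiable_reward(property_ok, similarity, sim_threshold=0.4):
    """Binary reward: property satisfied AND similar enough to the source.

    Assumption: the 0.4 threshold and the binary form are illustrative only.
    """
    return 1.0 if property_ok and similarity >= sim_threshold else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one sampled group (GRPO-style assumption)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def combined_loss(sample_logprobs, rewards, ref_answer_token_logprobs,
                  guidance_weight=1.0):
    """Return (total, rl_term, guidance_term) for one prompt.

    sample_logprobs: log-probability of each sampled trajectory + answer.
    ref_answer_token_logprobs: per-token log-probabilities of the reference
        answer, conditioned on the policy's own sampled reasoning.
    """
    advantages = group_relative_advantages(rewards)
    # RL term: raise the likelihood of samples with above-average reward.
    rl_term = -mean([a * lp for a, lp in zip(advantages, sample_logprobs)])
    # Guidance term: supervised loss on the reference answer tokens only.
    guidance_term = -mean(ref_answer_token_logprobs)
    return rl_term + guidance_weight * guidance_term, rl_term, guidance_term


if __name__ == "__main__":
    # Every sample fails the similarity constraint: the rewards are all zero.
    rewards = [verifiable_reward(True, 0.1) for _ in range(4)]
    total, rl, guide = combined_loss([-3.0, -2.5, -4.0, -3.5], rewards,
                                     [-0.5, -1.5])
    print(total, rl, guide)
```

In this toy setting, when every sample in a group receives zero reward the advantages vanish and the RL term contributes no gradient signal, while the guidance term remains nonzero. That is one way to read the abstract's claim that reference guidance mitigates reward sparsity.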
Original language: English
Title of host publication: The Fourteenth International Conference on Learning Representations, ICLR 2026
Publisher: International Conference on Learning Representations, ICLR
Pages: 1-32
Number of pages: 32
Publication status: Published - 26 Jan 2026
Event: 14th International Conference on Learning Representations, ICLR 2026 - Rio de Janeiro, Brazil
Duration: 23 Apr 2026 - 27 Apr 2026
https://iclr.cc/Conferences/2026 (Conference website)
https://openreview.net/group?id=ICLR.cc/2026 (Conference proceedings)
https://iclr.cc/virtual/2026/calendar (Conference schedule)

Publication series

Name: International Conference on Learning Representations
Publisher: International Conference on Learning Representations, ICLR

Conference

Conference: 14th International Conference on Learning Representations, ICLR 2026
Abbreviated title: ICLR 2026
Country/Territory: Brazil
City: Rio de Janeiro
Period: 23/04/26 - 27/04/26

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 3 - Good Health and Well-being

User-Defined Keywords

  • Large Language Model
  • Molecular Optimization
  • LLM Reasoning
