OPPO: Bayesian Value Recursion for Token-Level Credit Assignment in LLM Reasoning

Current reinforcement learning methods for improving LLM reasoning often assign advantages at the trajectory level, which dilutes the learning signal at crucial reasoning steps. We propose Oracle-Prompted Policy Optimization (OPPO), which addresses this by grounding policy optimization in the natural Bayesian update of the model's belief about eventual success. By accumulating the oracle signals used in prior distillation methods, OPPO derives a running estimate of the success probability and a token-level advantage for every position in the output. A first-order analysis shows that this advantage factorizes into a per-token discrimination signal modulated by a state weight, concentrating credit on pivotal tokens. Across seven reasoning benchmarks, OPPO outperforms existing algorithms (GRPO, DAPO, SDPO), with gains of up to +6.0 points on AMC'23 and +5.2 points on AIME'24; the gains widen monotonically with response length.
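The abstract does not spell out the recursion itself, so the following is only an illustrative sketch of one way a Bayesian update of a success belief can yield token-level advantages. It assumes (these are assumptions, not details from the paper) that the oracle signal at each token is the log-probability the oracle-prompted model assigns to that token, that the running value is updated by Bayes' rule, and that the advantage is the change in value. The function name `bayesian_value_recursion` and all parameter names are hypothetical.

```python
import math

def bayesian_value_recursion(logp_oracle, logp_policy, prior_success):
    """Sketch of a running success estimate and per-token advantages.

    ASSUMPTIONS (not taken from the paper):
      logp_oracle[t]: log-prob of token t under an oracle-prompted model,
                      used here as a stand-in for P(y_t | success, y_<t)
      logp_policy[t]: log-prob of token t under the unprompted policy,
                      used here as P(y_t | y_<t)
      prior_success:  initial belief P(success) before any token is seen
    """
    values = [prior_success]
    advantages = []
    for lo, lp in zip(logp_oracle, logp_policy):
        v_prev = values[-1]
        # Bayes' rule:
        # P(S | y_<=t) = P(S | y_<t) * P(y_t | S, y_<t) / P(y_t | y_<t)
        v_next = v_prev * math.exp(lo - lp)
        # Two separate model calls need not be mutually coherent,
        # so keep the estimate inside [0, 1].
        v_next = min(max(v_next, 0.0), 1.0)
        values.append(v_next)
        # Advantage as the change in believed success probability.
        advantages.append(v_next - v_prev)
    return values, advantages

# Toy example: the first token is uninformative, the second favors success.
values, advantages = bayesian_value_recursion(
    logp_oracle=[math.log(0.5), math.log(0.9)],
    logp_policy=[math.log(0.5), math.log(0.6)],
    prior_success=0.4,
)
```

Under these assumptions, with d = logp_oracle[t] - logp_policy[t], the advantage is v_prev * (exp(d) - 1), which is approximately v_prev * d for small d: a per-token discrimination term scaled by a state-dependent weight. This is consistent in shape with the factorization described above, but the paper's exact form may differ.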
