This research introduces Oracle-Prompted Policy Optimization (OPPO), a novel reinforcement learning approach that provides per-token credit assignment for LLM reasoning. OPPO leverages Bayesian updates to accumulate oracle signals along a trajectory, yielding token-level advantages.
Current reinforcement learning methods for improving LLM reasoning often assign trajectory-level advantages, diluting the signal at crucial reasoning steps. We propose Oracle-Prompted Policy Optimization (OPPO), which addresses this by grounding policy optimization in the natural Bayesian update of the model's belief about eventual success. By accumulating the oracle signals used in prior distillation methods, OPPO derives a running estimate of success probability and a token-level advantage for every position in the output. A first-order analysis shows that this advantage factorizes into a per-token discrimination signal modulated by a state weight, concentrating credit on pivotal tokens. Empirically, OPPO significantly improves on existing algorithms (GRPO, DAPO, SDPO) across seven reasoning benchmarks, with gains of up to +6.0 points on AMC'23 and +5.2 points on AIME'24, and with the gains widening monotonically with response length.
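
As a rough illustration of the mechanism described above, the following sketch shows one way a running success estimate and token-level advantages could be accumulated from per-token oracle signals. The abstract does not state the exact update rule, so everything here is an assumption: the oracle signal is taken to be a per-token log-ratio between oracle-prompted and unprompted token probabilities, the belief is updated multiplicatively, and the advantage is the change in belief at each token. The function names `token_advantages` and `first_order_advantages`, the prior value, and the clamp at 1.0 are hypothetical.

```python
import math


def token_advantages(prior, log_ratios):
    """Accumulate an assumed Bayesian belief in success along a trajectory.

    Assumed update (not taken from the paper): p_t = p_{t-1} * exp(r_t),
    where r_t is the per-token oracle log-ratio. The advantage of token t
    is the change in belief it causes, A_t = p_t - p_{t-1}.
    """
    beliefs = [prior]
    advantages = []
    for r in log_ratios:
        p_prev = beliefs[-1]
        # Clamp to 1.0 as a guard against noisy ratio estimates (assumption).
        p_next = min(1.0, p_prev * math.exp(r))
        beliefs.append(p_next)
        advantages.append(p_next - p_prev)
    return beliefs, advantages


def first_order_advantages(beliefs, log_ratios):
    """First-order form: state weight p_{t-1} times discrimination signal r_t."""
    # zip stops at the shorter list, pairing beliefs[t-1] with log_ratios[t].
    return [p * r for p, r in zip(beliefs, log_ratios)]


# Hypothetical signals: a neutral token, a helpful token, a harmful token.
signals = [0.0, 0.5, -0.2]
beliefs, advantages = token_advantages(0.5, signals)
approx = first_order_advantages(beliefs, signals)
```

Under these assumptions the advantages telescope, so their sum equals the total change in the success estimate over the trajectory, and a token with a zero oracle signal receives zero credit. The first-order form makes the factorization visible: each token's credit is its discrimination signal scaled by the belief held just before that token.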