This paper introduces Evidence Guided On-Policy Distillation (EDGE-OPD), a novel method for distilling privileged context (like personas or private facts) into Large Language Models. It addresses the challenge that privileged information can degrade general capabilities during tr
On-Policy Distillation (OPD) is an attractive paradigm for improving LLM capabilities without introducing distributional drift. Building on this, On-Policy Self-Distillation (OPSD) incorporates privileged context (e.g., a persona or a private fact) into the training process. However, because OPSD updates the student on every token of a rollout, it risks altering the model's reasoning and degrading its general skills.
To solve this, the authors propose EDGE-OPD, which modifies OPSD with two key features: a) using guided rollouts to inject the desired privileged-context behavior during sampling, ensuring the behavior is present in the on-policy data; and b) applying an evidence mask, updating the student model only at token positions where the privileged context supports the sampled token, rather than on every token in the rollout.
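The evidence mask in (b) can be illustrated with a minimal sketch. The summary does not state how "supports" is measured, so this sketch assumes one plausible reading: a position counts as evidence when conditioning on the privileged context raises the log-probability of the sampled token by more than a threshold. The function names, the threshold, and all numbers below are hypothetical and are not taken from the paper.

```python
from typing import List


def evidence_mask(
    logp_with_ctx: List[float],
    logp_without_ctx: List[float],
    threshold: float = 0.0,
) -> List[bool]:
    """Flag positions where the privileged context raises the sampled
    token's log-probability by more than `threshold`.

    ASSUMPTION: "evidence" is the log-probability gap between the
    context-conditioned model and the unconditioned model, evaluated
    on the token that was actually sampled.
    """
    if len(logp_with_ctx) != len(logp_without_ctx):
        raise ValueError("log-prob sequences must have equal length")
    return [
        (with_ctx - without_ctx) > threshold
        for with_ctx, without_ctx in zip(logp_with_ctx, logp_without_ctx)
    ]


def masked_mean(per_token_loss: List[float], mask: List[bool]) -> float:
    """Average a per-token loss over masked positions only.

    Returns 0.0 when no position is selected, so a rollout without
    any evidence contributes no update.
    """
    if len(per_token_loss) != len(mask):
        raise ValueError("loss and mask must have equal length")
    selected = [loss for loss, keep in zip(per_token_loss, mask) if keep]
    return sum(selected) / len(selected) if selected else 0.0


# Hypothetical per-token log-probs of the sampled tokens in one rollout.
logp_with_ctx = [-0.2, -1.0, -0.1, -2.0]
logp_without_ctx = [-0.3, -0.9, -2.5, -2.0]

# Hypothetical per-token distillation losses for the same rollout.
per_token_loss = [0.5, 0.7, 3.0, 0.2]

mask = evidence_mask(logp_with_ctx, logp_without_ctx)
loss = masked_mean(per_token_loss, mask)
```

In this toy rollout only the first and third positions pass the mask, so the loss averages over two tokens instead of four. The per-token loss is deliberately left abstract: whichever OPSD objective is in use would supply those values, and the mask only decides where it applies.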
Empirical results show that models trained with standard OPSD fail to learn the target identities, whereas adding guided rollouts enables the model to learn the desired identity. Ablation studies further show that the persona signal is localized to the positive-evidence tail, offering insight into efficient knowledge transfer and the preservation of general-purpose capabilities.