EDGE-OPD: Internalizing Privileged Context in LLM Distillation

On-Policy Distillation (OPD) is an attractive paradigm for improving LLM capabilities without introducing distributional drift. On-Policy Self-Distillation (OPSD) builds on it by incorporating privileged context (e.g., a persona or a private fact) into training. However, OPSD risks altering the model's reasoning and degrading its general skills.

To address this, the authors propose EDGE-OPD, which modifies OPSD in two ways: (a) guided rollouts inject the desired privileged-context behavior during sampling, so that the behavior is present in the on-policy data; and (b) an evidence mask restricts student updates to token positions where the privileged context supports the sampled token, rather than every token in the rollout.
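The evidence mask can be illustrated with a minimal sketch. The summary does not say how "the privileged context supports the sampled token" is measured, so the code below assumes one plausible reading: a position counts as positive evidence when the log-probability of the sampled token under the context-conditioned model exceeds its log-probability under the context-free student by more than a threshold. The function names, the threshold parameter, and the use of the per-token log-ratio as the training signal are all hypothetical and are not taken from the paper.

```python
from typing import List


def evidence_mask(
    teacher_logps: List[float],
    student_logps: List[float],
    threshold: float = 0.0,
) -> List[bool]:
    """Mark positions where the privileged context raises the sampled token's log-prob.

    ASSUMPTION: "support" is read as teacher_logp - student_logp > threshold.
    teacher_logps: log-probs of the sampled tokens WITH privileged context.
    student_logps: log-probs of the same tokens WITHOUT privileged context.
    """
    if len(teacher_logps) != len(student_logps):
        raise ValueError("log-prob sequences must have the same length")
    return [t - s > threshold for t, s in zip(teacher_logps, student_logps)]


def masked_signal(
    teacher_logps: List[float],
    student_logps: List[float],
    threshold: float = 0.0,
) -> List[float]:
    """Per-token training signal that is zero outside the evidence mask.

    ASSUMPTION: the signal is the log-ratio teacher_logp - student_logp.
    Positions that fail the mask contribute nothing to the update.
    """
    mask = evidence_mask(teacher_logps, student_logps, threshold)
    return [
        (t - s) if keep else 0.0
        for t, s, keep in zip(teacher_logps, student_logps, mask)
    ]


# Toy rollout of three sampled tokens (made-up values for illustration only).
teacher = [-0.5, -2.0, -0.25]
student = [-1.0, -1.0, -3.0]

mask = evidence_mask(teacher, student)
signal = masked_signal(teacher, student)
```

In this toy rollout the second token is less likely under the privileged context than without it, so it is masked out and receives no update; only the first and third positions carry signal.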

Empirical results show that standard OPSD fails to learn the target identities, whereas adding guided rollouts enables the model to learn the desired identity. Ablation studies further indicate that the persona signal is localized to the positive-evidence tail, which offers insight into efficient knowledge transfer and the preservation of general-purpose capabilities.
