---
name: efficient-opd-distillation
category: training
description: Efficient On-Policy Distillation (OPD) methodology for LLM reasoning. Covers Prune-OPD (dynamic prefix drift detection) and vOPD (control variate baseline for variance reduction). Use when implementing, stabilizing, or optimizing on-policy distillation for reasoning models.
---
# Efficient On-Policy Distillation (OPD)
## Problem
On-Policy Distillation (OPD) uses dense teacher rewards to train student LLMs on reasoning tasks. Two critical failure modes arise:
1. **Prefix drift** (Prune-OPD, arXiv:2605.07804): as the student's generated prefix diverges from the teacher's thought process, the teacher's dense reward loses local exploitability. Continuing to generate on "drifted" trajectories wastes compute and degrades reward quality.
2. **High gradient variance** (vOPD, arXiv:2605.07865): OPD's single-sample Monte Carlo estimator has high variance, making training unstable.
## Prune-OPD: Dynamic Prefix Drift Detection
### Core Idea
Monitor the local compatibility between student and teacher predictions (e.g., via top-k overlap) in real time. Upon detecting severe drift:
- Monotonically down-weight subsequent unreliable rewards
- Trigger dynamic rollout truncation
- Reallocate compute to reliable teacher supervision
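A minimal sketch of the compatibility monitor, assuming per-step logits from both models are available (the function name, shapes, and default `k` here are illustrative, not from the paper):

```python
import torch

def topk_overlap(student_logits: torch.Tensor,
                 teacher_logits: torch.Tensor,
                 k: int = 20) -> torch.Tensor:
    """Per-step fraction of the student's top-k tokens that also appear
    in the teacher's top-k set. Inputs: (seq_len, vocab_size) logits."""
    s_top = student_logits.topk(k, dim=-1).indices   # (seq_len, k)
    t_top = teacher_logits.topk(k, dim=-1).indices   # (seq_len, k)
    # Pairwise id comparison, then membership test along the teacher axis.
    shared = (s_top.unsqueeze(-1) == t_top.unsqueeze(-2)).any(dim=-1)
    return shared.float().mean(dim=-1)               # (seq_len,) in [0, 1]
```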
### Implementation Steps
1. **Compute top-k overlap** at each token step between the student and teacher distributions (see the sketch above)
2. **Detect prefix drift** when overlap drops below a threshold
3. **Apply monotonic reward decay**: once drift is detected, weights for subsequent tokens decay monotonically
4. **Trigger truncation**: halt generation early when drift is severe (steps 2–4 are sketched below)
5. **Dynamic window expansion**: when compatibility remains high, widen the window to preserve long-context supervision
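A hedged sketch of steps 2–4, building on `topk_overlap` above (all hyperparameter defaults are illustrative, not values from the paper):

```python
def drift_weights(overlap: torch.Tensor, threshold: float = 0.5,
                  decay: float = 0.9, min_weight: float = 0.05):
    """Monotonic reward down-weighting after drift, plus an
    early-truncation index once the weight becomes negligible."""
    weights = torch.ones_like(overlap)
    w, trunc_at = 1.0, overlap.numel()
    for t, ov in enumerate(overlap.tolist()):
        if ov < threshold:      # drift step: decay is applied and never
            w *= decay          # undone, so weights are non-increasing
        weights[t] = w
        if w < min_weight:      # severe drift: truncate the rollout here
            trunc_at = t + 1
            weights[trunc_at:] = 0.0
            break
    return weights, trunc_at
```

The returned weights multiply the per-token teacher rewards, and `trunc_at` caps the rollout length. Step 5 falls out naturally: trajectories whose overlap stays above threshold keep weight 1.0 for their full length, preserving long-context supervision.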
### Key Insight
Prune-OPD improves OPD not by blindly shortening rollouts, but by reallocating computation toward locally exploitable teacher rewards.
### Results
- Reduces training time by 37.6%–68.0% while preserving performance
- Works across diverse teacher-student combinations
- Performance preserved/improved on AMC, AIME, HMMT benchmarks
## vOPD: Control Variate Baseline for Stabilization
### Core Idea
Cast OPD as policy-gradient RL and stabilize training with a control variate baseline (the value function familiar from the RL literature).
### Key Insight
The OPD value function has a closed form: **per-token negative reverse KL divergence** between student and teacher. This is available directly from the already-computed forward pass — no additional critic or inference needed.
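Concretely, writing π_θ for the student and π_T for the teacher (notation ours, not the paper's), the per-token value at state s_t is:

$$
V(s_t) = -\mathrm{KL}\bigl(\pi_\theta(\cdot \mid s_t) \,\|\, \pi_T(\cdot \mid s_t)\bigr)
       = -\sum_{v} \pi_\theta(v \mid s_t)\,\log\frac{\pi_\theta(v \mid s_t)}{\pi_T(v \mid s_t)}
$$

Both distributions at s_t come from forward passes OPD already performs to compute its reward, which is why no learned critic is needed.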
### Implementation Steps
1. **Compute per-token reverse KL**: `-sum(student_probs * log(student_probs / teacher_probs))` (the expectation is under the student, so the sum is weighted by `student_probs`)
2. **Use it as a detached baseline**: subtract it from the single-sample estimator to reduce variance
3. **Keep the gradient unbiased**: a detached, action-independent baseline leaves the gradient estimate unbiased
4. **Top-k approximation**: approximate the baseline on a top-k support for further cost reduction (a sketch follows this list)
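A sketch of steps 1–3 under stated assumptions: we take the per-token OPD reward to be the teacher-minus-student log-probability of the sampled token, whose expectation under the student is exactly the negative reverse KL, i.e., the closed-form baseline (the function name and loss form are illustrative):

```python
import torch
import torch.nn.functional as F

def vopd_loss(student_logits, teacher_logits, sampled_ids):
    """Shapes: logits (seq_len, vocab_size), sampled_ids (seq_len,)."""
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    idx = sampled_ids.unsqueeze(-1)

    # Single-sample reward at the tokens the student actually sampled.
    reward = (t_logp.gather(-1, idx) - s_logp.gather(-1, idx)).squeeze(-1)

    # Closed-form value: per-token negative reverse KL over the vocab,
    # from the same forward pass -- no critic, no extra inference.
    baseline = -(s_logp.exp() * (s_logp - t_logp)).sum(dim=-1)

    # Detach the advantage so the baseline cannot bias the score-function
    # gradient; only log pi_student(sampled token) carries gradients.
    advantage = (reward - baseline).detach()
    logp_sampled = s_logp.gather(-1, idx).squeeze(-1)
    return -(advantage * logp_sampled).mean()
```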
### Advantages over existing methods
- Full-vocabulary KL: exact but expensive (sums over the entire vocabulary at every step)
- Top-k support only: cheaper, but truncating the objective itself biases the gradient
- **vOPD**: a lightweight single-sample estimator plus a detached closed-form baseline, i.e., unbiased and low variance
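For step 4 above, a hedged sketch of restricting the baseline sum to the student's top-k support (helper name and default `k` are illustrative):

```python
def topk_reverse_kl_baseline(s_logp, t_logp, k=64):
    """Approximate -KL(student || teacher) using only the student's top-k
    tokens; most of the student's probability mass sits there, so the
    dropped tail contributes little. Log-probs: (seq_len, vocab_size)."""
    top_logp, top_idx = s_logp.topk(k, dim=-1)   # student's top-k log-probs
    t_top_logp = t_logp.gather(-1, top_idx)      # matching teacher log-probs
    return -(top_logp.exp() * (top_logp - t_top_logp)).sum(dim=-1)
```

The key distinction: truncating the *objective* to top-k biases the gradient, but truncating only the *baseline* cannot, since any action-independent baseline, exact or approximate, leaves the policy-gradient estimator unbiased and affects only its variance.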
## Combined Prune-OPD + vOPD Pipeline
For maximum efficiency and stability in OPD:
```
For each training step:
1. Generate student trajectory on-policy
2. At each token, compute:
a. Top-k overlap with teacher (drift detection)
b. Per-token reverse KL (control variate baseline)
3. If drift detected:
- Apply monotonic reward weight decay
- Consider early truncation
4. Compute gradient with vOPD baseline
5. Apply reweighted gradient update
```
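Tying the pieces together, a hypothetical end-to-end training step, reusing the imports and helpers from the earlier sketches (`student`/`teacher` are assumed to be HF-style causal LMs; for brevity, drift is handled post hoc here, whereas a real system would monitor overlap during generation so truncation actually saves rollout compute):

```python
def combined_opd_step(student, teacher, prompt_ids, optimizer, k=20):
    # 1. On-policy rollout from the student.
    rollout = student.generate(prompt_ids, do_sample=True, max_new_tokens=512)
    n_prompt = prompt_ids.shape[-1]
    gen_ids = rollout[0, n_prompt:]

    # Logits at position i predict token i+1, hence the [n_prompt-1:-1] slice.
    s_logits = student(rollout).logits[0, n_prompt - 1 : -1]
    with torch.no_grad():
        t_logits = teacher(rollout).logits[0, n_prompt - 1 : -1]

    # 2a + 3. Drift detection, monotonic decay, truncation point.
    overlap = topk_overlap(s_logits, t_logits, k=k)
    weights, trunc_at = drift_weights(overlap)

    # 2b + 4. vOPD advantage on the kept prefix, reweighted by drift weights.
    s_logp = F.log_softmax(s_logits[:trunc_at], dim=-1)
    t_logp = F.log_softmax(t_logits[:trunc_at], dim=-1)
    idx = gen_ids[:trunc_at].unsqueeze(-1)
    reward = (t_logp.gather(-1, idx) - s_logp.gather(-1, idx)).squeeze(-1)
    baseline = -(s_logp.exp() * (s_logp - t_logp)).sum(dim=-1)
    advantage = ((reward - baseline) * weights[:trunc_at]).detach()

    # 5. Reweighted policy-gradient update.
    loss = -(advantage * s_logp.gather(-1, idx).squeeze(-1)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```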
## Pitfalls
- **Drift threshold too sensitive**: an aggressive threshold may truncate valid reasoning paths. Start with a conservative setting.
- **Not all tasks benefit from truncation**: tasks requiring long reasoning chains need wider compatibility windows.
- **Single-sample MC variance**: even with the vOPD baseline, the variance reduction achieved depends on baseline quality.
- **Teacher quality matters**: poor teacher rewards amplify all OPD failure modes.
- **Monitoring overhead**: the top-k overlap computation adds per-token cost; use efficient approximations.
## Activation Keywords
- on-policy distillation, OPD, Prune-OPD, vOPD, prefix drift
- reasoning model distillation, teacher-student training
- control variate baseline, reverse KL divergence
- OPD stabilization, dense reward supervision