---
name: efficient-opd-distillation
category: training
description: Efficient On-Policy Distillation (OPD) methodology for LLM reasoning. Covers Prune-OPD (dynamic prefix drift detection) and vOPD (control variate baseline for variance reduction). Use when implementing, stabilizing, or optimizing on-policy distillation for reasoning models.
---
# Efficient On-Policy Distillation (OPD)
## Problem
On-Policy Distillation (OPD) uses dense teacher rewards to train student LLMs on reasoning tasks. Two critical failure modes arise:
1. **Prefix drift** (Prune-OPD, arXiv:2605.07804): as the student's generated prefix diverges from the teacher's thought process, the teacher's dense reward loses local exploitability. Continuing to generate on "drifted" trajectories wastes compute and degrades reward quality.
2. **High gradient variance** (vOPD, arXiv:2605.07865): OPD's single-sample Monte Carlo estimator has high variance, making training unstable.
## Prune-OPD: Dynamic Prefix Drift Detection
### Core Idea
Monitor the local compatibility between student and teacher predictions (e.g., via top-k overlap) in real time. Upon detecting severe drift:
- Monotonically down-weight subsequent unreliable rewards
- Trigger dynamic rollout truncation
- Reallocate compute to reliable teacher supervision
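A minimal sketch of the compatibility monitor, assuming per-step logits from both models are available (the function name, shapes, and default `k` here are illustrative, not from the paper):

```python
import torch

def topk_overlap(student_logits: torch.Tensor,
                 teacher_logits: torch.Tensor,
                 k: int = 20) -> torch.Tensor:
    """Per-step fraction of the student's top-k tokens that also appear
    in the teacher's top-k set. Inputs: (seq_len, vocab_size) logits."""
    s_top = student_logits.topk(k, dim=-1).indices   # (seq_len, k)
    t_top = teacher_logits.topk(k, dim=-1).indices   # (seq_len, k)
    # Pairwise id comparison, then membership test along the teacher axis.
    shared = (s_top.unsqueeze(-1) == t_top.unsqueeze(-2)).any(dim=-1)
    return shared.float().mean(dim=-1)               # (seq_len,) in [0, 1]
```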
### Implementation Steps
1. **Compute top-k overlap** at each token step between the student and teacher distributions (see the sketch above)
2. **Detect prefix drift** when overlap drops below a threshold
3. **Apply monotonic reward decay**: once drift is detected, weights for subsequent tokens decay monotonically
4. **Trigger truncation**: halt generation early when drift is severe (steps 2–4 are sketched below)
5. **Dynamic window expansion**: when compatibility remains high, widen the window to preserve long-context supervision
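A hedged sketch of steps 2–4, building on `topk_overlap` above (all hyperparameter defaults are illustrative, not values from the paper):

```python
def drift_weights(overlap: torch.Tensor, threshold: float = 0.5,
                  decay: float = 0.9, min_weight: float = 0.05):
    """Monotonic reward down-weighting after drift, plus an
    early-truncation index once the weight becomes negligible."""
    weights = torch.ones_like(overlap)
    w, trunc_at = 1.0, overlap.numel()
    for t, ov in enumerate(overlap.tolist()):
        if ov < threshold:      # drift step: decay is applied and never
            w *= decay          # undone, so weights are non-increasing
        weights[t] = w
        if w < min_weight:      # severe drift: truncate the rollout here
            trunc_at = t + 1
            weights[trunc_at:] = 0.0
            break
    return weights, trunc_at
```

The returned weights multiply the per-token teacher rewards, and `trunc_at` caps the rollout length. Step 5 falls out naturally: trajectories whose overlap stays above threshold keep weight 1.0 for their full length, preserving long-context supervision.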
### Key Insight
Prune-OPD improves OPD not by blindly shortening rollouts, but by reallocating computation toward locally exploitable teacher rewards.
### Results
- Reduces training time by 37.6%–68.0% while preserving performance
- Works across diverse teacher-student combinations
- Performance preserved/improved on AMC, AIME, HMMT benchmarks
## vOPD: Control Variate Baseline for Stabilization
### Core Idea
Cast OPD as policy-gradient RL and stabilize training with a control variate baseline (the value function familiar from the RL literature).
### Key Insight
The OPD value function has a closed form: **per-token negative reverse KL divergence** between student and teacher. This is available directly from the already-computed forward pass — no additional critic or inference needed.
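Concretely, writing π_θ for the student and π_T for the teacher (notation ours, not the paper's), the per-token value at state s_t is:

$$
V(s_t) = -\mathrm{KL}\bigl(\pi_\theta(\cdot \mid s_t) \,\|\, \pi_T(\cdot \mid s_t)\bigr)
       = -\sum_{v} \pi_\theta(v \mid s_t)\,\log\frac{\pi_\theta(v \mid s_t)}{\pi_T(v \mid s_t)}
$$

Both distributions at s_t come from forward passes OPD already performs to compute its reward, which is why no learned critic is needed.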
### Implementation Steps
1. **Compute per-token reverse KL**: `-sum(student_probs * log(student_probs / teacher_probs))` (the expectation is under the student, so the sum is weighted by `student_probs`)
2. **Use it as a detached baseline**: subtract it from the single-sample estimator to reduce variance
3. **Keep the gradient unbiased**: a detached, action-independent baseline leaves the gradient estimate unbiased
4. **Top-k approximation**: approximate the baseline on a top-k support for further cost reduction (a sketch follows this list)
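A sketch of steps 1–3 under stated assumptions: we take the per-token OPD reward to be the teacher-minus-student log-probability of the sampled token, whose expectation under the student is exactly the negative reverse KL, i.e., the closed-form baseline (the function name and loss form are illustrative):

```python
import torch
import torch.nn.functional as F

def vopd_loss(student_logits, teacher_logits, sampled_ids):
    """Shapes: logits (seq_len, vocab_size), sampled_ids (seq_len,)."""
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    idx = sampled_ids.unsqueeze(-1)

    # Single-sample reward at the tokens the student actually sampled.
    reward = (t_logp.gather(-1, idx) - s_logp.gather(-1, idx)).squeeze(-1)

    # Closed-form value: per-token negative reverse KL over the vocab,
    # from the same forward pass -- no critic, no extra inference.
    baseline = -(s_logp.exp() * (s_logp - t_logp)).sum(dim=-1)

    # Detach the advantage so the baseline cannot bias the score-function
    # gradient; only log pi_student(sampled token) carries gradients.
    advantage = (reward - baseline).detach()
    logp_sampled = s_logp.gather(-1, idx).squeeze(-1)
    return -(advantage * logp_sampled).mean()
```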
### Advantages over existing methods
- Full-vocabulary KL: exact but expensive (sums over the entire vocabulary at every step)
- Top-k support only: cheaper, but truncating the objective itself biases the gradient
- **vOPD**: a lightweight single-sample estimator plus a detached closed-form baseline, i.e., unbiased and low variance
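For step 4 above, a hedged sketch of restricting the baseline sum to the student's top-k support (helper name and default `k` are illustrative):

```python
def topk_reverse_kl_baseline(s_logp, t_logp, k=64):
    """Approximate -KL(student || teacher) using only the student's top-k
    tokens; most of the student's probability mass sits there, so the
    dropped tail contributes little. Log-probs: (seq_len, vocab_size)."""
    top_logp, top_idx = s_logp.topk(k, dim=-1)   # student's top-k log-probs
    t_top_logp = t_logp.gather(-1, top_idx)      # matching teacher log-probs
    return -(top_logp.exp() * (top_logp - t_top_logp)).sum(dim=-1)
```

The key distinction: truncating the *objective* to top-k biases the gradient, but truncating only the *baseline* cannot, since any action-independent baseline, exact or approximate, leaves the policy-gradient estimator unbiased and affects only its variance.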
## Combined Prune-OPD + vOPD Pipeline
For maximum efficiency and stability in OPD:
```
For each training step:
1. Generate student trajectory on-policy
2. At each token, compute:
a. Top-k overlap with teacher (drift detection)
b. Per-token reverse KL (control variate baseline)
3. If drift detected:
- Apply monotonic reward weight decay
- Consider early truncation
4. Compute gradient with vOPD baseline
5. Apply reweighted gradient update
```
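Tying the pieces together, a hypothetical end-to-end training step, reusing the imports and helpers from the earlier sketches (`student`/`teacher` are assumed to be HF-style causal LMs; for brevity, drift is handled post hoc here, whereas a real system would monitor overlap during generation so truncation actually saves rollout compute):

```python
def combined_opd_step(student, teacher, prompt_ids, optimizer, k=20):
    # 1. On-policy rollout from the student.
    rollout = student.generate(prompt_ids, do_sample=True, max_new_tokens=512)
    n_prompt = prompt_ids.shape[-1]
    gen_ids = rollout[0, n_prompt:]

    # Logits at position i predict token i+1, hence the [n_prompt-1:-1] slice.
    s_logits = student(rollout).logits[0, n_prompt - 1 : -1]
    with torch.no_grad():
        t_logits = teacher(rollout).logits[0, n_prompt - 1 : -1]

    # 2a + 3. Drift detection, monotonic decay, truncation point.
    overlap = topk_overlap(s_logits, t_logits, k=k)
    weights, trunc_at = drift_weights(overlap)

    # 2b + 4. vOPD advantage on the kept prefix, reweighted by drift weights.
    s_logp = F.log_softmax(s_logits[:trunc_at], dim=-1)
    t_logp = F.log_softmax(t_logits[:trunc_at], dim=-1)
    idx = gen_ids[:trunc_at].unsqueeze(-1)
    reward = (t_logp.gather(-1, idx) - s_logp.gather(-1, idx)).squeeze(-1)
    baseline = -(s_logp.exp() * (s_logp - t_logp)).sum(dim=-1)
    advantage = ((reward - baseline) * weights[:trunc_at]).detach()

    # 5. Reweighted policy-gradient update.
    loss = -(advantage * s_logp.gather(-1, idx).squeeze(-1)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```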
## Pitfalls
- **Drift threshold too sensitive**: an aggressive threshold may truncate valid reasoning paths. Start with a conservative setting.
- **Not all tasks benefit from truncation**: tasks requiring long reasoning chains need wider compatibility windows.
- **Single-sample MC variance**: even with the vOPD baseline, the variance reduction achieved depends on baseline quality.
- **Teacher quality matters**: poor teacher rewards amplify all OPD failure modes.
- **Monitoring overhead**: the top-k overlap computation adds per-token cost; use efficient approximations.
## Activation Keywords
- on-policy distillation, OPD, Prune-OPD, vOPD, prefix drift
- reasoning model distillation, teacher-student training
- control variate baseline, reverse KL divergence
- OPD stabilization, dense reward supervision