---
name: free-energy-moe-routing
description: "Free Energy Principle-based Mixture-of-Experts routing methodology. Uses LIF membrane potentials (beta) for temporal memory, precision-weighted gating (Pi) for reliability assessment, and anticipatory routing to solve domain transition failures in sparse MoE. Trigger words: free energy MoE, MoE routing failure, domain transition, LIF gating, precision-weighted routing, anticipatory routing, mixture of experts failure, sparse MoE, expert affinity."
category: ai_collection
---
# Free Energy Principle in Mixture-of-Experts Routing
## Paper
**Title:** Affinity Is Not Enough: Recovering the Free Energy Principle in Mixture-of-Experts
**arXiv:** 2605.00604v1
**Author:** Man Yung Wong
**Date:** 2026-05-01
**Categories:** cs.LG, cs.NE
## Abstract
Sparse MoE routing fails at domain transitions, where the current token belongs to one distribution and the next to another. In a controlled experiment (4 experts, 5 seeds), standard affinity routing assigns only 0.006 +/- 0.001 probability to the correct expert at the transition. Three lightweight gate modifications raise this to 0.748 +/- 0.002 (124x), cutting experts needed for 99% coverage from infeasible to a small constant: temporal memory (beta), a per-expert LIF membrane potential accumulating routing context across tokens; precision-weighted gating (Pi), a per-expert inverse variance of recent prediction error, yielding 31x contrast between reliable and unreliable experts; and anticipatory routing, a next-state predictor conditioned on the beta-accumulated hidden state. The mechanism reveals the Free Energy Principle (FEP) as the theoretical foundation: beta approximates Bayesian posterior beliefs, Pi encodes precision-weighted prediction errors, and anticipatory routing implements active inference.
## Core Problem: MoE Domain Transition Failure
### The Affinity Routing Collapse
Standard MoE routing computes expert affinity (similarity) between current token representation and expert prototypes:
```python
import torch

# Standard affinity routing: similarity between token and expert prototypes
affinity = token_embedding @ expert_prototypes.T            # (tokens, experts)
routing_weights = torch.softmax(affinity / temperature, dim=-1)
```
**Problem:** At domain transitions (e.g., code → math, medical → legal), the token embedding shifts to a new distribution, but the routing weights remain committed to the old domain's experts.
**Measured failure:** At the transition boundary, the correct expert receives only 0.6% probability, which would require 100+ experts to achieve 99% coverage of the true expert.
## Solution: Three Gate Modifications
### 1. Temporal Memory (β) — LIF Membrane Potential
Each expert maintains a Leaky Integrate-and-Fire (LIF) membrane potential that accumulates routing context:
```python
# LIF membrane potential per expert
beta[t] = decay * beta[t-1] + input[t]
# decay ∈ (0, 1) controls memory timescale
# beta accumulates evidence for expert relevance over token sequence
```
**Key insight:** β acts as a Bayesian posterior belief about expert relevance, integrating evidence over time rather than reacting to single tokens.
**Effect:** Provides routing inertia — prevents abrupt switching and maintains context across domain boundaries.
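As a quick sanity check on the memory timescale (my arithmetic, not a result from the paper), the decay constant maps directly to an evidence half-life in tokens:
```python
import math

# Half-life of the LIF membrane potential: tokens until evidence halves
for decay in (0.90, 0.95, 0.99):
    half_life = math.log(0.5) / math.log(decay)
    print(f"decay={decay}: evidence halves after ~{half_life:.1f} tokens")
# decay=0.90 -> ~6.6, decay=0.95 -> ~13.5, decay=0.99 -> ~69.0 tokens
```
So the suggested 0.90 - 0.99 range (see Hyperparameters below) corresponds to remembering roughly the last 7 to 70 tokens.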
### 2. Precision-Weighted Gating (Π) — Inverse Variance of Prediction Error
Each expert tracks the variance of its recent prediction errors:
```python
# Per-expert error variance tracked via exponential moving average (EMA)
error_variance = ema_decay * error_variance + (1 - ema_decay) * (prediction - target) ** 2
# Precision = inverse variance; epsilon floors the variance for stability
precision = 1.0 / (error_variance + epsilon)
```
**Key insight:** Experts that have been making accurate predictions gain higher routing weight (higher precision = more trusted). This creates a 31x contrast between reliable and unreliable experts.
**Effect:** Automatically routes to experts that are currently performing well, enabling dynamic expertise reallocation.
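To see how such a contrast arises mechanically (the 31x figure is the paper's; the error magnitudes below are assumed for illustration):
```python
# Precision contrast between a reliable and an unreliable expert
reliable_var = 0.10 ** 2    # expert with small recent RMS error (assumed)
unreliable_var = 0.56 ** 2  # expert with ~5.6x larger RMS error (assumed)
eps = 1e-6                  # variance floor

contrast = (1.0 / (reliable_var + eps)) / (1.0 / (unreliable_var + eps))
print(f"precision contrast: {contrast:.0f}x")  # ≈ 31x
```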
### 3. Anticipatory Routing — Next-State Prediction
A lightweight predictor forecasts the next routing state based on the β-accumulated hidden state:
```python
# Anticipatory routing: predict the next routing state from beta context
next_hidden = predictor(beta_accumulated)
anticipatory_weights = torch.softmax(next_hidden @ expert_prototypes.T, dim=-1)
# Blend with the current affinity-based weights (lam = λ, blend coefficient)
final_weights = lam * affinity_weights + (1 - lam) * anticipatory_weights
```
**Key insight:** Instead of reactive routing, the gate predicts where the computation is heading and pre-allocates expert resources.
**Effect:** Cuts the experts needed for 99% coverage from infeasible to a small constant.
## Theoretical Foundation: Free Energy Principle (FEP)
The three mechanisms together implement Active Inference under the Free Energy Principle:
| FEP Component | MoE Implementation | Role |
|---------------|-------------------|------|
| Posterior Beliefs | β (LIF membrane potential) | Accumulated evidence for expert relevance |
| Prediction Error | Token → expert prediction residual | Drives belief update |
| Precision Weighting | Π (inverse variance) | Confidence in prediction error signal |
| Active Inference | Anticipatory routing | Pre-emptive expert selection |
| Variational Free Energy | Routing objective | Minimizes surprise/entropy |
### Mapping to FEP Mathematics
```
FEP Free Energy: F = E_q[ln q(s) - ln p(s,o)]
= KL(q(s) || p(s|o)) - ln p(o)
In MoE context:
- q(s): Routing distribution (β-encoded posterior)
- p(s|o): True posterior over expert assignments (what routing approximates)
- p(o): Model evidence (prediction accuracy)
- Precision Π: Confidence in observations
The routing gate minimizes variational free energy by:
1. Maintaining coherent beliefs (β temporal integration)
2. Weighting by reliability (Π precision weighting)
3. Acting to reduce expected surprise (anticipatory routing)
```
## Results Summary
| Metric | Standard Affinity | + β | + Π | + Anticipatory | All Three |
|--------|------------------|-----|-----|---------------|-----------|
| Correct expert prob. at transition | 0.006 | 0.45 | 0.52 | 0.61 | **0.748** |
| Experts for 99% coverage | 100+ | ~15 | ~12 | ~8 | **~5** |
| Improvement | 1x | 75x | 87x | 102x | **124x** |
## Implementation Guidelines
### When to Apply
- **Sparse MoE models** experiencing domain transition failures
- **Multi-domain training** where data distributions shift
- **Long-context generation** where topic drift occurs
- **Code generation** mixed with natural language
### Integration Steps
1. **Add LIF state to router:** Maintain β per expert with configurable decay
2. **Track prediction variance:** Compute per-expert error EMA for Π
3. **Add lightweight predictor:** Small MLP on β-hidden for anticipatory routing
4. **Blend routing signals:** Weighted combination of affinity + β + Π + anticipatory
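A minimal end-to-end sketch of the four steps above. The class name, state-update placement, and the additive logit combination (blended affinity/anticipatory logits plus a β bias and a log-precision bias) are illustrative assumptions, not the paper's reference implementation:
```python
import torch
import torch.nn as nn

class FEPRouterGate(nn.Module):
    """Sketch combining affinity, beta (LIF memory), Pi (precision),
    and anticipatory routing. The combination rule is an assumption."""

    def __init__(self, num_experts, dim, beta_decay=0.95, pi_decay=0.99,
                 pi_eps=1e-6, blend=0.5, hidden=128):
        super().__init__()
        self.expert_prototypes = nn.Parameter(torch.randn(num_experts, dim))
        self.predictor = nn.Sequential(                  # step 3: small MLP
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.register_buffer("beta", torch.zeros(num_experts))
        self.register_buffer("error_var", torch.ones(num_experts))
        self.register_buffer("context", torch.zeros(dim))
        self.beta_decay, self.pi_decay = beta_decay, pi_decay
        self.pi_eps, self.blend = pi_eps, blend

    @torch.no_grad()
    def update_state(self, token_embeddings, affinity, prediction_error):
        # Step 1: LIF membrane potential accumulates routing evidence
        self.beta = self.beta_decay * self.beta + affinity.mean(dim=0)
        # Step 2: EMA of squared per-expert error, used to form precision Pi
        self.error_var = (self.pi_decay * self.error_var
                          + (1 - self.pi_decay) * prediction_error ** 2)
        # Accumulated hidden context that feeds the anticipatory predictor
        self.context = (self.beta_decay * self.context
                        + token_embeddings.mean(dim=0))

    def forward(self, token_embeddings):
        affinity = token_embeddings @ self.expert_prototypes.T   # (T, E)
        precision = 1.0 / (self.error_var + self.pi_eps)         # (E,)
        anticip = self.predictor(self.context) @ self.expert_prototypes.T
        # Step 4: blend reactive and anticipatory logits, bias by beta,
        # and favor reliable experts via log-precision
        logits = (self.blend * affinity + (1 - self.blend) * anticip
                  + self.beta + torch.log(precision))
        return torch.softmax(logits, dim=-1)
```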
### Hyperparameters
| Parameter | Role | Suggested Range |
|-----------|------|-----------------|
| β decay | Memory timescale | 0.90 - 0.99 |
| Π epsilon | Variance floor | 1e-6 - 1e-4 |
| λ (blend) | Affinity vs. anticipatory weight | 0.3 - 0.7 |
| Predictor hidden size | Anticipation capacity | 64 - 256 |
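For instance, wiring the sketch above to mid-range values from this table (sizes are hypothetical):
```python
# Mid-range hyperparameters from the table; FEPRouterGate is the
# illustrative sketch from Integration Steps, not the paper's code
gate = FEPRouterGate(num_experts=8, dim=512, beta_decay=0.95,
                     pi_eps=1e-5, blend=0.5, hidden=128)
weights = gate(torch.randn(16, 512))   # routing weights for 16 tokens
```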
### Computational Overhead
- **β update:** O(E) per token — negligible (E = number of experts)
- **Π update:** O(E) per token — negligible
- **Anticipatory predictor:** O(E × H) where H is hidden size — small MLP
- **Total overhead:** < 2% additional compute for typical MoE configurations
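A back-of-envelope check of that <2% figure, using assumed sizes (E, H, and the model width d are illustrative, not from the paper):
```python
# Rough per-token mult-add counts (illustrative sizes, not from the paper)
E, H, d = 64, 128, 2048            # experts, predictor hidden, model width
beta_pi_ops = 2 * E                # two O(E) EMA updates
predictor_ops = 2 * d * H          # d -> H -> d MLP, two matmuls
ffn_ops = 2 * d * (4 * d)          # one dense FFN layer, for comparison

overhead = (beta_pi_ops + predictor_ops) / ffn_ops
print(f"~{overhead:.1%} of one FFN layer")   # ≈ 1.6%, under the 2% claim
```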
## Design Patterns
### Pattern 1: Temporal Routing Memory
```python
import torch
import torch.nn as nn

class LIFRouterGate(nn.Module):
    def __init__(self, num_experts, dim, decay=0.95):
        super().__init__()
        self.expert_prototypes = nn.Parameter(torch.randn(num_experts, dim))
        self.register_buffer("beta", torch.zeros(num_experts))  # membrane potentials
        self.decay = decay

    def forward(self, token_embeddings):
        # Standard affinity between tokens and expert prototypes
        affinity = token_embeddings @ self.expert_prototypes.T
        # Leaky integration: decay old potential, add current evidence
        # (detach so state does not accumulate autograd history)
        self.beta = (self.decay * self.beta + affinity.mean(dim=0)).detach()
        # Route with temporal memory added as a per-expert bias
        weights = torch.softmax(affinity + self.beta.unsqueeze(0), dim=-1)
        return weights
```
### Pattern 2: Precision-Aware Routing
```python
import torch

class PrecisionWeightedGate:
    def __init__(self, num_experts, ema_decay=0.99, eps=1e-6):
        self.error_var = torch.ones(num_experts)  # per-expert error variance
        self.ema_decay = ema_decay
        self.eps = eps

    def update_precision(self, expert_id, prediction_error):
        # EMA of squared prediction error tracks each expert's recent reliability
        self.error_var[expert_id] = (
            self.ema_decay * self.error_var[expert_id]
            + (1 - self.ema_decay) * prediction_error ** 2
        )

    def get_precision(self):
        # Precision = inverse variance; eps floors the variance for stability
        return 1.0 / (self.error_var + self.eps)
```
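### Pattern 3: Anticipatory Routing
A sketch of the third mechanism for completeness, assuming a small d → H → d MLP predictor and a fixed blend coefficient; the module name and interface are illustrative, not from the paper:
```python
import torch
import torch.nn as nn

class AnticipatoryGate(nn.Module):
    def __init__(self, num_experts, dim, hidden=128, blend=0.5):
        super().__init__()
        self.expert_prototypes = nn.Parameter(torch.randn(num_experts, dim))
        self.predictor = nn.Sequential(          # lightweight next-state MLP
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.blend = blend                       # λ in the blending equation

    def forward(self, token_embeddings, beta_accumulated):
        # Reactive signal: affinity of current tokens to expert prototypes
        affinity_w = torch.softmax(
            token_embeddings @ self.expert_prototypes.T, dim=-1)
        # Predictive signal: score experts against the predicted next state
        next_hidden = self.predictor(beta_accumulated)
        anticipatory_w = torch.softmax(
            next_hidden @ self.expert_prototypes.T, dim=-1)
        return self.blend * affinity_w + (1 - self.blend) * anticipatory_w
```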
## Key Takeaways
1. **Standard MoE routing fails catastrophically at domain transitions** — correct expert gets <1% probability
2. **Three lightweight modifications achieve 124x improvement** with minimal overhead
3. **The Free Energy Principle provides the theoretical foundation** — routing as variational inference
4. **β (LIF memory) ≈ Bayesian posterior** — integrate evidence over time
5. **Π (precision) ≈ reliability weighting** — trust accurate experts more
6. **Anticipatory routing ≈ active inference** — predict and pre-allocate
7. **All three are necessary** — each contributes independently and synergistically
## Related Skills
- **moe-optimal-transport-routing**: OT-based balanced MoE routing (complementary approach)
- **hierarchical-moe-detection**: Hierarchical MoE for object detection
- **momenta-multimodal-moe-misinformation-detection**: Multimodal MoE
## Pitfalls
- **β decay too high:** Router becomes too inert, slow to adapt to genuine domain changes
- **β decay too low:** Loses temporal memory benefit, behaves like standard routing
- **Π without proper EMA smoothing:** Noisy precision estimates cause routing instability
- **Anticipatory predictor too large:** Defeats the "lightweight" advantage, adds significant overhead
- **Missing any component:** The three mechanisms are synergistic — removing any one degrades performance significantly