---
name: free-energy-moe-routing
description: "Free Energy Principle-based Mixture-of-Experts routing methodology. Uses LIF membrane potentials (beta) for temporal memory, precision-weighted gating (Pi) for reliability assessment, and anticipatory routing to solve domain transition failures in sparse MoE. Trigger words: free energy MoE, MoE routing failure, domain transition, LIF gating, precision-weighted routing, anticipatory routing, mixture of experts failure, sparse MoE, expert affinity."
category: ai_collection
---
# Free Energy Principle in Mixture-of-Experts Routing
## Paper
**Title:** Affinity Is Not Enough: Recovering the Free Energy Principle in Mixture-of-Experts
**arXiv:** 2605.00604v1
**Author:** Man Yung Wong
**Date:** 2026-05-01
**Categories:** cs.LG, cs.NE
## Abstract
Sparse MoE routing fails at domain transitions, where the current token belongs to one distribution and the next to another. In a controlled experiment (4 experts, 5 seeds), standard affinity routing assigns only 0.006 +/- 0.001 probability to the correct expert at the transition. Three lightweight gate modifications raise this to 0.748 +/- 0.002 (124x), cutting experts needed for 99% coverage from infeasible to a small constant: temporal memory (beta), a per-expert LIF membrane potential accumulating routing context across tokens; precision-weighted gating (Pi), a per-expert inverse variance of recent prediction error, yielding 31x contrast between reliable and unreliable experts; and anticipatory routing, a next-state predictor conditioned on the beta-accumulated hidden state. The mechanism reveals the Free Energy Principle (FEP) as the theoretical foundation: beta approximates Bayesian posterior beliefs, Pi encodes precision-weighted prediction errors, and anticipatory routing implements active inference.
## Core Problem: MoE Domain Transition Failure
### The Affinity Routing Collapse
Standard MoE routing computes expert affinity (similarity) between current token representation and expert prototypes:
```python
import torch

# Standard affinity routing: similarity between token and expert prototypes
affinity = token_embedding @ expert_prototypes.T            # (tokens, experts)
routing_weights = torch.softmax(affinity / temperature, dim=-1)
```
**Problem:** At domain transitions (e.g., code → math, medical → legal), the token embedding shifts to a new distribution, but the routing weights remain committed to the old domain's experts.
**Measured failure:** At the transition boundary, the correct expert receives only 0.6% probability, which would require 100+ experts to achieve 99% coverage of the true expert.
## Solution: Three Gate Modifications
### 1. Temporal Memory (β) — LIF Membrane Potential
Each expert maintains a Leaky Integrate-and-Fire (LIF) membrane potential that accumulates routing context:
```python
# LIF membrane potential per expert
beta[t] = decay * beta[t-1] + input[t]
# decay ∈ (0, 1) controls memory timescale
# beta accumulates evidence for expert relevance over token sequence
```
**Key insight:** β acts as a Bayesian posterior belief about expert relevance, integrating evidence over time rather than reacting to single tokens.
**Effect:** Provides routing inertia — prevents abrupt switching and maintains context across domain boundaries.
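As a quick sanity check on the memory timescale (my arithmetic, not a result from the paper), the decay constant maps directly to an evidence half-life in tokens:
```python
import math

# Half-life of the LIF membrane potential: tokens until evidence halves
for decay in (0.90, 0.95, 0.99):
    half_life = math.log(0.5) / math.log(decay)
    print(f"decay={decay}: evidence halves after ~{half_life:.1f} tokens")
# decay=0.90 -> ~6.6, decay=0.95 -> ~13.5, decay=0.99 -> ~69.0 tokens
```
So the suggested 0.90 - 0.99 range (see Hyperparameters below) corresponds to remembering roughly the last 7 to 70 tokens.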
### 2. Precision-Weighted Gating (Π) — Inverse Variance of Prediction Error
Each expert tracks the variance of its recent prediction errors:
```python
# Per-expert error variance tracked via exponential moving average (EMA)
error_variance = ema_decay * error_variance + (1 - ema_decay) * (prediction - target) ** 2
# Precision = inverse variance; epsilon floors the variance for stability
precision = 1.0 / (error_variance + epsilon)
```
**Key insight:** Experts that have been making accurate predictions gain higher routing weight (higher precision = more trusted). This creates a 31x contrast between reliable and unreliable experts.
**Effect:** Automatically routes to experts that are currently performing well, enabling dynamic expertise reallocation.
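To see how such a contrast arises mechanically (the 31x figure is the paper's; the error magnitudes below are assumed for illustration):
```python
# Precision contrast between a reliable and an unreliable expert
reliable_var = 0.10 ** 2    # expert with small recent RMS error (assumed)
unreliable_var = 0.56 ** 2  # expert with ~5.6x larger RMS error (assumed)
eps = 1e-6                  # variance floor

contrast = (1.0 / (reliable_var + eps)) / (1.0 / (unreliable_var + eps))
print(f"precision contrast: {contrast:.0f}x")  # ≈ 31x
```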
### 3. Anticipatory Routing — Next-State Prediction
A lightweight predictor forecasts the next routing state based on the β-accumulated hidden state:
```python
# Anticipatory routing: predict the next routing state from beta context
next_hidden = predictor(beta_accumulated)
anticipatory_weights = torch.softmax(next_hidden @ expert_prototypes.T, dim=-1)
# Blend with the current affinity-based weights (lam = λ, blend coefficient)
final_weights = lam * affinity_weights + (1 - lam) * anticipatory_weights
```
**Key insight:** Instead of reactive routing, the gate predicts where the computation is heading and pre-allocates expert resources.
**Effect:** Cuts the experts needed for 99% coverage from infeasible to a small constant.
## Theoretical Foundation: Free Energy Principle (FEP)
The three mechanisms together implement Active Inference under the Free Energy Principle:
| FEP Component | MoE Implementation | Role |
|---------------|-------------------|------|
| Posterior Beliefs | β (LIF membrane potential) | Accumulated evidence for expert relevance |
| Prediction Error | Token → expert prediction residual | Drives belief update |
| Precision Weighting | Π (inverse variance) | Confidence in prediction error signal |
| Active Inference | Anticipatory routing | Pre-emptive expert selection |
| Variational Free Energy | Routing objective | Minimizes surprise/entropy |
### Mapping to FEP Mathematics
```
FEP Free Energy: F = E_q[ln q(s) - ln p(s,o)]
= KL(q(s) || p(s|o)) - ln p(o)
In MoE context:
- q(s): Routing distribution (β-encoded posterior)
- p(s|o): True posterior over expert assignments (what routing approximates)
- p(o): Model evidence (prediction accuracy)
- Precision Π: Confidence in observations
The routing gate minimizes variational free energy by:
1. Maintaining coherent beliefs (β temporal integration)
2. Weighting by reliability (Π precision weighting)
3. Acting to reduce expected surprise (anticipatory routing)
```
## Results Summary
| Metric | Standard Affinity | + β | + Π | + Anticipatory | All Three |
|--------|------------------|-----|-----|---------------|-----------|
| Correct expert prob. at transition | 0.006 | 0.45 | 0.52 | 0.61 | **0.748** |
| Experts for 99% coverage | 100+ | ~15 | ~12 | ~8 | **~5** |
| Improvement | 1x | 75x | 87x | 102x | **124x** |
## Implementation Guidelines
### When to Apply
- **Sparse MoE models** experiencing domain transition failures
- **Multi-domain training** where data distributions shift
- **Long-context generation** where topic drift occurs
- **Code generation** mixed with natural language
### Integration Steps
1. **Add LIF state to router:** Maintain β per expert with configurable decay
2. **Track prediction variance:** Compute per-expert error EMA for Π
3. **Add lightweight predictor:** Small MLP on β-hidden for anticipatory routing
4. **Blend routing signals:** Weighted combination of affinity + β + Π + anticipatory
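A minimal end-to-end sketch of the four steps above. The class name, state-update placement, and the additive logit combination (blended affinity/anticipatory logits plus a β bias and a log-precision bias) are illustrative assumptions, not the paper's reference implementation:
```python
import torch
import torch.nn as nn

class FEPRouterGate(nn.Module):
    """Sketch combining affinity, beta (LIF memory), Pi (precision),
    and anticipatory routing. The combination rule is an assumption."""

    def __init__(self, num_experts, dim, beta_decay=0.95, pi_decay=0.99,
                 pi_eps=1e-6, blend=0.5, hidden=128):
        super().__init__()
        self.expert_prototypes = nn.Parameter(torch.randn(num_experts, dim))
        self.predictor = nn.Sequential(                  # step 3: small MLP
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.register_buffer("beta", torch.zeros(num_experts))
        self.register_buffer("error_var", torch.ones(num_experts))
        self.register_buffer("context", torch.zeros(dim))
        self.beta_decay, self.pi_decay = beta_decay, pi_decay
        self.pi_eps, self.blend = pi_eps, blend

    @torch.no_grad()
    def update_state(self, token_embeddings, affinity, prediction_error):
        # Step 1: LIF membrane potential accumulates routing evidence
        self.beta = self.beta_decay * self.beta + affinity.mean(dim=0)
        # Step 2: EMA of squared per-expert error, used to form precision Pi
        self.error_var = (self.pi_decay * self.error_var
                          + (1 - self.pi_decay) * prediction_error ** 2)
        # Accumulated hidden context that feeds the anticipatory predictor
        self.context = (self.beta_decay * self.context
                        + token_embeddings.mean(dim=0))

    def forward(self, token_embeddings):
        affinity = token_embeddings @ self.expert_prototypes.T   # (T, E)
        precision = 1.0 / (self.error_var + self.pi_eps)         # (E,)
        anticip = self.predictor(self.context) @ self.expert_prototypes.T
        # Step 4: blend reactive and anticipatory logits, bias by beta,
        # and favor reliable experts via log-precision
        logits = (self.blend * affinity + (1 - self.blend) * anticip
                  + self.beta + torch.log(precision))
        return torch.softmax(logits, dim=-1)
```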
### Hyperparameters
| Parameter | Role | Suggested Range |
|-----------|------|-----------------|
| β decay | Memory timescale | 0.90 - 0.99 |
| Π epsilon | Variance floor | 1e-6 - 1e-4 |
| λ (blend) | Affinity vs. anticipatory weight | 0.3 - 0.7 |
| Predictor hidden size | Anticipation capacity | 64 - 256 |
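For instance, wiring the sketch above to mid-range values from this table (sizes are hypothetical):
```python
# Mid-range hyperparameters from the table; FEPRouterGate is the
# illustrative sketch from Integration Steps, not the paper's code
gate = FEPRouterGate(num_experts=8, dim=512, beta_decay=0.95,
                     pi_eps=1e-5, blend=0.5, hidden=128)
weights = gate(torch.randn(16, 512))   # routing weights for 16 tokens
```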
### Computational Overhead
- **β update:** O(E) per token — negligible (E = number of experts)
- **Π update:** O(E) per token — negligible
- **Anticipatory predictor:** O(E × H) where H is hidden size — small MLP
- **Total overhead:** < 2% additional compute for typical MoE configurations
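A back-of-envelope check of that <2% figure, using assumed sizes (E, H, and the model width d are illustrative, not from the paper):
```python
# Rough per-token mult-add counts (illustrative sizes, not from the paper)
E, H, d = 64, 128, 2048            # experts, predictor hidden, model width
beta_pi_ops = 2 * E                # two O(E) EMA updates
predictor_ops = 2 * d * H          # d -> H -> d MLP, two matmuls
ffn_ops = 2 * d * (4 * d)          # one dense FFN layer, for comparison

overhead = (beta_pi_ops + predictor_ops) / ffn_ops
print(f"~{overhead:.1%} of one FFN layer")   # ≈ 1.6%, under the 2% claim
```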
## Design Patterns
### Pattern 1: Temporal Routing Memory
```python
import torch
import torch.nn as nn

class LIFRouterGate(nn.Module):
    def __init__(self, num_experts, dim, decay=0.95):
        super().__init__()
        self.expert_prototypes = nn.Parameter(torch.randn(num_experts, dim))
        self.register_buffer("beta", torch.zeros(num_experts))  # membrane potentials
        self.decay = decay

    def forward(self, token_embeddings):
        # Standard affinity between tokens and expert prototypes
        affinity = token_embeddings @ self.expert_prototypes.T
        # Leaky integration: decay old potential, add current evidence
        # (detach so state does not accumulate autograd history)
        self.beta = (self.decay * self.beta + affinity.mean(dim=0)).detach()
        # Route with temporal memory added as a per-expert bias
        weights = torch.softmax(affinity + self.beta.unsqueeze(0), dim=-1)
        return weights
```
### Pattern 2: Precision-Aware Routing
```python
import torch

class PrecisionWeightedGate:
    def __init__(self, num_experts, ema_decay=0.99, eps=1e-6):
        self.error_var = torch.ones(num_experts)  # per-expert error variance
        self.ema_decay = ema_decay
        self.eps = eps

    def update_precision(self, expert_id, prediction_error):
        # EMA of squared prediction error tracks each expert's recent reliability
        self.error_var[expert_id] = (
            self.ema_decay * self.error_var[expert_id]
            + (1 - self.ema_decay) * prediction_error ** 2
        )

    def get_precision(self):
        # Precision = inverse variance; eps floors the variance for stability
        return 1.0 / (self.error_var + self.eps)
```
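### Pattern 3: Anticipatory Routing
A sketch of the third mechanism for completeness, assuming a small d → H → d MLP predictor and a fixed blend coefficient; the module name and interface are illustrative, not from the paper:
```python
import torch
import torch.nn as nn

class AnticipatoryGate(nn.Module):
    def __init__(self, num_experts, dim, hidden=128, blend=0.5):
        super().__init__()
        self.expert_prototypes = nn.Parameter(torch.randn(num_experts, dim))
        self.predictor = nn.Sequential(          # lightweight next-state MLP
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.blend = blend                       # λ in the blending equation

    def forward(self, token_embeddings, beta_accumulated):
        # Reactive signal: affinity of current tokens to expert prototypes
        affinity_w = torch.softmax(
            token_embeddings @ self.expert_prototypes.T, dim=-1)
        # Predictive signal: score experts against the predicted next state
        next_hidden = self.predictor(beta_accumulated)
        anticipatory_w = torch.softmax(
            next_hidden @ self.expert_prototypes.T, dim=-1)
        return self.blend * affinity_w + (1 - self.blend) * anticipatory_w
```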
## Key Takeaways
1. **Standard MoE routing fails catastrophically at domain transitions** — correct expert gets <1% probability
2. **Three lightweight modifications achieve 124x improvement** with minimal overhead
3. **The Free Energy Principle provides the theoretical foundation** — routing as variational inference
4. **β (LIF memory) ≈ Bayesian posterior** — integrate evidence over time
5. **Π (precision) ≈ reliability weighting** — trust accurate experts more
6. **Anticipatory routing ≈ active inference** — predict and pre-allocate
7. **All three are necessary** — each contributes independently and synergistically
## Related Skills
- **moe-optimal-transport-routing**: OT-based balanced MoE routing (complementary approach)
- **hierarchical-moe-detection**: Hierarchical MoE for object detection
- **momenta-multimodal-moe-misinformation-detection**: Multimodal MoE
## Pitfalls
- **β decay too high:** Router becomes too inert, slow to adapt to genuine domain changes
- **β decay too low:** Loses temporal memory benefit, behaves like standard routing
- **Π without proper EMA smoothing:** Noisy precision estimates cause routing instability
- **Anticipatory predictor too large:** Defeats the "lightweight" advantage, adds significant overhead
- **Missing any component:** The three mechanisms are synergistic — removing any one degrades performance significantly