---
name: decentralized-stochastic-momentum-admm
description: "Decentralized Momentum Tracking with Biased Gradients (Biased-DMT) for large-scale distributed optimization. Handles communication compression and data heterogeneity in decentralized learning. Use for: decentralized optimization, federated learning, distributed ML, gradient compression, biased gradients. Activation: decentralized optimization, biased gradients, momentum tracking, distributed learning, federated learning, gradient compression."
---
# Decentralized Stochastic Momentum Tracking with Biased Gradients
Novel decentralized algorithm for large-scale optimization with biased gradient estimators, achieving robust convergence under data heterogeneity and communication compression.
## Overview
Decentralized stochastic optimization is fundamental for large-scale machine learning, but practical implementations often rely on biased gradient estimators from:
- Communication compression
- Inexact local oracles
- Quantization
These biases severely degrade convergence in the presence of data heterogeneity. **Biased-DMT** addresses this challenge through momentum tracking that decouples network topology effects from data heterogeneity.
## Key Innovations
### 1. Momentum Tracking Architecture
```
Traditional: Track raw stochastic gradients
Biased-DMT: Track momentum term instead
Benefit: Decouples variance reduction from network connectivity
```
### 2. Bias-Aware Convergence
- Handles both absolute and relative bias
- Characterizes convergence limits explicitly
- Eliminates structural heterogeneity error
### 3. Topology-Heterogeneity Decoupling
```
Traditional methods: Error depends on network topology AND data distribution
Biased-DMT: Separates topology effects from heterogeneity effects
Result: Robust performance in sparse networks
```
## Mathematical Framework
### Problem Formulation
```
Minimize: f(x) = (1/n) Σ f_i(x)
where:
- n: number of agents
- f_i: local objective at agent i
- x: shared optimization variable
Decentralized setting:
- Each agent has local data
- Agents communicate over graph
- No central coordinator
```
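A minimal sketch of this finite-sum structure, assuming synthetic quadratic local objectives f_i(x) = ½‖A_i x − b_i‖² with made-up local data per agent; the helper names are illustrative, not from the paper.
```python
import numpy as np

# Minimal sketch of the finite-sum objective, assuming synthetic quadratic local
# objectives f_i(x) = 0.5 * ||A_i x - b_i||^2 with made-up local data per agent.
rng = np.random.default_rng(0)
n_agents, dim = 8, 5
A = [rng.standard_normal((10, dim)) for _ in range(n_agents)]
b = [rng.standard_normal(10) for _ in range(n_agents)]

def local_objective(i, x):
    # f_i: only agent i can evaluate this, using its private (A_i, b_i)
    return 0.5 * np.sum((A[i] @ x - b[i]) ** 2)

def global_objective(x):
    # f(x) = (1/n) * sum_i f_i(x); never evaluated centrally in the decentralized setting
    return np.mean([local_objective(i, x) for i in range(n_agents)])

print(f"f(0) = {global_objective(np.zeros(dim)):.3f}")
```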
### Biased Gradient Model
```
g_i^t = ∇f_i(x_i^t) + b_i^t + ξ_i^t
where:
- g_i^t: biased stochastic gradient
- b_i^t: bias term (can be absolute or relative)
- ξ_i^t: zero-mean noise
Absolute bias: ||b_i|| ≤ δ
Relative bias: ||b_i|| ≤ ρ||∇f_i(x)||
```
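As an illustration of this model, the sketch below builds a biased gradient whose bias comes from top-k sparsification; the compressor choice, dimension, and noise level are assumptions made for the example, not the paper's setup. Top-k compression naturally satisfies a relative bias bound.
```python
import numpy as np

# Sketch of the biased gradient model with the bias produced by top-k sparsification;
# the dimension, k, and noise level are illustrative choices.
def top_k(v, k):
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

rng = np.random.default_rng(1)
true_grad = rng.standard_normal(100)         # stands in for ∇f_i(x_i^t)
compressed = top_k(true_grad, k=20)          # compression introduces the bias
bias = compressed - true_grad                # b_i^t
noise = 0.01 * rng.standard_normal(100)      # ξ_i^t, zero mean
g = true_grad + bias + noise                 # biased stochastic gradient g_i^t

# Top-k satisfies a relative bias bound ||b_i|| <= rho * ||∇f_i(x)|| with rho < 1
rho = np.linalg.norm(bias) / np.linalg.norm(true_grad)
print(f"relative bias factor rho ≈ {rho:.2f}")
```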
### Biased-DMT Algorithm
```
At each agent i, iteration t:
1. Momentum Tracking:
   m_i^t = β m_i^{t-1} + (1-β) [g_i^t - g_i^{t-1}]
2. Local Update:
   x̂_i^t = x_i^t - η [g_i^t + m_i^t]
3. Consensus:
   x_i^{t+1} = Σ_j W_ij x̂_j^t
where:
- η: step size
- β: momentum parameter
- W: doubly stochastic mixing matrix (graph weights)
- x̂_i^t: intermediate local iterate before mixing
```
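The following is a minimal end-to-end sketch of the three steps above on a ring of eight agents with synthetic quadratic objectives f_i(x) = ½‖x − c_i‖². The ring mixing weights, the additive-bias gradient oracle, and all constants are illustrative assumptions, not the paper's experimental setup.
```python
import numpy as np

rng = np.random.default_rng(0)
n, dim, T = 8, 4, 500
eta, beta = 0.05, 0.8

c = 3.0 * rng.standard_normal((n, dim))       # heterogeneous local minimizers
x = np.zeros((n, dim))                        # x_i^0
m = np.zeros((n, dim))                        # m_i^0
g_prev = np.zeros((n, dim))

# Symmetric, doubly stochastic ring mixing matrix W (self weight 1/2, neighbors 1/4)
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

def biased_grad(i, xi):
    grad = xi - c[i]                          # ∇f_i(x_i) for the quadratic f_i
    bias = 0.05 * np.sign(grad)               # small absolute bias, ||b_i|| <= δ
    noise = 0.01 * rng.standard_normal(dim)   # zero-mean noise ξ_i
    return grad + bias + noise

for t in range(T):
    g = np.stack([biased_grad(i, x[i]) for i in range(n)])
    m = beta * m + (1 - beta) * (g - g_prev)  # momentum tracking
    x_local = x - eta * (g + m)               # local update
    x = W @ x_local                           # consensus / mixing step
    g_prev = g

print("consensus error:", np.linalg.norm(x - x.mean(axis=0)))
print("distance to average minimizer:", np.linalg.norm(x.mean(axis=0) - c.mean(axis=0)))
```
The two printed quantities give a rough read on consensus quality and distance to the average minimizer for this toy problem.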
## Convergence Theory
### Nonconvex Setting
**Theorem**: Biased-DMT achieves convergence in nonconvex settings with linear speedup.
**Convergence Rate**:
```
(1/T) Σ E[||∇f(x^t)||^2] ≤ O(1/√(nT)) + bias_terms
where n = number of agents, T = iterations
```
**Linear Speedup**: As n increases, convergence improves proportionally.
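A small numeric illustration of the leading term of this rate, with the bias terms omitted; the constant and horizon below are arbitrary.
```python
import math

def rate_bound(n, T, c=1.0):
    # Leading term of the stated nonconvex rate, O(1/sqrt(nT)); bias terms omitted
    return c / math.sqrt(n * T)

for n in (1, 4, 16):
    print(f"n={n:2d}  bound ≈ {rate_bound(n, T=10_000):.4f}")
```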
### Bias Characterization
#### Absolute Bias Case
```
If ||b_i|| ≤ δ:
Converges to: exact physical error floor
Eliminates: structural heterogeneity error
Result: ||x - x*|| ≤ C · δ
```
#### Relative Bias Case
```
If ||b_i|| ≤ ρ||∇f_i(x)||:
Converges to: neighborhood of optimum
Remaining error: unavoidable physical consequence
Result: ||x - x*|| ≤ C · ρ/(1-ρ)
```
### Network Topology Effects
**Key Property**: Biased-DMT decouples topology from heterogeneity.
```
Traditional error: O(heterogeneity / (1-λ))
Biased-DMT error: O(heterogeneity) + O(topology)
where:
- heterogeneity: degree of disagreement among the local objectives f_i
- λ: second-largest eigenvalue magnitude of the mixing matrix W
- 1-λ: spectral gap (small for sparse, poorly connected graphs)
```
**Implication**: Works well even in sparse networks.
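To make the topology dependence concrete, the sketch below computes λ, the second-largest eigenvalue magnitude of an assumed ring mixing matrix, for growing network sizes; the shrinking spectral gap 1-λ is the factor that inflates the error of topology-coupled methods, which Biased-DMT avoids.
```python
import numpy as np

def ring_mixing_matrix(n):
    # Assumed Metropolis-style ring weights: 1/2 on self, 1/4 on each neighbor
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = 0.5
        W[i, (i - 1) % n] = 0.25
        W[i, (i + 1) % n] = 0.25
    return W

for n in (8, 32, 128):
    eigs = np.sort(np.abs(np.linalg.eigvals(ring_mixing_matrix(n))))
    lam = eigs[-2]  # second-largest eigenvalue magnitude
    print(f"n={n:4d}  lambda={lam:.4f}  spectral gap 1-lambda={1 - lam:.4f}")
```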
## Practical Benefits
### Robustness to Data Heterogeneity
| Scenario | Traditional DSGD | Biased-DMT |
|----------|-----------------|------------|
| IID data | Good | Good |
| Mild heterogeneity | Degraded | Good |
| High heterogeneity | Fails | Good |
### Communication Efficiency
Compatible with (minimal compressor sketches follow the list):
- **Quantization**: 1-bit, 2-bit gradients
- **Sparsification**: Top-k gradient compression
- **Sketching**: Count-sketch compression
- **Error compensation**: Memory-based methods
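Below are generic sketches of two such compressors plus a memory-based error-compensation wrapper; the operators and names are written for illustration, not an interface defined by the paper.
```python
import numpy as np

def sign_quantize(v):
    # 1-bit quantization that keeps the vector's average magnitude
    return np.sign(v) * np.mean(np.abs(v))

def top_k(v, k):
    # Keep only the k largest-magnitude coordinates
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

class ErrorFeedback:
    """Memory-based error compensation: re-inject what the compressor dropped last round."""
    def __init__(self, dim):
        self.memory = np.zeros(dim)

    def compress(self, grad, compressor):
        corrected = grad + self.memory
        sent = compressor(corrected)
        self.memory = corrected - sent  # remember the compression error for next round
        return sent

# Usage: error feedback around top-k compression of a random gradient
rng = np.random.default_rng(2)
ef = ErrorFeedback(dim=50)
sent = ef.compress(rng.standard_normal(50), lambda v: top_k(v, k=5))
```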
### Scalability
Tested on:
- Non-convex logistic regression
- Deep neural networks
- Large-scale federated learning
## Implementation
### Algorithm Pseudocode
```python
# Assumed module-level state and helpers (sketch): momentum_tracker and
# previous_gradient are dicts keyed by agent_id; compute_biased_gradient and
# consensus_average stand in for the local oracle and the W-weighted mixing step.
def biased_dmt_update(agent_id, local_model, neighbors, step_size, momentum):
    """
    Biased-DMT update for a single agent.

    Args:
        agent_id: ID of the current agent
        local_model: current local model parameters
        neighbors: list of neighboring agents
        step_size: learning rate η
        momentum: momentum parameter β

    Returns:
        Updated model parameters after the consensus step.
    """
    # Compute biased stochastic gradient g_i^t
    gradient = compute_biased_gradient(local_model, agent_id)

    # Update momentum tracker: m_i = β m_i + (1-β)(g_i^t - g_i^{t-1})
    momentum_tracker[agent_id] = (
        momentum * momentum_tracker[agent_id]
        + (1 - momentum) * (gradient - previous_gradient[agent_id])
    )

    # Local update with momentum
    local_update = local_model - step_size * (gradient + momentum_tracker[agent_id])

    # Consensus with neighbors (mixing-matrix weighted average)
    new_model = consensus_average(local_update, neighbors)

    # Store gradient for the next iteration
    previous_gradient[agent_id] = gradient
    return new_model
```
### Hyperparameter Selection
```python
def select_hyperparameters(n_agents, graph_connectivity, heterogeneity):
    """
    Heuristic selection of step size and momentum from problem characteristics.

    Guidelines:
    - Step size: η ≈ O(1/√T), or constant in the strongly convex case
    - Momentum: β ∈ [0.5, 0.9] typically works well
    - Larger heterogeneity → smaller step size
    - Better connectivity → larger step size
    """
    # graph_connectivity is taken to be the spectral gap 1-λ, in (0, 1]
    base_lr = 0.1
    lr = base_lr * graph_connectivity / (1 + heterogeneity)
    momentum = 0.7  # reasonable default within [0.5, 0.9]
    return lr, momentum
```
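A hedged usage sketch pairing the selector above with a diminishing η ≈ O(1/√T) schedule; the argument values and horizon are arbitrary illustrations, not tuned values from the paper.
```python
import math

def step_size_schedule(base_lr, t):
    # Diminishing schedule consistent with the η ≈ O(1/√T) guideline above
    return base_lr / math.sqrt(t + 1)

# Illustrative arguments: spectral gap 0.2, moderate heterogeneity
base_lr, momentum = select_hyperparameters(n_agents=16, graph_connectivity=0.2, heterogeneity=0.5)
for t in (0, 99, 9_999):
    print(f"t={t:5d}  eta={step_size_schedule(base_lr, t):.4f}")
```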
### Bias Type Detection
```python
def detect_bias_type(gradient_history):
    """
    Heuristic check of whether the bias is absolute or relative.

    Assumes compute_correlation estimates how strongly the bias magnitude
    tracks the gradient magnitude over the recorded history.

    Returns: 'absolute', 'relative', or 'mixed'
    """
    # Relative bias scales with ||∇f_i(x)||, so a strong correlation with the
    # gradient magnitude suggests relative bias; a weak one suggests an
    # absolute (magnitude-independent) bias.
    correlation = compute_correlation(gradient_history)
    if correlation > 0.8:
        return 'relative'
    elif correlation < 0.2:
        return 'absolute'
    else:
        return 'mixed'
```
## Applications
### Federated Learning
- **Cross-device FL**: Mobile/edge devices with limited bandwidth
- **Cross-silo FL**: Organizations with private data
- **Personalization**: Local adaptation with global consensus
### Distributed Training
- **Large models**: Model parallelism across GPUs
- **Geo-distributed**: Data centers in different locations
- **Heterogeneous hardware**: Different compute capabilities
### Sensor Networks
- **IoT devices**: Battery-constrained sensors
- **Swarm robotics**: Distributed robot teams
- **Smart grids**: Distributed energy management
## Comparison with Alternatives
| Method | Handles Bias | Topology Robust | Convergence |
|--------|-------------|-----------------|-------------|
| DSGD | No | No | O(1/√T) |
| D^2 | Partial | No | O(1/√T) |
| GT | No | Yes | O(1/T) |
| **Biased-DMT** | **Yes** | **Yes** | **O(1/√(nT))** |
## Theoretical Insights
### Why Momentum Tracking Works
1. **Variance Reduction**: Momentum averages out stochastic noise
2. **Bias Isolation**: Separates bias from variance in analysis
3. **Topology Decoupling**: Momentum state independent of graph structure
### Convergence Limit Characterization
```
Final error = Physical error (unavoidable) + Structural error (eliminated)
Physical error: From local noise and bias
Structural error: From network topology (eliminated by Biased-DMT)
```
## References
- **Paper**: "Improved Convergence for Decentralized Stochastic Optimization with Biased Gradients" by Xu et al. (arXiv:2604.08236v1, 2026)
- **Categories**: math.OC
## Related Skills
- **stochastic-momentum-tracking-push-pull**: For directed graph optimization
- **decentralized-optimization**: General decentralized optimization methods
- **federated-learning**: Federated learning specific techniques
## Activation Keywords
- decentralized optimization
- biased gradients
- momentum tracking
- distributed learning
- federated learning
- gradient compression
- data heterogeneity
- communication compression