---
name: decentralized-stochastic-momentum-admm
description: "Decentralized Momentum Tracking with Biased Gradients (Biased-DMT) for large-scale distributed optimization. Handles communication compression and data heterogeneity in decentralized learning. Use for: decentralized optimization, federated learning, distributed ML, gradient compression, biased gradients. Activation: decentralized optimization, biased gradients, momentum tracking, distributed learning, federated learning, gradient compression."
---
# Decentralized Stochastic Momentum Tracking with Biased Gradients
Novel decentralized algorithm for large-scale optimization with biased gradient estimators, achieving robust convergence under data heterogeneity and communication compression.
## Overview
Decentralized stochastic optimization is fundamental for large-scale machine learning, but practical implementations often rely on biased gradient estimators from:
- Communication compression
- Inexact local oracles
- Quantization
These biases severely degrade convergence in the presence of data heterogeneity. **Biased-DMT** addresses this challenge through momentum tracking that decouples network topology effects from data heterogeneity.
## Key Innovations
### 1. Momentum Tracking Architecture
```
Traditional: Track raw stochastic gradients
Biased-DMT: Track momentum term instead
Benefit: Decouples variance reduction from network connectivity
```
### 2. Bias-Aware Convergence
- Handles both absolute and relative bias
- Characterizes convergence limits explicitly
- Eliminates structural heterogeneity error
### 3. Topology-Heterogeneity Decoupling
```
Traditional methods: Error depends on network topology AND data distribution
Biased-DMT: Separates topology effects from heterogeneity effects
Result: Robust performance in sparse networks
```
## Mathematical Framework
### Problem Formulation
```
Minimize: f(x) = (1/n) Σ f_i(x)
where:
- n: number of agents
- f_i: local objective at agent i
- x: shared optimization variable
Decentralized setting:
- Each agent has local data
- Agents communicate over graph
- No central coordinator
```
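A minimal sketch of this finite-sum structure, assuming synthetic quadratic local objectives f_i(x) = ½‖A_i x − b_i‖² with made-up local data per agent; the helper names are illustrative, not from the paper.
```python
import numpy as np

# Minimal sketch of the finite-sum objective, assuming synthetic quadratic local
# objectives f_i(x) = 0.5 * ||A_i x - b_i||^2 with made-up local data per agent.
rng = np.random.default_rng(0)
n_agents, dim = 8, 5
A = [rng.standard_normal((10, dim)) for _ in range(n_agents)]
b = [rng.standard_normal(10) for _ in range(n_agents)]

def local_objective(i, x):
    # f_i: only agent i can evaluate this, using its private (A_i, b_i)
    return 0.5 * np.sum((A[i] @ x - b[i]) ** 2)

def global_objective(x):
    # f(x) = (1/n) * sum_i f_i(x); never evaluated centrally in the decentralized setting
    return np.mean([local_objective(i, x) for i in range(n_agents)])

print(f"f(0) = {global_objective(np.zeros(dim)):.3f}")
```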
### Biased Gradient Model
```
g_i^t = ∇f_i(x_i^t) + b_i^t + ξ_i^t
where:
- g_i^t: biased stochastic gradient
- b_i^t: bias term (can be absolute or relative)
- ξ_i^t: zero-mean noise
Absolute bias: ||b_i|| ≤ δ
Relative bias: ||b_i|| ≤ ρ||∇f_i(x)||
```
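As an illustration of this model, the sketch below builds a biased gradient whose bias comes from top-k sparsification; the compressor choice, dimension, and noise level are assumptions made for the example, not the paper's setup. Top-k compression naturally satisfies a relative bias bound.
```python
import numpy as np

# Sketch of the biased gradient model with the bias produced by top-k sparsification;
# the dimension, k, and noise level are illustrative choices.
def top_k(v, k):
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

rng = np.random.default_rng(1)
true_grad = rng.standard_normal(100)         # stands in for ∇f_i(x_i^t)
compressed = top_k(true_grad, k=20)          # compression introduces the bias
bias = compressed - true_grad                # b_i^t
noise = 0.01 * rng.standard_normal(100)      # ξ_i^t, zero mean
g = true_grad + bias + noise                 # biased stochastic gradient g_i^t

# Top-k satisfies a relative bias bound ||b_i|| <= rho * ||∇f_i(x)|| with rho < 1
rho = np.linalg.norm(bias) / np.linalg.norm(true_grad)
print(f"relative bias factor rho ≈ {rho:.2f}")
```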
### Biased-DMT Algorithm
```
At each agent i, iteration t:
1. Momentum Tracking:
   m_i^t = β m_i^{t-1} + (1-β) [g_i^t - g_i^{t-1}]
2. Local Update:
   x̂_i^t = x_i^t - η [g_i^t + m_i^t]
3. Consensus:
   x_i^{t+1} = Σ_j W_ij x̂_j^t
where:
- η: step size
- β: momentum parameter
- W: doubly stochastic mixing matrix (graph weights)
- x̂_i^t: intermediate local iterate before mixing
```
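The following is a minimal end-to-end sketch of the three steps above on a ring of eight agents with synthetic quadratic objectives f_i(x) = ½‖x − c_i‖². The ring mixing weights, the additive-bias gradient oracle, and all constants are illustrative assumptions, not the paper's experimental setup.
```python
import numpy as np

rng = np.random.default_rng(0)
n, dim, T = 8, 4, 500
eta, beta = 0.05, 0.8

c = 3.0 * rng.standard_normal((n, dim))       # heterogeneous local minimizers
x = np.zeros((n, dim))                        # x_i^0
m = np.zeros((n, dim))                        # m_i^0
g_prev = np.zeros((n, dim))

# Symmetric, doubly stochastic ring mixing matrix W (self weight 1/2, neighbors 1/4)
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

def biased_grad(i, xi):
    grad = xi - c[i]                          # ∇f_i(x_i) for the quadratic f_i
    bias = 0.05 * np.sign(grad)               # small absolute bias, ||b_i|| <= δ
    noise = 0.01 * rng.standard_normal(dim)   # zero-mean noise ξ_i
    return grad + bias + noise

for t in range(T):
    g = np.stack([biased_grad(i, x[i]) for i in range(n)])
    m = beta * m + (1 - beta) * (g - g_prev)  # momentum tracking
    x_local = x - eta * (g + m)               # local update
    x = W @ x_local                           # consensus / mixing step
    g_prev = g

print("consensus error:", np.linalg.norm(x - x.mean(axis=0)))
print("distance to average minimizer:", np.linalg.norm(x.mean(axis=0) - c.mean(axis=0)))
```
The two printed quantities give a rough read on consensus quality and distance to the average minimizer for this toy problem.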
## Convergence Theory
### Nonconvex Setting
**Theorem**: Biased-DMT achieves convergence in nonconvex settings with linear speedup.
**Convergence Rate**:
```
(1/T) Σ E[||∇f(x^t)||^2] ≤ O(1/√(nT)) + bias_terms
where n = number of agents, T = iterations
```
**Linear Speedup**: As n increases, convergence improves proportionally.
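A small numeric illustration of the leading term of this rate, with the bias terms omitted; the constant and horizon below are arbitrary.
```python
import math

def rate_bound(n, T, c=1.0):
    # Leading term of the stated nonconvex rate, O(1/sqrt(nT)); bias terms omitted
    return c / math.sqrt(n * T)

for n in (1, 4, 16):
    print(f"n={n:2d}  bound ≈ {rate_bound(n, T=10_000):.4f}")
```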
### Bias Characterization
#### Absolute Bias Case
```
If ||b_i|| ≤ δ:
Converges to: exact physical error floor
Eliminates: structural heterogeneity error
Result: ||x - x*|| ≤ C · δ
```
#### Relative Bias Case
```
If ||b_i|| ≤ ρ||∇f_i(x)||:
Converges to: neighborhood of optimum
Remaining error: unavoidable physical consequence
Result: ||x - x*|| ≤ C · ρ/(1-ρ)
```
### Network Topology Effects
**Key Property**: Biased-DMT decouples topology from heterogeneity.
```
Traditional error: O(heterogeneity / (1-λ))
Biased-DMT error: O(heterogeneity) + O(topology)
where:
- heterogeneity: degree of disagreement among the local objectives f_i
- λ: second-largest eigenvalue magnitude of the mixing matrix W
- 1-λ: spectral gap (small for sparse, poorly connected graphs)
```
**Implication**: Works well even in sparse networks.
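To make the topology dependence concrete, the sketch below computes λ, the second-largest eigenvalue magnitude of an assumed ring mixing matrix, for growing network sizes; the shrinking spectral gap 1-λ is the factor that inflates the error of topology-coupled methods, which Biased-DMT avoids.
```python
import numpy as np

def ring_mixing_matrix(n):
    # Assumed Metropolis-style ring weights: 1/2 on self, 1/4 on each neighbor
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = 0.5
        W[i, (i - 1) % n] = 0.25
        W[i, (i + 1) % n] = 0.25
    return W

for n in (8, 32, 128):
    eigs = np.sort(np.abs(np.linalg.eigvals(ring_mixing_matrix(n))))
    lam = eigs[-2]  # second-largest eigenvalue magnitude
    print(f"n={n:4d}  lambda={lam:.4f}  spectral gap 1-lambda={1 - lam:.4f}")
```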
## Practical Benefits
### Robustness to Data Heterogeneity
| Scenario | Traditional DSGD | Biased-DMT |
|----------|-----------------|------------|
| IID data | Good | Good |
| Mild heterogeneity | Degraded | Good |
| High heterogeneity | Fails | Good |
### Communication Efficiency
Compatible with (minimal compressor sketches follow the list):
- **Quantization**: 1-bit, 2-bit gradients
- **Sparsification**: Top-k gradient compression
- **Sketching**: Count-sketch compression
- **Error compensation**: Memory-based methods
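Below are generic sketches of two such compressors plus a memory-based error-compensation wrapper; the operators and names are written for illustration, not an interface defined by the paper.
```python
import numpy as np

def sign_quantize(v):
    # 1-bit quantization that keeps the vector's average magnitude
    return np.sign(v) * np.mean(np.abs(v))

def top_k(v, k):
    # Keep only the k largest-magnitude coordinates
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

class ErrorFeedback:
    """Memory-based error compensation: re-inject what the compressor dropped last round."""
    def __init__(self, dim):
        self.memory = np.zeros(dim)

    def compress(self, grad, compressor):
        corrected = grad + self.memory
        sent = compressor(corrected)
        self.memory = corrected - sent  # remember the compression error for next round
        return sent

# Usage: error feedback around top-k compression of a random gradient
rng = np.random.default_rng(2)
ef = ErrorFeedback(dim=50)
sent = ef.compress(rng.standard_normal(50), lambda v: top_k(v, k=5))
```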
### Scalability
Tested on:
- Non-convex logistic regression
- Deep neural networks
- Large-scale federated learning
## Implementation
### Algorithm Pseudocode
```python
# Assumed module-level state and helpers (sketch): momentum_tracker and
# previous_gradient are dicts keyed by agent_id; compute_biased_gradient and
# consensus_average stand in for the local oracle and the W-weighted mixing step.
def biased_dmt_update(agent_id, local_model, neighbors, step_size, momentum):
    """
    Biased-DMT update for a single agent.

    Args:
        agent_id: ID of the current agent
        local_model: current local model parameters
        neighbors: list of neighboring agents
        step_size: learning rate η
        momentum: momentum parameter β

    Returns:
        Updated model parameters after the consensus step.
    """
    # Compute biased stochastic gradient g_i^t
    gradient = compute_biased_gradient(local_model, agent_id)

    # Update momentum tracker: m_i = β m_i + (1-β)(g_i^t - g_i^{t-1})
    momentum_tracker[agent_id] = (
        momentum * momentum_tracker[agent_id]
        + (1 - momentum) * (gradient - previous_gradient[agent_id])
    )

    # Local update with momentum
    local_update = local_model - step_size * (gradient + momentum_tracker[agent_id])

    # Consensus with neighbors (mixing-matrix weighted average)
    new_model = consensus_average(local_update, neighbors)

    # Store gradient for the next iteration
    previous_gradient[agent_id] = gradient
    return new_model
```
### Hyperparameter Selection
```python
def select_hyperparameters(n_agents, graph_connectivity, heterogeneity):
    """
    Heuristic selection of step size and momentum from problem characteristics.

    Guidelines:
    - Step size: η ≈ O(1/√T), or constant in the strongly convex case
    - Momentum: β ∈ [0.5, 0.9] typically works well
    - Larger heterogeneity → smaller step size
    - Better connectivity → larger step size
    """
    # graph_connectivity is taken to be the spectral gap 1-λ, in (0, 1]
    base_lr = 0.1
    lr = base_lr * graph_connectivity / (1 + heterogeneity)
    momentum = 0.7  # reasonable default within [0.5, 0.9]
    return lr, momentum
```
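A hedged usage sketch pairing the selector above with a diminishing η ≈ O(1/√T) schedule; the argument values and horizon are arbitrary illustrations, not tuned values from the paper.
```python
import math

def step_size_schedule(base_lr, t):
    # Diminishing schedule consistent with the η ≈ O(1/√T) guideline above
    return base_lr / math.sqrt(t + 1)

# Illustrative arguments: spectral gap 0.2, moderate heterogeneity
base_lr, momentum = select_hyperparameters(n_agents=16, graph_connectivity=0.2, heterogeneity=0.5)
for t in (0, 99, 9_999):
    print(f"t={t:5d}  eta={step_size_schedule(base_lr, t):.4f}")
```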
### Bias Type Detection
```python
def detect_bias_type(gradient_history):
    """
    Heuristic check of whether the bias is absolute or relative.

    Assumes compute_correlation estimates how strongly the bias magnitude
    tracks the gradient magnitude over the recorded history.

    Returns: 'absolute', 'relative', or 'mixed'
    """
    # Relative bias scales with ||∇f_i(x)||, so a strong correlation with the
    # gradient magnitude suggests relative bias; a weak one suggests an
    # absolute (magnitude-independent) bias.
    correlation = compute_correlation(gradient_history)
    if correlation > 0.8:
        return 'relative'
    elif correlation < 0.2:
        return 'absolute'
    else:
        return 'mixed'
```
## Applications
### Federated Learning
- **Cross-device FL**: Mobile/edge devices with limited bandwidth
- **Cross-silo FL**: Organizations with private data
- **Personalization**: Local adaptation with global consensus
### Distributed Training
- **Large models**: Model parallelism across GPUs
- **Geo-distributed**: Data centers in different locations
- **Heterogeneous hardware**: Different compute capabilities
### Sensor Networks
- **IoT devices**: Battery-constrained sensors
- **Swarm robotics**: Distributed robot teams
- **Smart grids**: Distributed energy management
## Comparison with Alternatives
| Method | Handles Bias | Topology Robust | Convergence |
|--------|-------------|-----------------|-------------|
| DSGD | No | No | O(1/√T) |
| D^2 | Partial | No | O(1/√T) |
| GT | No | Yes | O(1/T) |
| **Biased-DMT** | **Yes** | **Yes** | **O(1/√(nT))** |
## Theoretical Insights
### Why Momentum Tracking Works
1. **Variance Reduction**: Momentum averages out stochastic noise
2. **Bias Isolation**: Separates bias from variance in analysis
3. **Topology Decoupling**: Momentum state independent of graph structure
### Convergence Limit Characterization
```
Final error = Physical error (unavoidable) + Structural error (eliminated)
Physical error: From local noise and bias
Structural error: From network topology (eliminated by Biased-DMT)
```
## References
- **Paper**: "Improved Convergence for Decentralized Stochastic Optimization with Biased Gradients" by Xu et al. (arXiv:2604.08236v1, 2026)
- **Categories**: math.OC
## Related Skills
- **stochastic-momentum-tracking-push-pull**: For directed graph optimization
- **decentralized-optimization**: General decentralized optimization methods
- **federated-learning**: Federated learning specific techniques
## Activation Keywords
- decentralized optimization
- biased gradients
- momentum tracking
- distributed learning
- federated learning
- gradient compression
- data heterogeneity
- communication compression