SQLjeremylongshoreFree

anth-load-scale

Implement load testing, auto-scaling, and capacity planning for Claude API. Use when running performance benchmarks, planning for traffic spikes, or configuring horizontal scaling for Claude-powered services. Trigger with phrases like "anthropic load test", "claude scaling", "anthropic capacity planning", "scale claude api".

Repo bundle on Versuzjeremylongshore/claude-code-plugins-plus-skills1002 indexed entries (SKILL.md and CLAUDE.md) from this repository — open the full bundle view.

Open bundle →

View on GitHub ↗</>github.com/jeremylongshore/claude-code-plugins-plus-skills Yours? Claim it ↗

§ 01 — Stats

Stars2.2k

Prior1190

Quality—

Score—

Tasks—

§ 02 — Install

Get anth-load-scale.

Free SKILL.md scraped from GitHub. Clone the repo or copy the file directly into your Claude Code skills directory.

One-line install · Claude Code

npx versuz@latest install jeremylongshore-claude-code-plugins-plus-skills-plugins-saas-packs-anthropic-pack-skills-anth-load-scale

Or clone the repo

$git clone https://github.com/jeremylongshore/claude-code-plugins-plus-skills.git

Or copy the SKILL.md manually

More Versuz picks

★ Featured$0.99

vz-scrape-runner

Web

★ Featured$1.99

vz-bench-debug

Document

Got something better ?Submit your skill — it enters tomorrow's cycle. No fee.

Submit yours →

§ 05 — Challenge

Think you can beat it?

$npx versuz challenge jeremylongshore-claude-code-plugins-plus-skills-plugins-saas-packs-anthropic-pack-skills-anth-load-scale↵

Show SKILL.md content (~1.4k tokens)

---
name: anth-load-scale
description: 'Implement load testing, auto-scaling, and capacity planning for Claude
  API.

  Use when running performance benchmarks, planning for traffic spikes,

  or configuring horizontal scaling for Claude-powered services.

  Trigger with phrases like "anthropic load test", "claude scaling",

  "anthropic capacity planning", "scale claude api".

  '
allowed-tools: Read, Write, Edit, Bash(npm:*), Grep
version: 1.0.0
license: MIT
author: Jeremy Longshore <jeremy@intentsolutions.io>
tags:
- saas
- ai
- anthropic
compatibility: Designed for Claude Code
---
# Anthropic Load & Scale

## Overview

Capacity planning and load testing for Claude API integrations. Key constraint: your rate limits (RPM/ITPM/OTPM) are the ceiling, not your infrastructure.

## Capacity Planning

```python
# Calculate required tier based on traffic
def plan_capacity(
    requests_per_minute: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    model: str = "claude-sonnet-4-20250514"
) -> dict:
    itpm = requests_per_minute * avg_input_tokens
    otpm = requests_per_minute * avg_output_tokens

    # Estimate monthly cost
    pricing = {
        "claude-haiku-4-20250514": (0.80, 4.00),
        "claude-sonnet-4-20250514": (3.00, 15.00),
        "claude-opus-4-20250514": (15.00, 75.00),
    }
    rates = pricing[model]
    cost_per_request = (avg_input_tokens * rates[0] + avg_output_tokens * rates[1]) / 1_000_000
    monthly_cost = cost_per_request * requests_per_minute * 60 * 24 * 30

    return {
        "rpm_needed": requests_per_minute,
        "itpm_needed": itpm,
        "otpm_needed": otpm,
        "cost_per_request": f"${cost_per_request:.4f}",
        "monthly_estimate": f"${monthly_cost:,.0f}",
        "recommendation": "Contact Anthropic sales for Scale tier" if requests_per_minute > 500 else "Self-serve tiers sufficient",
    }

print(plan_capacity(100, 500, 200))
```

## Load Testing Script

```python
import anthropic
import asyncio
import time
from dataclasses import dataclass

@dataclass
class LoadTestResult:
    total_requests: int = 0
    successful: int = 0
    failed: int = 0
    rate_limited: int = 0
    avg_latency_ms: float = 0
    p99_latency_ms: float = 0
    total_input_tokens: int = 0
    total_output_tokens: int = 0

async def load_test(
    concurrency: int = 10,
    total_requests: int = 100,
    model: str = "claude-haiku-4-20250514"
) -> LoadTestResult:
    client = anthropic.Anthropic()
    result = LoadTestResult()
    latencies = []
    semaphore = asyncio.Semaphore(concurrency)

    async def single_request():
        async with semaphore:
            start = time.monotonic()
            try:
                msg = client.messages.create(
                    model=model,
                    max_tokens=64,
                    messages=[{"role": "user", "content": "Respond with exactly: OK"}]
                )
                duration = (time.monotonic() - start) * 1000
                latencies.append(duration)
                result.successful += 1
                result.total_input_tokens += msg.usage.input_tokens
                result.total_output_tokens += msg.usage.output_tokens
            except anthropic.RateLimitError:
                result.rate_limited += 1
            except Exception:
                result.failed += 1
            result.total_requests += 1

    tasks = [single_request() for _ in range(total_requests)]
    await asyncio.gather(*tasks)

    if latencies:
        latencies.sort()
        result.avg_latency_ms = sum(latencies) / len(latencies)
        result.p99_latency_ms = latencies[int(len(latencies) * 0.99)]

    return result

# Run: asyncio.run(load_test(concurrency=10, total_requests=50))
```

## Scaling Strategies

| Strategy | When | Implementation |
|----------|------|---------------|
| Queue-based processing | > 50 RPM sustained | Redis/SQS queue + worker pool |
| Model routing | Mixed workloads | Haiku for simple, Sonnet for complex |
| Message Batches | Offline processing | 100K requests, 50% cheaper, no RPM impact |
| Prompt caching | Repeated system prompts | 90% input token savings |
| Request coalescing | Duplicate prompts | Cache identical request hashes |

## Horizontal Scaling Pattern

```python
# Multiple application instances sharing the same API key
# Rate limits are per-organization, NOT per-instance
# Use a shared rate limiter (Redis) to coordinate

import redis

r = redis.Redis()

def check_rate_limit(key: str = "claude:rpm", limit: int = 100, window: int = 60) -> bool:
    current = r.incr(key)
    if current == 1:
        r.expire(key, window)
    return current <= limit
```

## Error Handling

| Issue | Cause | Fix |
|-------|-------|-----|
| 429 during load test | Exceeded tier limits | Reduce concurrency or upgrade tier |
| Increasing latency under load | Output queue saturation | Reduce max_tokens |
| Uneven request distribution | No load balancing | Use queue for fair distribution |

## Resources

- [Rate Limits](https://docs.anthropic.com/en/api/rate-limits)
- [Service Tiers](https://docs.anthropic.com/en/api/service-tiers)

## Next Steps

For reliability patterns, see `anth-reliability-patterns`.