---
name: gadriel-token-cost-estimator
description: Token-cost math, per-call pricing, ceiling enforcement, model-tier selection. Auto-invoke for findings tagged `finops`, `cost-anomaly`, `token-budget`, or rule IDs `CODE-W1-COST-*`, and when the user asks about "token cost", "pricing", "context-window optimization", "model tier".
---

# Token Cost Estimator

This skill teaches Claude how to estimate the per-call cost of an LLM invocation, detect cost-anomaly patterns in code and config, and propose cost-reducing rewrites. Used by the `finops` pillar.

## When this skill activates

- Findings in `CODE-W1-COST-*` or any finding the orchestrator routes to the FinOps reviewer
- Tags: `finops`, `cost-anomaly`, `token-budget`, `model-tier`, `context-bloat`
- User phrasings: "how much does this cost per call", "can we use Haiku here", "context too large", "spend forecast"
- File patterns: code with model invocations, RAG pipelines, batch-completion scripts, observability dashboards

## Core concepts

- **Per-call cost formula** — `cost = (input_tokens / 1M) * input_price + (output_tokens / 1M) * output_price`. Cached input has a separate (cheaper) rate where available; cache writes carry a one-time premium. A minimal sketch follows this list.
- **Model tiers** — pick the cheapest model that meets task quality. Tier ladder: small (Haiku/Mini/Flash) → mid (Sonnet/4-mini) → large (Opus/4o/Pro). Default to mid; downshift when a prompt plus few-shot examples is enough.
- **Cost ceilings** — three layers: per-call cap (`max_tokens`), per-task budget (summed across an agent run), per-tenant daily budget (rate limit plus alerting).
- **Prompt caching** — Anthropic prompt caching (5m, 1h) and OpenAI Cached Input (automatic) deliver a 5-10x reduction for stable system prompts. Caching should be on by default for any system prompt over 1k tokens.
- **Context-window discipline** — long contexts are expensive even when irrelevant; retrieval (RAG) plus summarization beats packing the window in both cost and accuracy.
- **Embeddings are 100x cheaper** — for filtering, similarity, and classification, use embeddings, not generation.
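To make the per-call formula, the tier ladder, and the cached-read discount concrete, here is a minimal Python sketch. The price table, the 10% cached-read multiplier, and the `estimate_cost` helper are illustrative assumptions rather than published pricing; substitute your provider's current rates.

```python
from dataclasses import dataclass

# Illustrative per-million-token prices (assumptions, not published rates);
# replace with your provider's current price sheet.
PRICE_TABLE = {
    "small": {"input": 0.80, "output": 4.00},    # Haiku/Mini/Flash class
    "mid":   {"input": 3.00, "output": 15.00},   # Sonnet class
    "large": {"input": 15.00, "output": 75.00},  # Opus/4o/Pro class
}
CACHED_READ_MULTIPLIER = 0.10  # assumed: cached input billed at ~10% of the input rate


@dataclass
class CallCost:
    input_usd: float
    cached_usd: float
    output_usd: float

    @property
    def total(self) -> float:
        return self.input_usd + self.cached_usd + self.output_usd


def estimate_cost(tier: str, input_tokens: int, output_tokens: int,
                  cached_input_tokens: int = 0) -> CallCost:
    """cost = (tokens / 1M) * price, with cached input at a reduced rate."""
    price = PRICE_TABLE[tier]
    uncached = input_tokens - cached_input_tokens
    return CallCost(
        input_usd=uncached / 1_000_000 * price["input"],
        cached_usd=cached_input_tokens / 1_000_000 * price["input"] * CACHED_READ_MULTIPLIER,
        output_usd=output_tokens / 1_000_000 * price["output"],
    )


if __name__ == "__main__":
    # Same token mix as the worked estimate later in this document:
    # 8,200 input tokens (2,000 of them cached), 400 output tokens, mid tier.
    call = estimate_cost("mid", input_tokens=8_200, output_tokens=400,
                         cached_input_tokens=2_000)
    print(f"total per call: ${call.total:.4f}")
```

The `__main__` block reuses the figures from the worked estimate below, so the printed total can be checked against the numbers given there.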
## Detection patterns / cheatsheet

- `max_tokens` not set → unbounded output cost.
- Looping over N items with one `chat.completions.create` call per item, no batching → cost scales from 1x to Nx.
- Retrieval that pulls top-k=50 chunks of 1k tokens each → 50k tokens of context per call.
- System prompt > 4k tokens with no caching directive (`cache_control: ephemeral`).
- Opus/4o/Pro selected for extractive or yes/no tasks where a small model would suffice.
- Sampling with `temperature > 0` for deterministic tasks → wasted retries.
- No retry-backoff budget → exponential cost on a degraded provider.
- Embedding-then-generate pattern where embedding alone would answer (e.g., classification).

## Remediation playbook

1. Always set `max_tokens` per call; clamp it to the smallest value the schema requires.
2. Replace per-item loops with batch APIs (OpenAI Batch, Anthropic batches) when latency permits — 50% discount.
3. Set retrieval `top_k` from the token budget: `top_k = context_budget / chunk_size` (e.g., 4k / 1k = 4); don't blindly pick 50.
4. Apply prompt caching for stable system prompts: split the message into static (cached) and dynamic (uncached) blocks.
5. Pick the model tier with a written rule per task type:
   - Classification/extraction → small.
   - Summarization/Q&A → mid (with caching).
   - Multi-step reasoning, long-form → large (rare; budget alarm).
6. Add a budget guard: a per-tenant daily cap that returns HTTP 429 (with `Retry-After`) when exceeded. A minimal sketch appears after the references.
7. For RAG, cap context at a token budget (e.g., 6k) and let retrieval pick the densest chunks; trim quoted matches before generation.
8. Emit per-call cost into the audit log; a weekly review of the top 1% most expensive calls reveals 80% of the waste.

## Worked estimate

Suppose a RAG Q&A endpoint with a 2k-token system prompt, 6k tokens of retrieved context, a 200-token user question, and a 400-token answer. At Claude Sonnet pricing (illustrative: $3/M input, $15/M output):

- Without caching: input 8,200 tokens → ~$0.025; output 400 tokens → ~$0.006; total ~$0.031.
- With prompt caching on the 2k system prompt: cached reads at the 10% rate → ~$0.0006 for those 2k tokens; a saving of ~$0.005 (16%) per call.
- Switching to Haiku for extractive sub-queries (price floor): often a 10-12x reduction, sometimes a quality drop — eval before committing.

The optimization order is: set `max_tokens` (free, always), enable caching (cheap), trim retrieval (medium), downshift the model (eval first).

## References

- Anthropic prompt caching docs — https://docs.anthropic.com/en/docs/prompt-caching
- OpenAI cached input pricing — https://platform.openai.com/docs/guides/prompt-caching
- ADR-086 §D4 — skill assigned to `finops` agent
- Gadriel cost-policy doc: `.security/compliance/finops/budget.md` (per-tenant)
- Sibling skills: `gadriel-output-schema-library` (short structured outputs reduce cost)
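To make remediation step 6 concrete, here is a minimal sketch of a per-tenant daily budget guard. The `DailyBudgetGuard` class, the cap value, and the in-memory counter are hypothetical illustrations; a production version would keep the counters in shared storage and return the 429 from the API layer.

```python
from datetime import datetime, timedelta, timezone


class DailyBudgetGuard:
    """Per-tenant daily spend cap (remediation step 6). In-memory sketch only;
    a real service would back the counters with shared storage (e.g., Redis)."""

    def __init__(self, daily_cap_usd: float):
        self.daily_cap_usd = daily_cap_usd
        self._spend: dict[tuple[str, str], float] = {}  # (tenant_id, utc_date) -> usd

    def _today(self) -> str:
        return datetime.now(timezone.utc).date().isoformat()

    def check_and_charge(self, tenant_id: str, call_cost_usd: float) -> tuple[int, dict]:
        """Returns (http_status, headers): 200 records the spend, 429 carries Retry-After."""
        key = (tenant_id, self._today())
        spent = self._spend.get(key, 0.0)
        if spent + call_cost_usd > self.daily_cap_usd:
            # Seconds until the budget resets at midnight UTC.
            now = datetime.now(timezone.utc)
            reset = datetime.combine(now.date() + timedelta(days=1),
                                     datetime.min.time(), tzinfo=timezone.utc)
            return 429, {"Retry-After": str(int((reset - now).total_seconds()))}
        self._spend[key] = spent + call_cost_usd
        return 200, {}


# Usage: charge the estimated per-call cost before invoking the model.
guard = DailyBudgetGuard(daily_cap_usd=25.0)
status, headers = guard.check_and_charge("tenant-42", call_cost_usd=0.031)
```

The `Retry-After` header tells clients how long to wait until the budget resets, which pairs with the rate-limit-plus-alerting layer described under cost ceilings.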