---
name: efficient-llm-inference-survey
description: Survey of data-level, model-level, and system-level techniques for efficient LLM inference, with code sketches for quantization, pruning, distillation, KV caching, speculative decoding, and continuous batching
---
# Efficient LLM Inference Survey
## Description
A comprehensive survey on efficient LLM inference techniques. Covers data-level, model-level, and system-level optimizations to address the three main causes of inefficient inference: large model size, quadratic-complexity attention, and auto-regressive decoding.
**Key Topics:**
- Data-level optimization (input pruning, adaptive computation)
- Model-level optimization (quantization, pruning, distillation)
- System-level optimization (KV cache, parallel decoding, memory management)
## Tools Used
- read: Load model configurations
- write: Save optimization settings
- exec: Run benchmarking scripts
- browser: Access optimization tools
- memory_search: Retrieve optimization methods
## Instructions for Agents
### Three Causes of Inefficient Inference
1. **Large Model Size** - Billions of parameters
2. **Quadratic Attention** - O(n²) complexity
3. **Auto-regressive Decoding** - Sequential token generation
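To make these three costs concrete, here is a minimal back-of-the-envelope sketch. The model shape (roughly a 7B-parameter, 32-layer configuration with 32 KV heads of dimension 128) and the 4096-token context are illustrative assumptions, not figures from the survey; the formulas are the standard weight-memory, attention-FLOPs, and KV-cache-size estimates.
```python
def inference_cost_estimate(n_params=7e9, n_layers=32, n_kv_heads=32,
                            head_dim=128, seq_len=4096, bytes_per_value=2):
    """Rough FP16 cost estimates for the three inference bottlenecks."""
    # 1. Large model size: weight memory grows linearly with parameter count
    weight_bytes = n_params * bytes_per_value
    # 2. Quadratic attention: the score matrix is seq_len x seq_len per layer and head
    attn_flops = 2 * n_layers * n_kv_heads * (seq_len ** 2) * head_dim
    # 3. Auto-regressive decoding: the KV cache grows with every generated token
    #    and must be re-read at each decoding step
    kv_cache_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value
    return {
        "weights_gb": weight_bytes / 1e9,
        "attention_score_tflops": attn_flops / 1e12,
        "kv_cache_gb": kv_cache_bytes / 1e9,
    }

print(inference_cost_estimate())
# ~14 GB of weights, ~4.4 TFLOPs of attention score math,
# and ~2.1 GB of KV cache at a 4096-token context
```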
### Optimization Taxonomy
```
Efficient Inference
├── Data-Level
│   ├── Input pruning
│   ├── Adaptive computation
│   └── Prompt optimization
├── Model-Level
│   ├── Quantization
│   ├── Pruning
│   ├── Distillation
│   └── Architecture optimization
└── System-Level
    ├── KV cache optimization
    ├── Parallel decoding
    ├── Memory management
    └── Scheduling strategies
```
## Overview
**Source:** arXiv:2404.14294v3
**Utility:** 0.93
**Scope:** Comprehensive survey with comparative experiments
## Activation Keywords
- efficient LLM inference
- LLM optimization
- model quantization
- KV cache optimization
- inference acceleration
---
## Data-Level Optimization
### Input Pruning
```python
class InputPruner:
    """Drops low-importance input tokens before they reach the model."""

    def calculate_threshold(self, importance_scores, keep_ratio=0.5):
        # One possible policy: keep roughly the top `keep_ratio` fraction of tokens
        ranked = sorted(importance_scores, reverse=True)
        return ranked[max(int(len(ranked) * keep_ratio) - 1, 0)]

    def prune_tokens(self, input_tokens, importance_scores):
        # Remove less important tokens
        threshold = self.calculate_threshold(importance_scores)
        pruned_tokens = [
            t for t, s in zip(input_tokens, importance_scores)
            if s >= threshold
        ]
        return pruned_tokens
```
### Adaptive Computation
```python
class AdaptiveComputation:
    """Early-exit inference: stop at an intermediate layer once confident."""

    def __init__(self, model, early_exit_layers):
        self.model = model
        self.exit_layers = early_exit_layers

    def forward(self, input_ids):
        hidden = input_ids  # assumes embedding lookup happens upstream
        for i, layer in enumerate(self.model.layers):
            # Chain hidden states through the layers
            hidden = layer(hidden)
            # At designated exit layers, return early if the prediction is confident
            if i in self.exit_layers and self.is_confident(hidden):
                return self.exit_classifier(hidden)
        return self.model.final_output(hidden)
```
---
## Model-Level Optimization
### Quantization
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Post-Training Quantization (PTQ): round weights to a low-bit grid after training
def quantize_model(model, bits=8):
    for name, param in model.named_parameters():
        # Symmetric per-tensor scale covering the weight range
        scale = param.abs().max() / (2 ** (bits - 1) - 1)
        quantized = torch.round(param / scale) * scale
        param.data.copy_(quantized)  # write the simulated-quantized weights back
    return model

# Quantization-Aware Training (QAT): simulate quantization in the forward pass
class QuantizedLinear(nn.Module):
    def __init__(self, in_features, out_features, bits=8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.scale = nn.Parameter(torch.ones(1))
        self.bits = bits

    def fake_quantize(self, w):
        # Round to the low-bit grid; straight-through estimator keeps gradients flowing
        qmax = 2 ** (self.bits - 1) - 1
        q = torch.clamp(torch.round(w / self.scale), -qmax, qmax) * self.scale
        return w + (q - w).detach()

    def forward(self, x):
        # Fake quantization during training
        qweight = self.fake_quantize(self.weight)
        return F.linear(x, qweight)
```
### Pruning
```python
# Structured Pruning: remove whole attention heads
def prune_heads(model, heads_to_prune):
    # heads_to_prune maps layer index -> list of head indices to drop
    for layer_idx, heads in heads_to_prune.items():
        model.layers[layer_idx].attention.prune_heads(heads)
    return model

# Unstructured Pruning: zero out individual low-magnitude weights
def magnitude_pruning(model, sparsity=0.5):
    for name, param in model.named_parameters():
        # Keep weights above the sparsity-quantile of absolute magnitude
        mask = param.abs() > param.abs().quantile(sparsity)
        param.data *= mask.to(param.dtype)
    return model
```
### Knowledge Distillation
```python
import torch.nn.functional as F

class DistillationTrainer:
    """Trains a small student to match a large teacher's output distribution."""

    def __init__(self, teacher, student, temperature=4.0):
        self.teacher = teacher
        self.student = student
        self.temperature = temperature

    def distill_loss(self, student_logits, teacher_logits, labels):
        # Soft target loss: match the teacher's temperature-softened distribution
        soft_loss = F.kl_div(
            F.log_softmax(student_logits / self.temperature, dim=-1),
            F.softmax(teacher_logits / self.temperature, dim=-1),
            reduction='batchmean'
        ) * (self.temperature ** 2)
        # Hard target loss: standard cross-entropy against ground-truth labels
        hard_loss = F.cross_entropy(student_logits, labels)
        return soft_loss + hard_loss
```
---
## System-Level Optimization
### KV Cache Optimization
```python
from collections import OrderedDict

class KVCacheManager:
    """Bounded KV cache with a simple LRU eviction policy."""

    def __init__(self, max_cache_size, eviction_policy='lru'):
        self.cache = OrderedDict()  # (layer_idx, token_idx) -> (k, v)
        self.max_size = max_cache_size
        self.policy = eviction_policy

    def get_kv(self, layer_idx, token_idx):
        key = (layer_idx, token_idx)
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as recently used
            return self.cache[key]
        return None

    def put_kv(self, layer_idx, token_idx, k, v):
        if len(self.cache) >= self.max_size:
            self.evict()
        self.cache[(layer_idx, token_idx)] = (k, v)

    def evict(self):
        if self.policy == 'lru':
            # Drop the least recently used entry (front of the ordered dict)
            self.cache.popitem(last=False)
```
### Speculative Decoding
```python
import torch

class SpeculativeDecoder:
    """Draft model proposes several tokens; target model verifies them in one pass."""

    def __init__(self, target_model, draft_model, num_speculative=5):
        self.target = target_model
        self.draft = draft_model
        self.k = num_speculative

    def generate(self, input_ids):
        while not self.is_eos(input_ids):
            # Draft model cheaply generates k candidate tokens
            draft_tokens = self.draft.generate(input_ids, self.k)
            # Target model scores all k candidates in a single parallel forward pass
            target_probs = self.target.verify(input_ids, draft_tokens)
            # Keep the longest prefix of candidates the target model agrees with
            accepted = self.accept_tokens(draft_tokens, target_probs)
            input_ids = torch.cat([input_ids, accepted])
        return input_ids
```
### Continuous Batching
```python
class ContinuousBatcher:
    """Rebuilds the batch every step so new requests join and finished ones leave."""

    def __init__(self, model, max_batch_size=32):
        self.model = model
        self.max_batch = max_batch_size
        self.requests = []

    def add_request(self, request):
        # New requests can join the batch at any decoding step
        self.requests.append(request)

    def step(self):
        # Process all active requests together in one forward pass
        batch = self.prepare_batch(self.requests[:self.max_batch])
        outputs = self.model(batch)
        # Remove completed requests so their slots free up immediately
        self.requests = [r for r in self.requests if not r.completed]
        return outputs
```
---
## Optimization Comparison
| Method | Latency Reduction | Memory Savings | Quality Impact |
|--------|-------------------|----------------|----------------|
| INT8 Quantization | 2-3x | 2x | Minimal |
| INT4 Quantization | 3-4x | 4x | Small |
| Pruning (50%) | 1.5-2x | 2x | Moderate |
| Distillation | 2-10x | 2-10x | Small |
| Speculative Decoding | 2-3x | Minimal | None |
| KV Cache Optimization | 1.5-2x | Varies with sequence length | None |
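As a quick sanity check on the memory-savings column, the sketch below works out weight memory at different precisions; the 7B-parameter model size is an illustrative assumption, not a figure from the survey. The 2x and 4x savings follow directly from the bit-width ratio.
```python
PARAMS = 7e9  # hypothetical 7B-parameter model

def weight_memory_gb(bits_per_weight):
    # Parameters * bits / 8 bits-per-byte, expressed in GB
    return PARAMS * bits_per_weight / 8 / 1e9

for name, bits in [("FP16 baseline", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(bits):.1f} GB")
# FP16 baseline: 14.0 GB, INT8: 7.0 GB (2x smaller), INT4: 3.5 GB (4x smaller)
```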
---
## Best Practices
1. **Combine optimizations** - Quantization + KV cache + batching
2. **Profile first** - Identify bottlenecks before optimizing (a minimal profiling sketch follows this list)
3. **Benchmark quality** - Ensure acceptable degradation
4. **Consider hardware** - Match optimization to deployment platform
5. **Iterative refinement** - Start with simple, add complexity as needed
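A minimal sketch of the "profile first" step, assuming a Hugging Face transformers-style model and tokenizer; the function name and the prefill/decode split are illustrative, not a prescribed benchmarking method.
```python
import time
import torch

def _sync():
    # CUDA kernels run asynchronously; synchronize before reading the clock
    if torch.cuda.is_available():
        torch.cuda.synchronize()

def profile_generation(model, tokenizer, prompt, max_new_tokens=128):
    """Rough split of prefill (time-to-first-token) vs. per-token decode latency."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Prefill: one forward pass over the full prompt
    _sync()
    start = time.perf_counter()
    with torch.no_grad():
        model(**inputs)
    _sync()
    prefill_s = time.perf_counter() - start

    # Decode: full generation, then amortize the remaining time over the new tokens
    start = time.perf_counter()
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    _sync()
    total_s = time.perf_counter() - start
    new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]

    return {
        "prefill_s": prefill_s,
        "decode_s_per_token": (total_s - prefill_s) / max(new_tokens, 1),
    }
```
Comparing the two numbers shows whether latency is dominated by prompt processing (compute-bound prefill) or by token-by-token generation (memory-bound decode), which points to different rows of the optimization table above.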
---
## Deployment Strategies
| Scenario | Recommended Optimizations |
|----------|---------------------------|
| Edge devices | INT4 quantization + pruning |
| Cloud serving | KV cache + continuous batching |
| Batch processing | Speculative decoding + parallelism |
| Real-time | All levels combined |
---
## Examples
### Example 1: Basic Application
**User:** I need to apply Efficient LLM Inference Survey to my analysis.
**Agent:** I'll help you apply efficient-llm-inference-survey. First, let me understand your specific use case...
**Context:** Apply the methodology
### Example 2: Advanced Scenario
**User:** I have a complex analysis scenario that needs the advanced application of this methodology.
**Agent:** Based on the methodology, I'll guide you through the advanced application...
### Example 3: Key Considerations
**User:** What are the key considerations for efficient-llm-inference-survey?
**Agent:** Let me search for the latest research and best practices...
## References
- Paper: https://arxiv.org/abs/2404.14294
- DOI: https://doi.org/10.48550/arXiv.2404.14294
---
**Created:** 2026-03-28
**Source:** arXiv:2404.14294v3 - "A Survey on Efficient Inference for Large Language Models"