---
name: efficient-llm-inference-survey
description: Survey of data-level, model-level, and system-level techniques for efficient LLM inference, with code sketches for quantization, pruning, distillation, KV caching, speculative decoding, and continuous batching
---
# Efficient LLM Inference Survey
## Description
A comprehensive survey on efficient LLM inference techniques. Covers data-level, model-level, and system-level optimizations to address the three main causes of inefficient inference: large model size, quadratic-complexity attention, and auto-regressive decoding.
**Key Topics:**
- Data-level optimization (input pruning, adaptive computation)
- Model-level optimization (quantization, pruning, distillation)
- System-level optimization (KV cache, parallel decoding, memory management)
## Tools Used
- read: Load model configurations
- write: Save optimization settings
- exec: Run benchmarking scripts
- browser: Access optimization tools
- memory_search: Retrieve optimization methods
## Instructions for Agents
### Three Causes of Inefficient Inference
1. **Large Model Size** - Billions of parameters
2. **Quadratic Attention** - O(n²) complexity
3. **Auto-regressive Decoding** - Sequential token generation
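To make these three costs concrete, here is a minimal back-of-the-envelope sketch. The model shape (roughly a 7B-parameter, 32-layer configuration with 32 KV heads of dimension 128) and the 4096-token context are illustrative assumptions, not figures from the survey; the formulas are the standard weight-memory, attention-FLOPs, and KV-cache-size estimates.
```python
def inference_cost_estimate(n_params=7e9, n_layers=32, n_kv_heads=32,
                            head_dim=128, seq_len=4096, bytes_per_value=2):
    """Rough FP16 cost estimates for the three inference bottlenecks."""
    # 1. Large model size: weight memory grows linearly with parameter count
    weight_bytes = n_params * bytes_per_value
    # 2. Quadratic attention: the score matrix is seq_len x seq_len per layer and head
    attn_flops = 2 * n_layers * n_kv_heads * (seq_len ** 2) * head_dim
    # 3. Auto-regressive decoding: the KV cache grows with every generated token
    #    and must be re-read at each decoding step
    kv_cache_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value
    return {
        "weights_gb": weight_bytes / 1e9,
        "attention_score_tflops": attn_flops / 1e12,
        "kv_cache_gb": kv_cache_bytes / 1e9,
    }

print(inference_cost_estimate())
# ~14 GB of weights, ~4.4 TFLOPs of attention score math,
# and ~2.1 GB of KV cache at a 4096-token context
```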
### Optimization Taxonomy
```
Efficient Inference
├── Data-Level
│   ├── Input pruning
│   ├── Adaptive computation
│   └── Prompt optimization
├── Model-Level
│   ├── Quantization
│   ├── Pruning
│   ├── Distillation
│   └── Architecture optimization
└── System-Level
    ├── KV cache optimization
    ├── Parallel decoding
    ├── Memory management
    └── Scheduling strategies
```
## Overview
**Source:** arXiv:2404.14294v3
**Utility:** 0.93
**Scope:** Comprehensive survey with comparative experiments
## Activation Keywords
- efficient LLM inference
- LLM optimization
- model quantization
- KV cache optimization
- inference acceleration
---
## Data-Level Optimization
### Input Pruning
```python
class InputPruner:
    """Drops low-importance input tokens before they reach the model."""

    def calculate_threshold(self, importance_scores, keep_ratio=0.5):
        # One possible policy: keep roughly the top `keep_ratio` fraction of tokens
        ranked = sorted(importance_scores, reverse=True)
        return ranked[max(int(len(ranked) * keep_ratio) - 1, 0)]

    def prune_tokens(self, input_tokens, importance_scores):
        # Remove less important tokens
        threshold = self.calculate_threshold(importance_scores)
        pruned_tokens = [
            t for t, s in zip(input_tokens, importance_scores)
            if s >= threshold
        ]
        return pruned_tokens
```
### Adaptive Computation
```python
class AdaptiveComputation:
    """Early-exit inference: stop at an intermediate layer once confident."""

    def __init__(self, model, early_exit_layers):
        self.model = model
        self.exit_layers = early_exit_layers

    def forward(self, input_ids):
        hidden = input_ids  # assumes embedding lookup happens upstream
        for i, layer in enumerate(self.model.layers):
            # Chain hidden states through the layers
            hidden = layer(hidden)
            # At designated exit layers, return early if the prediction is confident
            if i in self.exit_layers and self.is_confident(hidden):
                return self.exit_classifier(hidden)
        return self.model.final_output(hidden)
```
---
## Model-Level Optimization
### Quantization
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Post-Training Quantization (PTQ): round weights to a low-bit grid after training
def quantize_model(model, bits=8):
    for name, param in model.named_parameters():
        # Symmetric per-tensor scale covering the weight range
        scale = param.abs().max() / (2 ** (bits - 1) - 1)
        quantized = torch.round(param / scale) * scale
        param.data.copy_(quantized)  # write the simulated-quantized weights back
    return model

# Quantization-Aware Training (QAT): simulate quantization in the forward pass
class QuantizedLinear(nn.Module):
    def __init__(self, in_features, out_features, bits=8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.scale = nn.Parameter(torch.ones(1))
        self.bits = bits

    def fake_quantize(self, w):
        # Round to the low-bit grid; straight-through estimator keeps gradients flowing
        qmax = 2 ** (self.bits - 1) - 1
        q = torch.clamp(torch.round(w / self.scale), -qmax, qmax) * self.scale
        return w + (q - w).detach()

    def forward(self, x):
        # Fake quantization during training
        qweight = self.fake_quantize(self.weight)
        return F.linear(x, qweight)
```
### Pruning
```python
# Structured Pruning: remove whole attention heads
def prune_heads(model, heads_to_prune):
    # heads_to_prune maps layer index -> list of head indices to drop
    for layer_idx, heads in heads_to_prune.items():
        model.layers[layer_idx].attention.prune_heads(heads)
    return model

# Unstructured Pruning: zero out individual low-magnitude weights
def magnitude_pruning(model, sparsity=0.5):
    for name, param in model.named_parameters():
        # Keep weights above the sparsity-quantile of absolute magnitude
        mask = param.abs() > param.abs().quantile(sparsity)
        param.data *= mask.to(param.dtype)
    return model
```
### Knowledge Distillation
```python
import torch.nn.functional as F

class DistillationTrainer:
    """Trains a small student to match a large teacher's output distribution."""

    def __init__(self, teacher, student, temperature=4.0):
        self.teacher = teacher
        self.student = student
        self.temperature = temperature

    def distill_loss(self, student_logits, teacher_logits, labels):
        # Soft target loss: match the teacher's temperature-softened distribution
        soft_loss = F.kl_div(
            F.log_softmax(student_logits / self.temperature, dim=-1),
            F.softmax(teacher_logits / self.temperature, dim=-1),
            reduction='batchmean'
        ) * (self.temperature ** 2)
        # Hard target loss: standard cross-entropy against ground-truth labels
        hard_loss = F.cross_entropy(student_logits, labels)
        return soft_loss + hard_loss
```
---
## System-Level Optimization
### KV Cache Optimization
```python
from collections import OrderedDict

class KVCacheManager:
    """Bounded KV cache with a simple LRU eviction policy."""

    def __init__(self, max_cache_size, eviction_policy='lru'):
        self.cache = OrderedDict()  # (layer_idx, token_idx) -> (k, v)
        self.max_size = max_cache_size
        self.policy = eviction_policy

    def get_kv(self, layer_idx, token_idx):
        key = (layer_idx, token_idx)
        if key in self.cache:
            self.cache.move_to_end(key)  # mark as recently used
            return self.cache[key]
        return None

    def put_kv(self, layer_idx, token_idx, k, v):
        if len(self.cache) >= self.max_size:
            self.evict()
        self.cache[(layer_idx, token_idx)] = (k, v)

    def evict(self):
        if self.policy == 'lru':
            # Drop the least recently used entry (front of the ordered dict)
            self.cache.popitem(last=False)
```
### Speculative Decoding
```python
import torch

class SpeculativeDecoder:
    """Draft model proposes several tokens; target model verifies them in one pass."""

    def __init__(self, target_model, draft_model, num_speculative=5):
        self.target = target_model
        self.draft = draft_model
        self.k = num_speculative

    def generate(self, input_ids):
        while not self.is_eos(input_ids):
            # Draft model cheaply generates k candidate tokens
            draft_tokens = self.draft.generate(input_ids, self.k)
            # Target model scores all k candidates in a single parallel forward pass
            target_probs = self.target.verify(input_ids, draft_tokens)
            # Keep the longest prefix of candidates the target model agrees with
            accepted = self.accept_tokens(draft_tokens, target_probs)
            input_ids = torch.cat([input_ids, accepted])
        return input_ids
```
### Continuous Batching
```python
class ContinuousBatcher:
    """Rebuilds the batch every step so new requests join and finished ones leave."""

    def __init__(self, model, max_batch_size=32):
        self.model = model
        self.max_batch = max_batch_size
        self.requests = []

    def add_request(self, request):
        # New requests can join the batch at any decoding step
        self.requests.append(request)

    def step(self):
        # Process all active requests together in one forward pass
        batch = self.prepare_batch(self.requests[:self.max_batch])
        outputs = self.model(batch)
        # Remove completed requests so their slots free up immediately
        self.requests = [r for r in self.requests if not r.completed]
        return outputs
```
---
## Optimization Comparison
| Method | Latency Reduction | Memory Savings | Quality Impact |
|--------|-------------------|----------------|----------------|
| INT8 Quantization | 2-3x | 2x | Minimal |
| INT4 Quantization | 3-4x | 4x | Small |
| Pruning (50%) | 1.5-2x | 2x | Moderate |
| Distillation | 2-10x | 2-10x | Small |
| Speculative Decoding | 2-3x | Minimal | None |
| KV Cache Optimization | 1.5-2x | Varies with sequence length | None |
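As a quick sanity check on the memory-savings column, the sketch below works out weight memory at different precisions; the 7B-parameter model size is an illustrative assumption, not a figure from the survey. The 2x and 4x savings follow directly from the bit-width ratio.
```python
PARAMS = 7e9  # hypothetical 7B-parameter model

def weight_memory_gb(bits_per_weight):
    # Parameters * bits / 8 bits-per-byte, expressed in GB
    return PARAMS * bits_per_weight / 8 / 1e9

for name, bits in [("FP16 baseline", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(bits):.1f} GB")
# FP16 baseline: 14.0 GB, INT8: 7.0 GB (2x smaller), INT4: 3.5 GB (4x smaller)
```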
---
## Best Practices
1. **Combine optimizations** - Quantization + KV cache + batching
2. **Profile first** - Identify bottlenecks before optimizing (a minimal profiling sketch follows this list)
3. **Benchmark quality** - Ensure acceptable degradation
4. **Consider hardware** - Match optimization to deployment platform
5. **Iterative refinement** - Start with simple, add complexity as needed
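A minimal sketch of the "profile first" step, assuming a Hugging Face transformers-style model and tokenizer; the function name and the prefill/decode split are illustrative, not a prescribed benchmarking method.
```python
import time
import torch

def _sync():
    # CUDA kernels run asynchronously; synchronize before reading the clock
    if torch.cuda.is_available():
        torch.cuda.synchronize()

def profile_generation(model, tokenizer, prompt, max_new_tokens=128):
    """Rough split of prefill (time-to-first-token) vs. per-token decode latency."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Prefill: one forward pass over the full prompt
    _sync()
    start = time.perf_counter()
    with torch.no_grad():
        model(**inputs)
    _sync()
    prefill_s = time.perf_counter() - start

    # Decode: full generation, then amortize the remaining time over the new tokens
    start = time.perf_counter()
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    _sync()
    total_s = time.perf_counter() - start
    new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]

    return {
        "prefill_s": prefill_s,
        "decode_s_per_token": (total_s - prefill_s) / max(new_tokens, 1),
    }
```
Comparing the two numbers shows whether latency is dominated by prompt processing (compute-bound prefill) or by token-by-token generation (memory-bound decode), which points to different rows of the optimization table above.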
---
## Deployment Strategies
| Scenario | Recommended Optimizations |
|----------|---------------------------|
| Edge devices | INT4 quantization + pruning |
| Cloud serving | KV cache + continuous batching |
| Batch processing | Speculative decoding + parallelism |
| Real-time | All levels combined |
---
## Examples
### Example 1: Basic Application
**User:** I need to apply Efficient LLM Inference Survey to my analysis.
**Agent:** I'll help you apply efficient-llm-inference-survey. First, let me understand your specific use case...
**Context:** Apply the methodology
### Example 2: Advanced Scenario
**User:** I have a complex analysis scenario that needs the advanced application of this methodology.
**Agent:** Based on the methodology, I'll guide you through the advanced application...
### Example 3: Key Considerations
**User:** What are the key considerations for efficient-llm-inference-survey?
**Agent:** Let me search for the latest research and best practices...
## References
- Paper: https://arxiv.org/abs/2404.14294
- DOI: https://doi.org/10.48550/arXiv.2404.14294
---
**Created:** 2026-03-28
**Source:** arXiv:2404.14294v3 - "A Survey on Efficient Inference for Large Language Models"