---
name: transcribe-anything
description: |
  Transcribes audio and video files to text using pluggable ASR backends. Default backend is local whisper CLI (openai-whisper). Supports whisperX (with diarization), insanely-fast-whisper, faster-whisper, whisper.cpp, OpenAI Whisper API, Groq Whisper API, Deepgram, AssemblyAI, Gemini, and Hugging Face models. Handles very long files (1-8+ hours) by preprocessing with ffmpeg: extracts audio from video, converts to optimal ASR format, detects and skips silence, and chunks for API size limits. Supports speaker diarization, word-level timestamps, custom vocabulary, and multiple output formats. Use this skill when someone says "transcribe this", "convert to text", "speech to text", "get the transcript", "transcribe this video/audio/podcast/recording", or provides a media file and wants text output.
license: MIT
compatibility: |
  Requires macOS (Apple Silicon recommended) with ffmpeg installed. At minimum one ASR backend must be available — openai-whisper is the default (pip install openai-whisper). For diarization, whisperX is recommended (pip install whisperx). For cloud backends, relevant API keys must be set as environment variables.
metadata:
  author: swyxio
  version: "1.0"
  last-updated: "2026-03-28"
  hardware: Apple Silicon (M-series), also works on CUDA GPUs and CPU
  primary-tools: ffmpeg, whisper, whisperx, yt-dlp
---
# Transcribe Anything
Transcribes audio and video files to text. Pluggable backends, silence skipping for long files, optional speaker diarization, and multiple output formats.
## Setup
### Required (install these first)
```bash
# ffmpeg — audio extraction, preprocessing, silence detection
brew install ffmpeg
# yt-dlp — downloading video/audio from URLs (optional but recommended)
brew install yt-dlp
# Default ASR backend — OpenAI's whisper CLI
pip3 install --break-system-packages openai-whisper
```
### Recommended Extras
```bash
# curl_cffi — prevents OAuth errors when downloading private videos
pip3 install --break-system-packages curl_cffi
# faster-whisper — 4x faster than whisper, built-in VAD silence skipping, lower memory
# Best local backend for long files (1hr+)
pip3 install --break-system-packages faster-whisper
# whisperX — adds speaker diarization + precise word-level timestamps
# Bundles faster-whisper + pyannote alignment
pip3 install --break-system-packages whisperx
```
**whisperX diarization setup** (one-time):
1. Create a Hugging Face account at https://huggingface.co
2. Accept the terms for these gated models:
- https://huggingface.co/pyannote/speaker-diarization-3.1
- https://huggingface.co/pyannote/segmentation-3.0
3. Create an access token at https://huggingface.co/settings/tokens
4. Set `export HF_TOKEN=hf_...` in your shell profile
Without HF token access, whisperX still works for transcription and word alignment — just no speaker labels.
### Other Local Backends (optional, pick what you need)
```bash
# insanely-fast-whisper — batched GPU inference, 10-20x faster on NVIDIA GPUs
pip3 install --break-system-packages insanely-fast-whisper
# whisper.cpp — C++ native with Metal acceleration on Apple Silicon
# Best option if you want to avoid Python entirely
brew install whisper-cpp
```
### Cloud API Keys (optional)
Set these environment variables if you want to use cloud backends. None are required — local whisper works out of the box.
```bash
# OpenAI — best accuracy with gpt-4o-transcribe ($0.006/min)
export OPENAI_API_KEY=sk-...
# Groq — cheapest and fastest cloud option ($0.00004/min with turbo)
export GROQ_API_KEY=gsk_...
# Deepgram — best cloud diarization ($0.0043/min)
export DEEPGRAM_API_KEY=...
# AssemblyAI — cloud diarization + auto-chapters ($0.0062/min)
export ASSEMBLYAI_API_KEY=...
# Gemini — handles 9.5hr files natively, flexible prompting
export GEMINI_API_KEY=...
```
### Verify Your Setup
Run this to check what's available:
```bash
echo "=== Required ==="
which ffmpeg && echo "ffmpeg: OK" || echo "ffmpeg: MISSING (brew install ffmpeg)"
which whisper && echo "whisper: OK" || echo "whisper: MISSING (pip3 install --break-system-packages openai-whisper)"
echo ""
echo "=== Local Backends ==="
which whisperx && echo "whisperx: OK" || echo "whisperx: not installed"
python3 -c "import faster_whisper" 2>/dev/null && echo "faster-whisper: OK" || echo "faster-whisper: not installed"
which insanely-fast-whisper 2>/dev/null && echo "insanely-fast-whisper: OK" || echo "insanely-fast-whisper: not installed"
which whisper-cpp 2>/dev/null && echo "whisper.cpp: OK" || echo "whisper.cpp: not installed"
echo ""
echo "=== Cloud APIs ==="
[ -n "$OPENAI_API_KEY" ] && echo "OpenAI: configured" || echo "OpenAI: not set"
[ -n "$GROQ_API_KEY" ] && echo "Groq: configured" || echo "Groq: not set"
[ -n "$DEEPGRAM_API_KEY" ] && echo "Deepgram: configured" || echo "Deepgram: not set"
[ -n "$ASSEMBLYAI_API_KEY" ] && echo "AssemblyAI: configured" || echo "AssemblyAI: not set"
[ -n "$GEMINI_API_KEY" ] && echo "Gemini: configured" || echo "Gemini: not set"
echo ""
echo "=== Optional ==="
which yt-dlp && echo "yt-dlp: OK" || echo "yt-dlp: not installed (brew install yt-dlp)"
python3 -c "import curl_cffi" 2>/dev/null && echo "curl_cffi: OK" || echo "curl_cffi: not installed"
[ -n "$HF_TOKEN" ] && echo "HF token: configured (diarization ready)" || echo "HF token: not set (no diarization)"
```
## Backend Selection Guide
Pick the backend based on the user's needs:
| Scenario | Backend | Why |
|----------|---------|-----|
| Default / just works | `whisper` | Already installed, good quality |
| Need speaker labels | `whisperx` | Integrated diarization + word alignment |
| Very long file, local | `faster-whisper` | VAD silence skipping, low memory |
| Maximum speed, local GPU | `insanely-fast-whisper` | Batched inference, 10-20x faster |
| Apple Silicon, no Python | `whisper.cpp` | Metal acceleration, pure C++ |
| Cheapest cloud, fast | `groq` | $0.00004/min with turbo model |
| Best cloud accuracy | `openai` | gpt-4o-transcribe model |
| Cloud with diarization | `deepgram` or `assemblyai` | Native speaker labels |
| Flexible Q&A over audio | `gemini` | Can ask questions, not just transcribe |
If the user doesn't specify, use this priority (a selection sketch follows the list):
1. `whisper` (already installed locally)
2. `whisperx` (if diarization requested and installed)
3. `openai` API (if OPENAI_API_KEY is set and file is manageable)
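A minimal sketch of that priority in Python (the `pick_backend` helper is hypothetical; the CLI names and environment variable are the real ones):
```python
# Sketch: pick a backend following the priority above.
import os
import shutil

def pick_backend(want_diarization: bool = False) -> str:
    # whisperx only if diarization was requested and it is installed
    if want_diarization and shutil.which("whisperx"):
        return "whisperx"
    # default local backend
    if shutil.which("whisper"):
        return "whisper"
    # fall back to the OpenAI API if a key is configured
    if os.environ.get("OPENAI_API_KEY"):
        return "openai"
    raise RuntimeError("No ASR backend found; pip3 install openai-whisper")
```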
## Step-by-Step Workflow
### Step 1: Identify the Input
Accept any of these input types:
- Audio files: mp3, wav, flac, ogg, m4a, opus, wma, aac
- Video files: mp4, mkv, webm, mov, avi, wmv
- URLs: Use the download-video skill first, or yt-dlp directly
If the input is a URL:
```bash
yt-dlp -x --audio-format wav -o "%(title)s.%(ext)s" "{url}"
```
### Step 2: Preprocess Audio with ffmpeg
**Always preprocess.** This step is critical for quality and speed.
```bash
# Extract audio from video (or re-encode audio) to ASR-optimal format
ffmpeg -i "{input}" \
-vn \
-ac 1 \
-ar 16000 \
-acodec pcm_s16le \
-af "highpass=f=80,lowpass=f=8000,loudnorm=I=-16:TP=-1.5:LRA=11" \
"{output_stem}_preprocessed.wav"
```
Flags explained:
- `-vn` — strip video
- `-ac 1` — mono (stereo wastes processing time, no ASR benefit)
- `-ar 16000` — 16kHz (what whisper expects internally)
- `-acodec pcm_s16le` — 16-bit WAV
- `highpass=f=80` — remove rumble below speech range
- `lowpass=f=8000` — remove hiss above speech range
- `loudnorm` — normalize volume (critical for variable-volume recordings)
**Check duration after preprocessing:**
```bash
DURATION=$(ffprobe -v error -show_entries format=duration -of csv=p=0 "{preprocessed_file}" | cut -d. -f1)
echo "Duration: ${DURATION}s ($((DURATION / 3600))h $(((DURATION % 3600) / 60))m)"
```
### Step 3: Handle Long Files (>30 minutes)
For files over 30 minutes, apply silence detection and chunking. This is especially important for 1-8 hour recordings.
#### 3a: Silence Analysis
```bash
# Detect silent regions (informational — see what we're working with)
ffmpeg -i "{preprocessed_file}" \
-af silencedetect=noise=-30dB:d=2.0 \
-f null - 2>&1 | grep -c "silence_end"
# Shows number of silence gaps >= 2 seconds
```
Silence threshold guide:
- `-30dB` — clean recordings (studio, podcast)
- `-35dB` — moderate background noise
- `-40dB` — noisy environments
#### 3b: For Local Backends (whisper, whisperx, faster-whisper)
Local backends handle long files natively — no need to chunk. But use VAD to skip silence:
**With faster-whisper (built-in VAD):**
```python
from faster_whisper import WhisperModel
model = WhisperModel("large-v3", device="cpu", compute_type="int8")
segments, info = model.transcribe(
"preprocessed.wav",
language="en",
word_timestamps=True,
vad_filter=True,
vad_parameters=dict(
min_silence_duration_ms=1000,
speech_pad_ms=400,
threshold=0.5,
),
condition_on_previous_text=False, # prevents hallucination cascades on long files
)
```
**With whisper CLI (no built-in VAD — preprocess silence out):**
```bash
# Remove silences longer than 2s, keeping 0.3s padding
ffmpeg -i "{preprocessed_file}" \
-af "silenceremove=start_periods=1:start_threshold=-30dB:stop_periods=-1:stop_duration=2.0:stop_threshold=-30dB" \
"{output_stem}_trimmed.wav"
# Then transcribe the trimmed file
whisper "{output_stem}_trimmed.wav" --model turbo --language en \
--condition_on_previous_text False \
--word_timestamps True \
--output_format json \
--output_dir ./
```
**Important for long files:** Always use `--condition_on_previous_text False` with whisper on files over 30 minutes. Without this, a single hallucination can cascade and corrupt hours of transcript (whisper repeats the same phrase endlessly).
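A quick way to spot a cascade after the fact (a sketch over whisper's JSON output; the file name is a placeholder):
```python
# Sketch: flag runs of identical consecutive segments — the signature of a
# hallucination cascade — in whisper's JSON output.
import json
from itertools import groupby

segments = json.load(open("transcript.json"))["segments"]
for text, run in groupby(segments, key=lambda s: s["text"].strip()):
    count = len(list(run))
    if count >= 5:  # 5+ identical segments in a row is almost always a cascade
        print(f"Suspect cascade: {count}x {text!r}")
```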
#### 3c: For Cloud APIs (25MB file size limit)
Cloud APIs (OpenAI, Groq) have a 25MB limit. Compress first, then chunk if needed.
```bash
# Compress to opus (smallest format for speech) — 1 hour ≈ 14MB
ffmpeg -i "{preprocessed_file}" -ac 1 -ar 16000 -c:a libopus -b:a 32k "{output_stem}.ogg"
# Check file size
SIZE_MB=$(du -m "{output_stem}.ogg" | cut -f1)
echo "File size: ${SIZE_MB}MB"
```
If the compressed file is under 25MB, send it directly. Otherwise, chunk it. Note that `-f segment -segment_time` cuts at fixed intervals, not silence boundaries; for silence-aligned splits, see the sketch after the block.
```bash
# Split into ~20-minute chunks (each well under 25MB at opus 32kbps)
ffmpeg -i "{output_stem}.ogg" \
-f segment \
-segment_time 1200 \
-c copy \
"{output_stem}_chunk_%03d.ogg"
```
For each chunk, track the start offset for timestamp correction:
```bash
# Get duration of each chunk for timestamp reassembly
for f in {output_stem}_chunk_*.ogg; do
dur=$(ffprobe -v error -show_entries format=duration -of csv=p=0 "$f")
echo "$f: ${dur}s"
done
```
### Step 4: Transcribe
#### Backend: whisper (default)
```bash
whisper "{input_file}" \
--model turbo \
--language en \
--output_format json \
--output_dir "{output_dir}" \
--word_timestamps True \
--condition_on_previous_text False \
--fp16 False
```
Model selection for Apple Silicon (CPU — no CUDA):
- `turbo` — best balance of speed and quality (recommended default)
- `large-v3` — highest quality, 2-3x slower than turbo
- `medium.en` — faster, English-only, good for clear speech
- `small.en` — fast, acceptable quality for clean recordings
- `base.en` — fastest, use only for quick previews
Note: `--fp16 False` is required on CPU (Apple Silicon without MLX). Whisper defaults to fp16 which only works on CUDA.
#### Backend: whisperx (with diarization)
```bash
whisperx "{input_file}" \
--model large-v3 \
--language en \
--diarize \
--min_speakers 2 \
--max_speakers 6 \
--hf_token "{HF_TOKEN}" \
--compute_type int8 \
--output_dir "{output_dir}" \
--output_format json
```
If no HF token is available, whisperX still works for transcription and word alignment, just without diarization:
```bash
whisperx "{input_file}" \
--model large-v3 \
--language en \
--compute_type int8 \
--output_dir "{output_dir}" \
--output_format json
```
#### Backend: faster-whisper (Python, best for long files)
```python
from faster_whisper import WhisperModel
model = WhisperModel("large-v3", device="cpu", compute_type="int8")
segments, info = model.transcribe(
"{input_file}",
language="en",
beam_size=5,
word_timestamps=True,
vad_filter=True,
vad_parameters=dict(min_silence_duration_ms=1000),
condition_on_previous_text=False,
)
for segment in segments:
print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```
#### Backend: insanely-fast-whisper (GPU batched)
```bash
insanely-fast-whisper \
--file-name "{input_file}" \
--model-name openai/whisper-large-v3-turbo \
--task transcribe \
--language en \
--batch-size 24 \
--timestamp word \
--transcript-path "{output_stem}.json"
```
#### Backend: whisper.cpp (Metal acceleration on Apple Silicon)
```bash
# Download a model first. The helper name varies by install: upstream
# whisper.cpp ships models/download-ggml-model.sh; the Homebrew install
# may expose it as the command below.
whisper-cpp-download-model large-v3
# Transcribe with Metal GPU acceleration
whisper-cpp \
-m ~/.local/share/whisper-cpp/ggml-large-v3.bin \
-f "{preprocessed_wav}" \
-l en \
-t 8 \
--output-json \
--print-progress
```
Note: whisper.cpp requires WAV input (not mp3/ogg). Always preprocess to WAV first.
#### Backend: OpenAI API
```bash
curl -s https://api.openai.com/v1/audio/transcriptions \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-H "Content-Type: multipart/form-data" \
-F file="@{input_file}" \
-F model="gpt-4o-transcribe" \
-F language="en" \
-F response_format="verbose_json" \
-F 'timestamp_granularities[]=word' \
-F 'timestamp_granularities[]=segment' \
> "{output_stem}_openai.json"
```
Note: `verbose_json` and `timestamp_granularities` are documented for `whisper-1`; if `gpt-4o-transcribe` rejects them, drop them or set `model=whisper-1` for timestamped output.
For multiple chunks, loop over them (the loop accumulates `OFFSET`; the sketch after the block applies it when merging):
```bash
OFFSET=0
for chunk in {output_stem}_chunk_*.ogg; do
curl -s https://api.openai.com/v1/audio/transcriptions \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-F file="@$chunk" \
-F model="gpt-4o-transcribe" \
-F language="en" \
-F response_format="verbose_json" \
-F 'timestamp_granularities[]=segment' \
> "${chunk%.ogg}_transcript.json"
DUR=$(ffprobe -v error -show_entries format=duration -of csv=p=0 "$chunk")
OFFSET=$(echo "$OFFSET + $DUR" | bc)
done
```
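The loop above only tracks `OFFSET`; to produce one corrected transcript, shift each chunk's segments by the running offset when merging (a sketch; file names follow the chunk naming above):
```python
# Sketch: merge per-chunk transcripts, shifting timestamps by the total
# duration of all preceding chunks.
import glob
import json
import subprocess

offset, merged = 0.0, []
for chunk in sorted(glob.glob("*_chunk_*.ogg")):
    data = json.load(open(chunk.replace(".ogg", "_transcript.json")))
    for seg in data.get("segments", []):
        merged.append({"start": seg["start"] + offset,
                       "end": seg["end"] + offset,
                       "text": seg["text"]})
    dur = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "csv=p=0", chunk],
        capture_output=True, text=True,
    ).stdout
    offset += float(dur)
json.dump(merged, open("merged_transcript.json", "w"), indent=2)
```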
#### Backend: Groq API (cheapest cloud)
Same OpenAI-compatible format, different base URL:
```bash
curl -s https://api.groq.com/openai/v1/audio/transcriptions \
-H "Authorization: Bearer $GROQ_API_KEY" \
-H "Content-Type: multipart/form-data" \
-F file="@{input_file}" \
-F model="whisper-large-v3-turbo" \
-F language="en" \
-F response_format="verbose_json" \
-F 'timestamp_granularities[]=word' \
-F 'timestamp_granularities[]=segment' \
> "{output_stem}_groq.json"
```
#### Backend: Deepgram (cloud, native diarization)
```bash
curl -s -X POST "https://api.deepgram.com/v1/listen?model=nova-3&smart_format=true&diarize=true&language=en&utterances=true" \
-H "Authorization: Token $DEEPGRAM_API_KEY" \
-H "Content-Type: audio/wav" \
--data-binary "@{input_file}" \
> "{output_stem}_deepgram.json"
```
#### Backend: AssemblyAI (cloud, native diarization + chapters)
```python
import os

import assemblyai as aai

aai.settings.api_key = os.environ["ASSEMBLYAI_API_KEY"]
config = aai.TranscriptionConfig(
speaker_labels=True,
language_code="en",
auto_chapters=True,
word_boost=["custom", "vocabulary", "terms"],
)
transcript = aai.Transcriber().transcribe("{input_file}", config=config)
for utterance in transcript.utterances:
print(f"Speaker {utterance.speaker}: {utterance.text}")
```
#### Backend: Gemini (flexible, prompt-based)
```python
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-2.5-flash")
audio = genai.upload_file("{input_file}")
response = model.generate_content([
audio,
"""Transcribe this audio verbatim. Format as markdown with:
- Timestamps every ~30 seconds as ### headers (e.g., ### [00:01:30])
- Speaker labels if you can distinguish voices (Speaker 1, Speaker 2, etc.)
- Paragraph breaks at natural topic shifts
Do not summarize or omit anything. Transcribe every word spoken."""
])
print(response.text)
```
Note: Gemini handles up to ~9.5 hours natively (no chunking needed) but timestamps are approximate and output is unstructured text shaped by your prompt.
### Step 5: Format Output
The user's default preference is **markdown-formatted plain text**. Convert the raw backend output to this format.
#### Default: Markdown transcript
```markdown
# Transcript: {filename}
**Date transcribed:** {date}
**Duration:** {duration}
**Backend:** {backend}
**Model:** {model}
---
## [00:00:00]
{text of first segment or group of segments...}
## [00:05:23]
{text continues with periodic timestamp headers...}
---
*Transcribed with {backend} ({model})*
```
**Formatting rules:**
- Insert `## [HH:MM:SS]` timestamp headers every 2-5 minutes (not every segment — that's too noisy)
- Group consecutive segments by the same speaker into paragraphs
- If diarization is available, prefix with bold speaker labels: `**Speaker 1:** text...`
- Use `---` horizontal rules at major topic shifts or long pauses (>10s)
- Keep paragraph breaks at natural sentence boundaries
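A minimal sketch that renders whisper's JSON into this shape (the file name and the 5-minute header interval are assumptions):
```python
# Sketch: whisper JSON -> markdown with periodic timestamp headers.
import json

def hms(t: float) -> str:
    t = int(t)
    return f"{t // 3600:02d}:{t % 3600 // 60:02d}:{t % 60:02d}"

segments = json.load(open("transcript.json"))["segments"]
interval, next_header, out = 300, 0.0, []
for seg in segments:
    if seg["start"] >= next_header:
        out.append(f"\n\n## [{hms(seg['start'])}]\n")
        next_header = seg["start"] + interval
    out.append(seg["text"].strip())
print(" ".join(out).strip())
```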
#### With speaker diarization:
```markdown
# Transcript: {filename}
**Speakers:** 3 detected
**Duration:** 1h 23m
---
## [00:00:00]
**Speaker 1:** Welcome everyone to today's session. We're going to be talking about...
**Speaker 2:** Thanks for having me. I'm excited to share...
## [00:05:12]
**Speaker 1:** Let's dive into the first topic...
```
#### Alternative output formats
If the user requests a different format:
**SRT subtitles:**
```bash
whisper "{input}" --model turbo --output_format srt --output_dir ./
```
**VTT subtitles:**
```bash
whisper "{input}" --model turbo --output_format vtt --output_dir ./
```
**JSON with word timestamps:**
```bash
whisper "{input}" --model turbo --output_format json --word_timestamps True --output_dir ./
```
**Plain text (no timestamps):**
```bash
whisper "{input}" --model turbo --output_format txt --output_dir ./
```
**TSV (tab-separated, for spreadsheets):**
```bash
whisper "{input}" --model turbo --output_format tsv --output_dir ./
```
## Custom Vocabulary / Prompt Hints
Whisper supports an `initial_prompt` that biases the model toward specific terminology:
```bash
whisper "{input}" --model turbo --language en \
--initial_prompt "This conversation discusses Kubernetes, GitLab CI/CD, Terraform, and Infrastructure as Code. Names mentioned: Sid Sijbrandij, David Thompson."
```
For cloud APIs, the equivalent hints:
```bash
# OpenAI - use the prompt parameter
curl ... -F prompt="Technical terms: LLM, RAG, vector database, embeddings. Names: Sid Sijbrandij."
# Deepgram - use keywords
curl ... "https://api.deepgram.com/v1/listen?keywords=Kubernetes:2&keywords=GitLab:2"
```
```python
# AssemblyAI - use word_boost (Python SDK)
config = aai.TranscriptionConfig(
    word_boost=["Kubernetes", "GitLab", "Sijbrandij", "Terraform"],
    boost_param="high",
)
```
**When to use custom vocabulary:**
- Proper nouns (people, companies, products)
- Domain-specific jargon
- Acronyms that might be misheard (e.g., "RAG" vs "rag")
- Non-English words in English speech
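The same hint also works locally with faster-whisper's `initial_prompt` parameter (a sketch; the terms are examples):
```python
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cpu", compute_type="int8")
segments, _ = model.transcribe(
    "input.wav",
    language="en",
    initial_prompt="Kubernetes, GitLab CI/CD, Terraform, Sid Sijbrandij",
)
```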
## Handling Specific Input Types
### Podcast with intro/outro music
Music segments cause hallucinations. Trim them:
```bash
# Skip first 30s (intro music) and last 30s (outro)
TOTAL=$(ffprobe -v error -show_entries format=duration -of csv=p=0 input.mp3 | cut -d. -f1)
END=$((TOTAL - 30))
ffmpeg -i input.mp3 -ss 30 -to $END -ac 1 -ar 16000 -acodec pcm_s16le trimmed.wav
```
### Multi-track recording (separate speaker mics)
If each speaker has their own audio track:
```bash
# Extract each track
ffmpeg -i recording.mkv -map 0:a:0 -ac 1 -ar 16000 speaker1.wav
ffmpeg -i recording.mkv -map 0:a:1 -ac 1 -ar 16000 speaker2.wav
# Transcribe each separately (no diarization needed)
whisper speaker1.wav --model turbo --output_format json
whisper speaker2.wav --model turbo --output_format json
# Then interleave by timestamps in the markdown output
```
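For the interleaving step, a minimal sketch over the two JSON outputs (file names as above):
```python
# Sketch: interleave two per-speaker transcripts by segment start time.
import json

def load(path: str, speaker: str):
    return [(s["start"], speaker, s["text"].strip())
            for s in json.load(open(path))["segments"]]

merged = sorted(load("speaker1.json", "Speaker 1") +
                load("speaker2.json", "Speaker 2"))
for start, speaker, text in merged:
    print(f"**{speaker}:** {text}")
```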
### Stereo recording (L/R = different speakers)
```bash
# Split channels
ffmpeg -i stereo.wav -af "pan=mono|c0=FL" -ar 16000 left.wav
ffmpeg -i stereo.wav -af "pan=mono|c0=FR" -ar 16000 right.wav
# Transcribe each channel as a separate speaker
```
### Video from URL (download + transcribe)
```bash
# Download audio only
yt-dlp -x --audio-format wav -o "%(title)s.%(ext)s" "{url}"
# Then run the standard preprocessing + transcription pipeline
```
## Troubleshooting
### Whisper repeats the same phrase endlessly
The hallucination cascade problem. Fix: use `--condition_on_previous_text False`. If already set, the input likely has long silence or music — preprocess with silence removal.
### Very slow on Apple Silicon
Whisper's Python implementation doesn't use Metal/MPS well. Options:
- Use `whisper.cpp` for Metal GPU acceleration
- Use `--model turbo` instead of `large-v3` (2-3x faster, minimal quality loss)
- Use `--model medium.en` for English-only (4-5x faster)
- Use Groq API (transcribes hours in seconds, nearly free)
### Out of memory
- Use `faster-whisper` with `compute_type="int8"` (halves memory)
- Use a smaller model (`medium` or `small`)
- Chunk the file (see Step 3c)
### pip install fails with "externally-managed-environment"
Modern macOS Python (Homebrew) requires:
```bash
pip3 install --break-system-packages {package}
```
### Word timestamps are inaccurate
Whisper's native word timestamps are approximate. For precise word-level timing, use whisperX which adds forced phoneme alignment.
### Foreign language or accent issues
- Omit `--language` to let whisper auto-detect
- Use `large-v3` (best multilingual model)
- Use `--initial_prompt` with example text in the target language
- For code-switching (multiple languages in one recording), Gemini handles this better than Whisper
## Backend Comparison Quick Reference
| Backend | Install | Diarization | VAD | Speed (1hr file, Apple Silicon) | Cost |
|---------|---------|-------------|-----|-------------------------------|------|
| whisper | `pip install openai-whisper` | No | No | ~20-40 min (turbo) | Free |
| whisperx | `pip install whisperx` | Yes (pyannote) | Yes | ~15-30 min | Free |
| faster-whisper | `pip install faster-whisper` | No | Yes (Silero) | ~10-20 min | Free |
| insanely-fast-whisper | `pip install insanely-fast-whisper` | Experimental | No | ~5-10 min (GPU) | Free |
| whisper.cpp | `brew install whisper-cpp` | Basic | No | ~10-15 min (Metal) | Free |
| OpenAI API | API key | No | N/A | ~1-2 min | $0.006/min |
| Groq API | API key | No | N/A | ~seconds | $0.00004/min |
| Deepgram | API key | Yes (native) | N/A | ~1-2 min | $0.0043/min |
| AssemblyAI | API key | Yes (native) | N/A | ~2-5 min | $0.0062/min |
| Gemini | API key | Prompted | N/A | ~1-3 min | Token-based |