25-voice-clone-podcast-global

Show SKILL.md content (~4.3k tokens)
---
name: 25-voice-clone-podcast-global
description: "Audio AI for global personal brand — voice clone (ElevenLabs, Murf, PlayHT), podcast, audiobook, voiceover. 3 use cases: short voiceover (TikTok/Reels), podcast 30-60min, audiobook. 1:10 repurpose (1 podcast → 10 short clips). English (US/UK/AU/SG accents available). Trigger: 'voice clone global', 'ElevenLabs', 'podcast AI', 'audiobook AI', 'voiceover AI'."
metadata:
  version: 1.0.0
  category: content
license: MIT
triggers:
  - "voice clone global"
  - "ElevenLabs"
  - "podcast AI"
  - "audiobook AI"
  - "voiceover AI"
related:
  - 24-ai-avatar-production-global
  - 26-thought-leadership-content-global
  - references/voice-clone-prompts-global
  - references/ai-video-disclosure-global
---

# Voice Clone & Podcast — Audio AI for Personal Brand (Global)

> **This skill focuses on audio AI** — voice clone, podcast, audiobook, voiceover.
> Pairs with `24-ai-avatar-production-global` (video) — combine both for full content stack coverage.

---

## 1. Newbie Guide

### What is audio AI and how is it different from video AI?

Audio AI is the tech behind synthetic voices that sound nearly human — from a sample of your voice, AI learns and produces a synthetic clone (voice clone). You write text -> AI reads it back (Text-to-Speech).

**Differences vs video AI:**
- Video AI (skill 24): produces video with face + voice -> talking head, social video
- Audio AI (this skill): produces voice only -> podcast, audiobook, voiceover, narration

### When to use audio AI instead of video?

| Situation | Pick audio AI | Pick video AI |
|-----------|---------------|---------------|
| Long-form content (>10 min) | YES — podcast format | NO — too long for video |
| Don't want to be on camera | YES | NO |
| Need volume content fast | YES — 1 podcast = 10 shorts | YES but more expensive |
| Audience listens while driving / at gym | YES | NO |
| Need visuals to demo | NO | YES |
| Personal brand thought leader | YES — podcast = authority | YES — if face brand exists |

### Main tools (international)

- **ElevenLabs:** Best in class for voice clone — top-tier English voices (US/UK/AU/IN), 30+ languages
- **Murf:** 120+ voice library, strong for corporate voiceover, multilingual
- **PlayHT:** API-friendly, instant clone, 800+ voices
- **HeyGen Voice:** Bundles with HeyGen avatars — seamless voice + video pipeline
- **Descript:** AI editing — cut audio by editing text, voice clone (Overdub)
- **Resemble.ai:** Custom emotion control, brand-grade APIs
- **Riverside:** Studio-quality podcast recording with AI Magic Clips repurpose

### Time and cost

| Task | Time | Cost (USD/mo) |
|------|------|---------------|
| Voice clone setup | 30-60 min | $5-22 (ElevenLabs Starter/Pro) |
| 60s voiceover (TikTok) | 5-10 min | $5-22 |
| 30 min solo podcast | 1-2 hrs | $22-99 (ElevenLabs + Riverside) |
| Audiobook chapter (15 min) | 30-45 min | $22-99 |
| 1 podcast -> 10 clips | 1-2 hrs | $0-30 (Descript/Opus Clip) |

### 5 common mistakes

1. **AI voice sounds robotic:** sample too short or monotonic. Fix: re-record 3-5 minutes with varied emotions (happy, serious, sad).
2. **Mispronounced names/jargon:** TTS engines mishandle proper nouns. Fix: use phonetic spelling (e.g., "Anthropic" -> "an-THROW-pic") in the script.
3. **Audio clipping:** levels too hot. Fix: target -3dB peak, -16 LUFS loudness.
4. **Background noise/echo:** untreated room. Fix: small room with curtains and rugs, or apply NVIDIA Broadcast / Krisp / Adobe Enhance Speech.
5. **Boring podcast:** no editing, too many "ums". Fix: Descript auto-removes filler words, add light background music (-25dB).

---

## 2. Information collection

Ask up to 4 questions before starting:

1. **Main use case?** Short voiceover (TikTok/Reels) / Podcast 30-60 min / Audiobook?
2. **Language(s)?** English (US/UK/AU/IN) / multilingual / single non-English?
3. **Total length?** <60s / 5-30 min / 30-60 min / >60 min (audiobook)?
4. **Budget tier?** Free ($0) / Starter ($5-22) / Pro ($22-99) / Business ($99+)?

> Based on the answers, pick the appropriate use case + tool stack.

---

## 3. Voice clone setup

### Sample requirements

| Criterion | Minimum | Optimal |
|-----------|---------|---------|
| Length | 1 min (Free tier) | 3-5 min (Pro tier) |
| Room | Quiet, no echo | Acoustic treatment, rugs, curtains |
| Mic | Phone + headset mic | Condenser mic (AT2020, $80-100) |
| Distance | 20-30 cm | 15-20 cm with pop filter |
| Format | MP3 128 kbps | WAV 44.1 kHz |
| Content | One pre-written passage | Three passages: business / casual / emotional |

> **Full reference:** `references/voice-clone-prompts-global.md` — sample scripts across English variants (US/UK/AU/SG/IN) and 3 topics (business / lifestyle / educational).

### Tool comparison (global)

| Tool | English clone quality | Price/mo | Setup time | Best for |
|------|----------------------|----------|------------|----------|
| **ElevenLabs Pro** | Excellent (10/10) | $22 | 30 min | Multilingual, content creator |
| **HeyGen Voice** | Good (8/10) | Bundled with avatar | 15 min | Combo with video AI |
| **Murf** | Excellent (9/10) | $29-79 | 30 min | Corporate voiceover, e-learning |
| **PlayHT** | Excellent (9.5/10) | $39-99 | 30 min | API-driven, instant clone |
| **Descript Overdub** | Good (8/10) | $24 (Hobbyist) | 30 min | Podcast editing |
| **Resemble.ai** | Excellent (9/10) | $30-99 | 1 hr | Brand custom voice, emotion control |

**Recommendations:**
- **English-only creator:** ElevenLabs Pro ($22) — best balance of quality and price
- **Multilingual creator:** ElevenLabs Pro (30+ languages built in)
- **Combo with video:** HeyGen (single platform — voice + avatar)
- **Brand/agency at scale:** Resemble.ai or PlayHT (API + custom emotion)

### Consent form template

```
VOICE CLONE LICENSE AGREEMENT

I, [Full name], ID/passport: [number], grant [Brand/Company]:
1. Permission to use samples of my voice to create an AI voice clone.
2. Use of the voice clone in [scope: internal / advertising / podcast / etc.].
3. Term: from [DD/MM/YYYY] to [DD/MM/YYYY].
4. Right of withdrawal: I may request deletion of the voice clone at any time
   in writing; the brand has 7 days to fully remove it.
5. Disclosure: the brand commits to disclose "AI-generated voice" wherever
   required by applicable law (FTC, EU AI Act, etc.).

Signed: ____________   Date: ____________
```

---

## 4. Three use cases

### Use case A: Short voiceover for TikTok/Reels (Energetic)

**Spec:**
- Length: 15-60s
- Pace: fast (180-220 wpm) — younger English-speaking audience
- Tone: energetic, slightly higher pitch, exciting
- Audio levels: -14 LUFS (TikTok), peak -1 dB
- CTA: clear in the last 5 seconds

**Script template (30s):**
```
[HOOK 0-3s] "Did you know [shocking stat]?"
[PROBLEM 3-10s] "Most people are still stuck in [wrong loop]"
[SOLUTION 10-22s] "I tried [method], and here are 3 things..."
[PAYOFF 22-27s] "Result: [specific number]"
[CTA 27-30s] "Comment 'YES' to get the full breakdown"
```

**Voice settings (ElevenLabs):**
- Stability: 35-45 (low — allows variation)
- Similarity: 75-85
- Style: 50-65 (boost expressiveness)
- Speaker Boost: ON

### Use case B: Podcast 30-60 min (Conversational)

**Structure:**
- **Intro (1-2 min):** hook + introduce topic + welcome listeners
- **Body (25-50 min):** 3-5 main segments, each 5-10 min
- **Ad slot (optional):** 3-5 min after intro, or mid-body
- **Outro (1-2 min):** recap + CTA + thanks

**Pacing:**
- Conversational pace: 140-160 wpm
- 1-2s pause after important sentences
- Segment transitions: 2-3s pause + audio sting

**Sound design:**
- Background music: -25 to -30 dB (very subtle)
- Stings/transitions: -15 dB, 1-2s
- Voice levels: -16 LUFS (podcast standard), peak -1 dB

**Voice settings (ElevenLabs):**
- Stability: 60-75 (high — consistent across 30+ minutes)
- Similarity: 85-95
- Style: 30-40 (natural, not over-expressive)
- Speaker Boost: ON

### Use case C: Audiobook (Mid-tempo)

**Structure:**
- **Chapter intro:** "Chapter [X]: [Title]" — 2s pause
- **Chapter body:** 10-20 min/chapter, 1s pause between paragraphs
- **Chapter end:** 3s pause before next chapter

**Pacing:**
- Mid-tempo: 150-170 wpm
- Natural breath every 2-3 sentences
- Dialogue: subtle voice shifts per character (fiction)

**Consistency check (most important):**
- Render Chapter 1 and Chapter 5 -> compare voice -> must match 95%+
- If voices drift: re-clone with a longer sample (5+ min)
- Pronunciation guide: build a database of proper nouns + custom phonetics

**Voice settings (ElevenLabs):**
- Stability: 70-85 (very high — consistent for hours)
- Similarity: 90-95
- Style: 20-30 (calm, even)
- Speaker Boost: ON

---

## 5. Tool comparison (global)

| Tool | Price/mo | English quality | Multilingual | Setup | Pros | Cons | Best for |
|------|----------|-----------------|-------------|-------|------|------|----------|
| **ElevenLabs** | $5-99 | 10/10 | 30+ langs | 30 min | Best clone, multilingual | Pricier high tiers | Multilingual creator |
| **HeyGen Voice** | Bundle w/ avatar | 8/10 | 40+ langs | 15 min | Combo with avatar | Voice clone less expressive | Combo with video |
| **Descript** | $24-30 | 9/10 | EN focus | 30 min | Audio editing first | Multilingual weaker | Podcast editing |
| **Riverside** | $19-29 | n/a (recording) | n/a | 5 min | Studio recording | Not TTS | Live podcast |
| **Murf** | $29-79 | 9/10 | 20+ langs | 30 min | 120+ voice library | Voice clone limited tier | Corporate voiceover |
| **PlayHT** | $39-99 | 9.5/10 | 100+ langs | 30 min | Strong API, instant clone | UI dense | Developer/API |
| **Resemble.ai** | $30-99 | 9/10 | 60+ langs | 1 hr | Custom emotion control | Steep learning curve | Brand custom voice |

**Recommended combos 2025-2026:**
- **English solo creator:** ElevenLabs Pro ($22) + Riverside Free + Descript Hobbyist ($24)
- **Multilingual creator:** ElevenLabs Pro ($22) + Riverside Standard ($19) + Descript Pro ($30)
- **Brand/agency:** ElevenLabs Creator ($99) + Resemble.ai + Riverside Pro ($29)

---

## 6. 1-on-1 podcast with an AI co-host

> **Use case:** solo podcaster who wants conversational format but can't find a co-host. AI co-host = a second AI voice that asks questions while you answer.

### Setup — prompt-engineering the AI personality

**Step 1: Define the AI co-host's personality**
```
Name: [AI co-host name]
Personality: curious, asks deep follow-ups, occasionally light humor
Role: asks the host questions, doesn't talk too much
Speaking style: casual, natural, addresses the host by first name
Knowledge level: average — asks questions like a listener would
Catchphrases: "Wow, that's wild." / "What does that mean exactly?" / "Can you go deeper?"
```

**Step 2: Create a separate voice clone for the AI co-host**
- Use a different voice than the host (e.g., woman vs man, or different accent)
- Clone from a consenting friend, or use one of ElevenLabs' built-in voices

**Step 3: Tool stack**
- **ElevenLabs:** generate the AI co-host voice
- **Riverside:** record the host live
- **Descript:** edit + splice in the AI co-host (text-to-audio)

### Q&A script template

```
[INTRO]
Host: Hey everyone, today [AI co-host] and I are diving into...
AI co-host: Hi all, I'm [name]. Today I want to dig into [topic] from [host]'s
            point of view. Let's go!

[BODY — 5-7 Q&A pairs]
AI co-host: [Broad opening question]
Host: [Answers 2-3 minutes]
AI co-host: [Deeper follow-up]
Host: [Answers with a concrete example]
... repeat 5-7 times ...

[OUTRO]
AI co-host: Thanks [host] for sharing. The biggest thing I learned was...
Host: Thanks [AI co-host]. If you have questions, drop them in the comments...
```

**Tip:** pre-write 7-10 AI co-host questions in a doc, record host responses in one go. Then generate AI co-host audio in ElevenLabs and splice in via Descript.

---

## 7. Repurpose pipeline 1:10 (1 podcast -> 10 short clips)

### Workflow overview

```
[1] Record 60-min podcast (Riverside)
        v
[2] Auto-transcript (Descript / Riverside)
        v
[3] Identify hooks (10-15 quotable lines)
        v
[4] Cut 30-60s clips per quote (Opus Clip / Descript)
        v
[5] Add captions (auto-caption)
        v
[6] Distribute across 4 platforms
```

### How to identify hooks

Find moments in the transcript with these traits:
- **Bold statement:** "I think 90% of founders are doing this wrong"
- **Counter-intuitive:** "Raising prices actually grew revenue"
- **Specific number:** "Went from $0 to $1M in 6 months"
- **Personal story:** "On my first startup I lost $200K"
- **Actionable tip:** "3 specific steps you can take today"

> **Target:** 10-15 hooks per 60-min podcast. Pick the 10 best.

### Tool stack

- **Descript:** auto-clip — select sentences, export short clips (free tier 1 hr/month)
- **Opus Clip:** AI auto-finds viral moments + auto-format vertical/horizontal ($19-99)
- **Riverside Magic Clips:** built-in to Riverside Pro ($29)
- **CapCut + ChatGPT:** manual but free — paste transcript, ChatGPT extracts hook candidates

### Distribution across 4 platforms

| Platform | Format | Length | Caption | Bonus |
|----------|--------|--------|---------|-------|
| **TikTok** | 9:16 (1080×1920) | 30-60s | Bold caption on top | Trend audio overlay (low volume) |
| **Instagram Reels** | 9:16 | 15-90s | Clean subtitle, sans-serif font | Strong cover image |
| **YouTube Shorts** | 9:16 | <60s | Auto-caption | Title with target keyword |
| **LinkedIn audio** | 1:1 (square video w/ audio) | 60-120s | Subtitle below | Long-form thread (carousel) |

**Pro tip:** each clip should target one platform with platform-specific captions and cover image. Maximizes reach.

---

## 8. Audio QA + Disclosure

### 5 QA criteria

1. **Clarity (10 pts):** voice clear, no rasp, no stuttering. Test: play on phone speaker, still intelligible.
2. **No clipping (10 pts):** peak below -1 dB. Tools: Audacity, Adobe Audition, Reaper.
3. **No background noise (10 pts):** no fans, traffic, neighbors. Tools: Krisp, NVIDIA Broadcast, Adobe Enhance Speech.
4. **Consistent loudness (10 pts):** stable -16 LUFS (podcast) or -14 LUFS (TikTok). Tool: loudness meter in DAW.
5. **Natural pauses (10 pts):** human-like pacing. Manual review: listen 3 times.

> **Pass:** 40+/50. Below 40 = re-render or re-record.

### Global disclosure — when required

| Situation | Disclosure | Placement |
|-----------|-----------|-----------|
| Commercial advertising | REQUIRED | Caption + end of audio ("This audio uses an AI voice clone") |
| Personal brand podcast | RECOMMENDED — transparency | Episode description |
| Fiction audiobook | OPTIONAL | Optional — credits at end |
| News/educational | REQUIRED | Beginning of audio + caption |
| Internal corporate content | NOT REQUIRED | n/a |

**Disclosure caption template:**
```
This audio uses AI voice cloning technology
(ElevenLabs / Murf / [tool name]). Content was written and reviewed by [Name].
```

> **Full reference:** `references/ai-video-disclosure-global.md` — FTC, EU AI Act, FCC, and OFCOM requirements; 3-tier disclosure framework, situational templates (also applies to audio).

---

## 9. Quality checklist

Before publishing audio:

- [ ] Voice-clone sample 3-5 min, quiet room
- [ ] Signed consent form (if cloning someone else's voice)
- [ ] Use case matches: voiceover (energetic) / podcast (conversational) / audiobook (mid-tempo)
- [ ] Voice settings appropriate for use case (Stability/Similarity/Style)
- [ ] Loudness correct: -14 LUFS (TikTok) / -16 LUFS (podcast/audiobook)
- [ ] Peak below -1 dB (no clipping)
- [ ] No background noise (Krisp/NVIDIA Broadcast pass)
- [ ] Pacing correct: 180-220 wpm (TikTok) / 140-160 wpm (podcast) / 150-170 wpm (audiobook)
- [ ] QA Score 40+/50
- [ ] Disclosure caption (if commercial use)
- [ ] Repurpose plan: 1 podcast -> 10 clips across 4 platforms

---

*Skill 25 (Global) | v1.0.0*
Get 25-voice-clone-podcast-global.

vz-bench-debug

vz-scrape-runner

Think you can beat it?