Spoken vs Whisper + Pyannote: skip the pipeline

The Whisper + Pyannote diarization pipeline is a popular DIY approach to podcast transcription. Spoken is the managed shortcut — for published podcasts, it replaces every step.

By Robert Tomko · Last updated May 2026

The Whisper + Pyannote pipeline is seven steps you have to maintain. Spoken replaces it with one API call for published podcasts. In the DIY version, Whisper transcribes audio, Pyannote runs speaker diarization, and you stitch the two outputs together before running a further pass to map "Speaker 0" and "Speaker 1" to real names. Spoken handles all of that and returns Markdown with real speaker names already attached.

The DIY pipeline

If you've built a podcast transcription pipeline before, this list is familiar:

  1. Locate the audio file — find the .mp3 URL for the episode
  2. Download and convert — pull 50–100 MB, normalize sample rate
  3. Run Whisper — local model or paid API ($0.006/min)
  4. Run Pyannote diarization — separately, to get speaker turns
  5. Align outputs — match Whisper segments to Pyannote speaker labels (timing is rarely perfectly aligned)
  6. Identify speakers — a separate LLM pass to map "Speaker 0" / "Speaker 1" to real names from context
  7. Format and store — convert to Markdown, manage storage and cache invalidation

That's a week of plumbing for the first version, and ongoing maintenance for as long as you run it. Each step has its own failure modes — GPU availability, model versioning, timing misalignment, speaker count guessing.
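To make the alignment step concrete, here is a minimal Python sketch of one common approach: give each transcript segment the speaker whose diarization turn overlaps it the most. The `Segment` and `Turn` types, the `assign_speakers` helper, and the sample timings are all hypothetical and only stand in for whatever your Whisper and Pyannote wrappers actually return; this is not how any particular tool does it internally.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcribed chunk (hypothetical shape for Whisper-style output)."""
    start: float
    end: float
    text: str


@dataclass
class Turn:
    """One diarization turn (hypothetical shape for Pyannote-style output)."""
    start: float
    end: float
    speaker: str


def overlap(a_start, a_end, b_start, b_end):
    """Seconds shared by two intervals; 0.0 when they do not intersect."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns, unknown="UNKNOWN"):
    """Pair each segment with the speaker label that overlaps it the most."""
    labeled = []
    for seg in segments:
        best_label, best_overlap = unknown, 0.0
        for turn in turns:
            shared = overlap(seg.start, seg.end, turn.start, turn.end)
            if shared > best_overlap:
                best_label, best_overlap = turn.speaker, shared
        labeled.append((best_label, seg))
    return labeled


# Made-up timings: the boundaries deliberately disagree by 0.3 s.
segments = [
    Segment(0.0, 4.2, "Welcome back to the show."),
    Segment(4.2, 9.0, "Thanks for having me."),
]
turns = [Turn(0.0, 4.5, "Speaker 0"), Turn(4.5, 9.3, "Speaker 1")]
labeled = assign_speakers(segments, turns)
```

Maximum overlap tolerates small boundary disagreements, but it still mislabels a segment that spans a speaker change, which is one reason this step needs ongoing attention in a real pipeline.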

Side-by-side comparison

| | Whisper + Pyannote (DIY) | Spoken |
|---|---|---|
| Setup | Multi-step pipeline, GPU or paid API | One curl command, no setup |
| Cost per 1-hour episode | ~$0.36 Whisper API + diarization + your time | $0.08–$0.15 flat |
| Speaker names | "Speaker 0", "Speaker 1"; you map them yourself | Real names included |
| Timing alignment | Manual: Whisper and Pyannote outputs aren't aligned by default | Already aligned |
| Latency | Audio length + processing minutes | Under 30 seconds |
| Output | Whatever you build | Markdown with bold names + timestamps |
| Maintenance | Ongoing: model updates, infra, failures | None |
| Works on your own audio | Yes | No |
| Works on published podcasts | Yes, after you fetch and host the audio | Yes, in one API call |
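
The ~$0.36 figure for a 1-hour episode follows directly from the $0.006/min Whisper API rate quoted in the pipeline list. A small Python sketch of that arithmetic, using `Decimal` to avoid floating-point noise in currency values; the function name is illustrative, and the number excludes diarization compute and your own time:

```python
from decimal import Decimal

# Rate and price range as quoted in this article.
WHISPER_API_USD_PER_MIN = Decimal("0.006")
SPOKEN_USD_MAX = Decimal("0.15")  # top of the $0.08-$0.15 range


def whisper_api_cost(minutes: int) -> Decimal:
    """Whisper API transcription cost only, for a whole number of minutes."""
    return WHISPER_API_USD_PER_MIN * Decimal(minutes)


one_hour = whisper_api_cost(60)
print(f"${one_hour:.2f}")  # prints $0.36

# Smallest gap between the two options, before diarization and labor.
min_gap = one_hour - SPOKEN_USD_MAX
```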

When DIY is still the right call

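The Whisper + Pyannote pipeline isn't going away. There are real reasons to run it: you're transcribing your own audio rather than a published podcast, you need full control over the pipeline, or you're running at very high volume.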

When to skip it

For published podcasts, the pipeline is reproducing work that's already been done. The transcripts exist; you just need them in a usable shape. Spoken is the wrapper.

# What the entire pipeline collapses into
curl -H "x-api-key: pt_demo" https://spoken.md/transcripts/{episode_id}
# Returns Markdown with real names. That's the whole job.
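
If you would rather make the same request from Python, a minimal standard-library sketch might look like the following. The URL and the `x-api-key` header are taken from the curl command; the function names, the 30-second timeout, the UTF-8 decoding, and the `abc123` episode id are assumptions made for illustration.

```python
import urllib.parse
import urllib.request

BASE_URL = "https://spoken.md/transcripts/"


def build_transcript_request(episode_id, api_key="pt_demo"):
    """Build the GET request the curl command sends, without sending it."""
    safe_id = urllib.parse.quote(episode_id, safe="")
    return urllib.request.Request(
        BASE_URL + safe_id,
        headers={"x-api-key": api_key},
    )


def fetch_transcript(episode_id, api_key="pt_demo", timeout=30):
    """Send the request and return the body as text (assumes UTF-8)."""
    req = build_transcript_request(episode_id, api_key)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# "abc123" is a placeholder id, not a real episode.
req = build_transcript_request("abc123")
```

Splitting request construction from the network call keeps the part you can unit-test separate from the part that depends on the service being reachable.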

FAQ

Is Spoken using Whisper under the hood?

Spoken serves transcripts that have already been produced and published by the shows themselves, then adds speaker name detection and clean Markdown formatting on top. No on-demand Whisper transcription happens per request.

Why is Spoken cheaper than running Whisper myself?

Spoken doesn't transcribe audio per request. It serves cached, processed transcripts, so you pay for one fetch per episode instead of minutes of GPU or API time plus the work of running and aligning Pyannote.

Can Whisper give me real speaker names?

Not directly. Whisper transcribes words; speaker identification requires a separate diarization model like Pyannote, and even then the output is anonymous ("Speaker 0", "Speaker 1"). Mapping those to real names from context typically requires another LLM call.
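
Once a later step (an LLM call or a manual lookup) has produced a label-to-name mapping, applying it is mechanical. The Python sketch below assumes such a mapping already exists and renders Markdown with bold names and timestamps. The speaker names, the tuple layout, and the line format are hypothetical and are not Spoken's actual output format.

```python
def format_timestamp(seconds):
    """Render seconds as HH:MM:SS, dropping any fractional part."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render_markdown(labeled_segments, names):
    """Swap anonymous labels for names and emit one Markdown line per segment.

    labeled_segments: iterable of (label, start_seconds, text) tuples.
    names: dict from diarization label to real name; labels missing from
    the dict are left as they are.
    """
    lines = []
    for label, start, text in labeled_segments:
        speaker = names.get(label, label)
        lines.append(f"**{speaker}** [{format_timestamp(start)}] {text}")
    return "\n\n".join(lines)


# Hypothetical mapping, e.g. the result of an LLM pass over the transcript.
names = {"Speaker 0": "Alice Host"}
markdown = render_markdown(
    [
        ("Speaker 0", 0.0, "Welcome back to the show."),
        ("Speaker 1", 4.2, "Thanks for having me."),
    ],
    names,
)
```

Falling back to the original label when no name is known keeps the transcript readable even if the identification pass only recognizes some of the speakers.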

What about open-source pipelines like WhisperX or pyannote-whisper?

They're solid options if you need full control or are running at very high volume. For published podcasts at moderate volume, the per-episode cost and zero-maintenance trade-off usually favors Spoken.

Does Spoken work on YouTube podcast videos?

You can search by pasting a Spotify, YouTube, or other podcast app URL — Spoken matches it back to the podcast episode and returns the transcript.

TL;DR: Whisper + Pyannote is fine if you need the control. For published podcasts, Spoken collapses the pipeline into one call — real speaker names, flat per-episode pricing, no infrastructure to maintain.

Try Spoken with no signup — use API key pt_demo on any endpoint.

$0.10 per transcript. Credits never expire.