Spoken vs Whisper + Pyannote: skip the pipeline

The Whisper + Pyannote diarization pipeline is a popular DIY approach to podcast transcription. Spoken is the managed shortcut — for published podcasts, it replaces every step.

Last updated June 2026

The Whisper + Pyannote pipeline is seven steps you have to maintain. Spoken replaces it with one API call for published podcasts. In the DIY version, Whisper transcribes audio, Pyannote runs speaker diarization, and you stitch the outputs together, then run a third pass to map "Speaker 0" and "Speaker 1" to real names. Spoken handles all of that and returns Markdown with real speaker names already attached.

The DIY pipeline

If you've built a podcast transcription pipeline before, this list is familiar:

Locate the audio file — find the .mp3 URL for the episode
Download and convert — pull 50–100 MB, normalize sample rate
Run Whisper — local model or paid API ($0.006/min)
Run Pyannote diarization — separately, to get speaker turns
Align outputs — match Whisper segments to Pyannote speaker labels (timing is rarely perfectly aligned)
Identify speakers — second LLM pass to map "Speaker 0" / "Speaker 1" to real names from context
Format and store — convert to Markdown, manage storage and cache invalidation

That's a week of plumbing for the first version, and ongoing maintenance for as long as you run it. Each step has its own failure modes — GPU availability, model versioning, timing misalignment, speaker count guessing.
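To make the alignment step concrete, here is a minimal sketch of one common approach: give each transcript segment the speaker whose turn overlaps it the most. It assumes you've already flattened the Whisper and Pyannote outputs into simple (start, end, ...) records; the `Segment` and `Turn` classes and the `assign_speakers` helper are illustrative names, not part of either library.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcribed chunk (hypothetical stand-in for a Whisper segment)."""
    start: float
    end: float
    text: str


@dataclass
class Turn:
    """One diarization turn (hypothetical stand-in for a Pyannote turn)."""
    start: float
    end: float
    speaker: str


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Seconds shared by two intervals; 0.0 if they don't intersect."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns, unknown="Unknown"):
    """Label each segment with the speaker whose turn overlaps it most."""
    labeled = []
    for seg in segments:
        best_label, best_overlap = unknown, 0.0
        for turn in turns:
            shared = overlap(seg.start, seg.end, turn.start, turn.end)
            if shared > best_overlap:
                best_label, best_overlap = turn.speaker, shared
        labeled.append((best_label, seg))
    return labeled


# Made-up timings: the boundaries disagree by 0.3 s, as they often do.
segments = [
    Segment(0.0, 4.2, "Welcome back to the show."),
    Segment(4.2, 9.0, "Thanks for having me."),
]
turns = [Turn(0.0, 4.5, "Speaker 0"), Turn(4.5, 9.3, "Speaker 1")]

labeled = assign_speakers(segments, turns)
for label, seg in labeled:
    print(f"{label}: {seg.text}")
```

Maximum overlap is only one heuristic. Segments that straddle a speaker change, or that fall in a gap between turns, still need a policy of their own, which is part of why this step eats time.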

Side-by-side comparison

	Whisper + Pyannote (DIY)	Spoken
Setup	Multi-step pipeline, GPU or paid API	One curl command, no setup
Cost per 1-hour episode	~$0.36 Whisper API + diarization + your time	$0.08–$0.15 flat
Speaker names	"Speaker 0", "Speaker 1" — you map them yourself	Real names included
Timing alignment	Manual — Whisper and Pyannote outputs aren't aligned by default	Already aligned
Latency	Scales with audio length, plus processing time	Under 30 seconds
Output	Whatever you build	Markdown with bold names + timestamps
Maintenance	Ongoing: model updates, infra, failures	None
Works on your own audio	Yes	No
Works on published podcasts	Yes, after you fetch and host the audio	Yes, in one API call
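The ~$0.36 figure in the cost row is just the per-minute Whisper API rate multiplied out over an hour. A quick sketch of that arithmetic, using only the prices quoted on this page (the constant and function names are illustrative):

```python
WHISPER_API_USD_PER_MIN = 0.006   # Whisper API rate quoted in the pipeline list
SPOKEN_USD_RANGE = (0.08, 0.15)   # Spoken per-episode range from the table


def whisper_api_cost(minutes: float) -> float:
    """Transcription-only cost in USD; diarization and your time are extra."""
    return round(minutes * WHISPER_API_USD_PER_MIN, 2)


hour_cost = whisper_api_cost(60)
print(f"Whisper API, 60 min: ${hour_cost:.2f}")
print(f"Spoken, per episode: ${SPOKEN_USD_RANGE[0]:.2f} to ${SPOKEN_USD_RANGE[1]:.2f}")
```

This covers transcription only. Diarization compute, the name-mapping LLM call, and engineering time all sit on top of the DIY number.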

When DIY is still the right call

The Whisper + Pyannote pipeline isn't going away. There are real reasons to run it:

You're transcribing your own audio (meetings, recordings, internal content), not published podcasts
You need full control over the model and parameters (custom vocabulary, fine-tuned models, custom languages)
You're processing audio at a scale where running models locally is cheaper than per-call pricing
You have data residency or compliance requirements that prevent calling external APIs

When to skip it

For published podcasts, the pipeline is reproducing work that's already been done. The transcripts exist; you just need them in a usable shape. Spoken is the wrapper.

# What the entire pipeline collapses into
curl -H "x-api-key: pt_demo" https://spoken.md/transcripts/{episode_id}
# Returns Markdown with real names. That's the whole job.
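If you'd rather call it from Python than curl, the same request might look like the sketch below. The endpoint and the x-api-key header come from the curl command; the helper names, the UTF-8 decoding, and the timeout are assumptions, and the episode ID is a placeholder you supply.

```python
import urllib.parse
import urllib.request

BASE_URL = "https://spoken.md/transcripts/"


def build_request(episode_id: str, api_key: str = "pt_demo") -> urllib.request.Request:
    """Build the GET request without sending it (handy for testing)."""
    url = BASE_URL + urllib.parse.quote(episode_id, safe="")
    return urllib.request.Request(url, headers={"x-api-key": api_key})


def fetch_transcript(episode_id: str, api_key: str = "pt_demo", timeout: float = 30.0) -> str:
    """Send the request and return the response body as text.

    Assumes the body is UTF-8 Markdown. urlopen raises
    urllib.error.HTTPError on a non-2xx response.
    """
    request = build_request(episode_id, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Usage (performs a network call; replace the placeholder with a real ID):
# markdown = fetch_transcript("your-episode-id")
# print(markdown[:500])
```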

FAQ

Is Spoken using Whisper under the hood?

Spoken serves transcripts that have already been produced and published by the shows themselves, then adds speaker name detection and clean Markdown formatting on top. No on-demand Whisper transcription happens per request.

Why is Spoken cheaper than running Whisper myself?

Spoken doesn't transcribe audio per request. It serves cached, processed transcripts. You pay for one fetch per episode, instead of paying for minutes of GPU or API time and then doing the work of running and aligning Pyannote yourself.

Can Whisper give me real speaker names?

Not directly. Whisper transcribes words; speaker identification requires a separate diarization model like Pyannote, and even then the output is anonymous ("Speaker 0", "Speaker 1"). Mapping those to real names from context typically requires another LLM call.
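In a DIY pipeline, that last step usually ends with a label-to-name mapping that you apply to the diarized segments. A minimal sketch, assuming the mapping has already come back from your LLM call; the names, the tuple layout, and the Markdown line format here are all illustrative, not Spoken's actual output format:

```python
def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS (fractions of a second are dropped)."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render_markdown(labeled_segments, names):
    """Swap anonymous labels for names and emit one Markdown paragraph per segment.

    labeled_segments: iterable of (label, start_seconds, text) tuples.
    names: dict such as {"Speaker 0": "Host"}; unmapped labels pass through.
    """
    lines = []
    for label, start, text in labeled_segments:
        name = names.get(label, label)
        lines.append(f"**{name}** [{format_timestamp(start)}] {text}")
    return "\n\n".join(lines)


# Hypothetical mapping, hard-coded here in place of the LLM response.
names = {"Speaker 0": "Host", "Speaker 1": "Guest"}
labeled_segments = [
    ("Speaker 0", 0.0, "Welcome back to the show."),
    ("Speaker 1", 4.2, "Thanks for having me."),
    ("Speaker 2", 3725.0, "Quick question from the audience."),
]
print(render_markdown(labeled_segments, names))
```

Labels missing from the mapping pass through unchanged, so a wrong speaker-count guess shows up in the output instead of being silently dropped.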

What about open-source pipelines like WhisperX or pyannote-whisper?

They're solid options if you need full control or are running at very high volume. For published podcasts at moderate volume, the per-episode cost and zero-maintenance trade-off usually favors Spoken.

Does Spoken work on YouTube podcast videos?

Yes, for videos that correspond to a podcast episode. You can search by pasting a Spotify, YouTube, or other podcast app URL, and Spoken matches it back to the podcast episode and returns the transcript.

TL;DR: Whisper + Pyannote is fine if you need the control. For published podcasts, Spoken collapses the pipeline into one call — real speaker names, flat per-episode pricing, no infrastructure to maintain.

Try Spoken with no signup — use API key pt_demo on any endpoint.

$0.10 per transcript. Credits never expire.