The Whisper + Pyannote diarization pipeline is a popular DIY approach to podcast transcription. Spoken is the managed shortcut — for published podcasts, it replaces every step.
The Whisper + Pyannote pipeline is six steps you have to maintain. Spoken replaces it with one API call for published podcasts. Whisper transcribes audio, Pyannote runs speaker diarization, and you stitch the two outputs together, then run a third pass to map "Speaker 0" and "Speaker 1" to real names. Spoken handles all of that and returns Markdown with real speaker names already attached.
If you've built a podcast transcription pipeline before, this list is familiar:
That's a week of plumbing for the first version, and ongoing maintenance for as long as you run it. Each step has its own failure modes — GPU availability, model versioning, timing misalignment, speaker count guessing.
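To make the timing-alignment step concrete, here is a minimal sketch of one common way to stitch the two outputs together: give each transcribed segment the speaker whose diarization turns overlap it the most. The `Segment` and `Turn` types, the `assign_speakers` helper, and the sample data are hypothetical stand-ins for this illustration, not the actual Whisper or Pyannote output formats.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """A transcribed span of speech (times in seconds). Hypothetical shape."""
    start: float
    end: float
    text: str


@dataclass
class Turn:
    """A diarization turn: who spoke between start and end. Hypothetical shape."""
    start: float
    end: float
    speaker: str


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns, unknown="Speaker ?"):
    """Label each segment with the speaker whose turns overlap it the most."""
    labeled = []
    for seg in segments:
        totals = {}
        for turn in turns:
            shared = overlap(seg.start, seg.end, turn.start, turn.end)
            if shared > 0:
                totals[turn.speaker] = totals.get(turn.speaker, 0.0) + shared
        # Segments that no turn covers (music, silence) get a placeholder label.
        speaker = max(totals, key=totals.get) if totals else unknown
        labeled.append((speaker, seg))
    return labeled


# Made-up sample data, only to exercise the function.
segments = [
    Segment(0.0, 4.0, "Welcome back to the show."),
    Segment(4.2, 9.0, "Thanks for having me."),
    Segment(9.5, 10.0, "(music)"),
]
turns = [
    Turn(0.0, 4.1, "Speaker 0"),
    Turn(4.1, 9.2, "Speaker 1"),
]
labeled = assign_speakers(segments, turns)
for speaker, seg in labeled:
    print(f"{speaker}: {seg.text}")
```

Even this simple version has to decide what to do with segments that no turn covers and with segments that straddle two speakers, which is where much of the maintenance effort tends to go.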
| | Whisper + Pyannote (DIY) | Spoken |
|---|---|---|
| Setup | Multi-step pipeline, GPU or paid API | One curl command, no setup |
| Cost per 1-hour episode | ~$0.36 Whisper API + diarization + your time | $0.08–$0.15 flat |
| Speaker names | "Speaker 0", "Speaker 1" — you map them yourself | Real names included |
| Timing alignment | Manual — Whisper and Pyannote outputs aren't aligned by default | Already aligned |
| Latency | Audio length + processing minutes | Under 30 seconds |
| Output | Whatever you build | Markdown with bold names + timestamps |
| Maintenance | Ongoing: model updates, infra, failures | None |
| Works on your own audio | Yes | No |
| Works on published podcasts | Yes, after you fetch and host the audio | Yes, in one API call |
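As a rough worked example of the cost row, the sketch below multiplies the per-episode figures from the table across a batch of one-hour episodes. The batch size of 100 is an arbitrary assumption, and the DIY figure covers only the Whisper API line item; diarization compute and your own time are not included.

```python
# Per-episode figures taken from the comparison table above.
WHISPER_API_PER_EPISODE = 0.36            # ~$0.36 per 1-hour episode, Whisper API only
SPOKEN_LOW, SPOKEN_HIGH = 0.08, 0.15      # quoted flat per-episode range


def batch_cost(episodes, per_episode):
    """Total cost in dollars for a batch, rounded to cents."""
    return round(episodes * per_episode, 2)


episodes = 100  # assumed batch size, for illustration only
diy_whisper_only = batch_cost(episodes, WHISPER_API_PER_EPISODE)
spoken_low = batch_cost(episodes, SPOKEN_LOW)
spoken_high = batch_cost(episodes, SPOKEN_HIGH)

print(f"DIY (Whisper API only): ${diy_whisper_only:.2f}")
print(f"Spoken: ${spoken_low:.2f} to ${spoken_high:.2f}")
```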
The Whisper + Pyannote pipeline isn't going away. There are real reasons to run it:
For published podcasts, the pipeline is reproducing work that's already been done. The transcripts exist; you just need them in a usable shape. Spoken is the wrapper.
```bash
# What the entire pipeline collapses into
curl -H "x-api-key: pt_demo" https://spoken.md/transcripts/{episode_id}
# Returns Markdown with real names. That's the whole job.
```
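The same request can be made from Python with only the standard library. This is a sketch that mirrors the curl command above: the URL pattern and the `x-api-key` header come from that command, while the helper names, the timeout, and the assumption that the response body is UTF-8 Markdown text are illustrative choices, not documented behaviour.

```python
import urllib.request
from urllib.parse import quote

BASE_URL = "https://spoken.md/transcripts"


def build_request(episode_id, api_key="pt_demo"):
    """Build the GET request for one episode's transcript (hypothetical helper)."""
    url = f"{BASE_URL}/{quote(episode_id, safe='')}"
    return urllib.request.Request(url, headers={"x-api-key": api_key})


def fetch_transcript(episode_id, api_key="pt_demo", timeout=30):
    """Fetch the transcript body, assuming it is returned as UTF-8 Markdown."""
    request = build_request(episode_id, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Usage (performs a network call; replace the placeholder with a real episode id):
# markdown = fetch_transcript("your-episode-id")
# print(markdown[:500])
```

Separating `build_request` from `fetch_transcript` keeps the URL and header logic testable without touching the network.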
Spoken serves transcripts that have already been produced and published by the shows themselves, then adds speaker name detection and clean Markdown formatting on top. No on-demand Whisper transcription happens per request.
Spoken doesn't transcribe audio per request. It serves cached, processed transcripts. The cost is one fetch per episode rather than minutes of GPU or API time, plus the work of running and aligning Pyannote.
Not directly. Whisper transcribes words; speaker identification requires a separate diarization model like Pyannote, and even then the output is anonymous ("Speaker 0", "Speaker 1"). Mapping those to real names from context typically requires another LLM call.
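For illustration, once a label-to-name mapping exists (however it was obtained, for example from an LLM reading the episode context), applying it is the easy part. The sketch below swaps anonymous labels for names and renders Markdown lines with bold names and timestamps. The tuple layout, the names, and the exact line format are assumptions made for this example, not Spoken's actual output format.

```python
def format_timestamp(seconds):
    """Render a time offset in seconds as HH:MM:SS."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render_markdown(labeled_segments, names):
    """Replace diarization labels with real names and emit one paragraph per segment.

    labeled_segments: iterable of (label, start_seconds, text) tuples.
    names: dict mapping labels such as "Speaker 0" to display names.
    Labels missing from the mapping are kept as-is rather than guessed.
    """
    lines = []
    for label, start, text in labeled_segments:
        name = names.get(label, label)
        lines.append(f"**{name}** [{format_timestamp(start)}] {text}")
    return "\n\n".join(lines)


# Made-up data: the names here are placeholders, not real hosts.
labeled_segments = [
    ("Speaker 0", 0.0, "Welcome back to the show."),
    ("Speaker 1", 4.2, "Thanks for having me."),
    ("Speaker 2", 3725.4, "One quick question from the audience."),
]
names = {"Speaker 0": "Alice", "Speaker 1": "Bob"}
print(render_markdown(labeled_segments, names))
```

The hard part that this sketch leaves out is producing the `names` mapping reliably, which is the step that typically needs the extra LLM call.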
They're solid options if you need full control or are running at very high volume. For published podcasts at moderate volume, the per-episode cost and zero-maintenance trade-off usually favors Spoken.
You can search by pasting a Spotify, YouTube, or other podcast app URL — Spoken matches it back to the podcast episode and returns the transcript.
TL;DR: Whisper + Pyannote is fine if you need the control. For published podcasts, Spoken collapses the pipeline into one call — real speaker names, flat per-episode pricing, no infrastructure to maintain.
Try Spoken with no signup — use API key pt_demo on any endpoint.
$0.10 per transcript. Credits never expire.