Spoken vs transcription APIs (Deepgram, AssemblyAI, Whisper)

Transcript retrieval and speech-to-text are two different categories. Use the right one for the job and pay a fraction of the cost.

By Robert Tomko · Last updated May 2026

Spoken is a podcast transcript retrieval API, not a speech-to-text service. Where Deepgram, AssemblyAI, and Whisper transcribe raw audio files you provide, Spoken returns the existing transcript for any published podcast episode as clean Markdown with real speaker names. The two categories solve different problems: transcription APIs work on your own audio; transcript retrieval works on the published podcast catalog.

Two categories, two cost models

| | Transcription API | Spoken |
| --- | --- | --- |
| Input | An audio file you upload (.mp3, .wav) | Episode ID or search query |
| Output | JSON transcript with word timings | Markdown with speaker names + timestamps |
| Speaker detection | Diarization labels: "Speaker 1", "Speaker 2" | Real names: "Andrew Huberman", "Lex Fridman" |
| Cost per podcast episode | ~$0.40–$1.50/hr + your pipeline overhead | $0.08–$0.15 flat, regardless of length |
| Time to first result | Audio length + processing minutes | Under 30 seconds |
| Best for | Your own audio: meetings, calls, recordings | Existing published podcasts |
| Examples | Deepgram, AssemblyAI, Whisper, Rev.ai, Speechmatics | Spoken |

When to use Spoken

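Pick Spoken when the audio is an existing published podcast episode.

Use a transcription API when the audio is your own: meetings, calls, recordings.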

Why this matters for AI agents

If an agent's job is "fetch the transcript of yesterday's All-In Podcast episode," the most efficient path is one call to Spoken. Routing the same task through a transcription API means locating the audio file, downloading it (often 50–100 MB), uploading it to the speech-to-text service, waiting for processing, and post-processing the diarization output to attach real names. That route costs 5–10x more, for an episode whose transcript already exists.

That's not a knock on transcription APIs; they are simply the wrong tool for retrieval. The equivalent would be an agent calling an OCR service to "read" a webpage instead of fetching the HTML.

What about cost?

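Going by the per-episode figures in the comparison table, a one-hour podcast typically costs ~$0.40–$1.50 plus your pipeline overhead to transcribe end-to-end with a transcription API, versus $0.08–$0.15 flat to retrieve from Spoken.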

For published podcasts, that's typically a 5–10x cost reduction once you account for the diarization-and-naming work the others don't do.
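The 5–10x figure can be checked against the per-episode ranges quoted in the comparison table. The sketch below is illustrative arithmetic only: it assumes the transcription side is billed per audio hour, ignores pipeline overhead, and pairs the low end of one range with the low end of the other (and high with high). The constant and function names are hypothetical, not part of any API. Prices are kept in integer cents to avoid floating-point rounding.

```python
# Illustrative arithmetic only; names are hypothetical, prices are the table's ranges.
TRANSCRIPTION_CENTS_PER_HOUR = (40, 150)  # ~$0.40–$1.50 per audio hour
SPOKEN_CENTS_PER_EPISODE = (8, 15)        # $0.08–$0.15 flat, regardless of length


def cost_ratio_range(episode_hours: float) -> tuple[float, float]:
    """Transcription cost divided by Spoken cost, pairing low with low and high with high."""
    low = TRANSCRIPTION_CENTS_PER_HOUR[0] * episode_hours / SPOKEN_CENTS_PER_EPISODE[0]
    high = TRANSCRIPTION_CENTS_PER_HOUR[1] * episode_hours / SPOKEN_CENTS_PER_EPISODE[1]
    return low, high


low, high = cost_ratio_range(1.0)
print(f"one-hour episode: {low:.0f}x to {high:.0f}x")
```

For a one-hour episode this gives 5x to 10x. Because one side scales with audio length and the other is flat, the ratio grows for longer episodes; pairing the ranges differently (cheapest transcription against the top Spoken price, say) gives a wider spread.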

Quick demo

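# Search and fetch: two calls, no audio file involved
# Replace {id} with the ID of the episode you want; URLs are quoted so the shell leaves ? and {} alone
curl -H "x-api-key: pt_demo" "https://spoken.md/search?q=huberman+sleep"
curl -H "x-api-key: pt_demo" "https://spoken.md/transcripts/{id}"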

The response is text/markdown, with real names in bold and a timestamp on each turn. There is no diarization output to parse and no audio to download.
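For an agent written in Python, the same two calls might look like the sketch below. It uses only the standard library and the two endpoints and header shown in the curl demo. The helper names are hypothetical, and since the shape of the search response is not described on this page, pulling an episode ID out of it is deliberately left out.

```python
import urllib.parse
import urllib.request

BASE_URL = "https://spoken.md"
API_KEY = "pt_demo"  # demo key quoted on this page


def search_request(query: str) -> urllib.request.Request:
    """Build the GET /search request; urlencode turns spaces into '+'."""
    qs = urllib.parse.urlencode({"q": query})
    return urllib.request.Request(f"{BASE_URL}/search?{qs}", headers={"x-api-key": API_KEY})


def transcript_request(episode_id: str) -> urllib.request.Request:
    """Build the GET /transcripts/{id} request for a single episode."""
    safe_id = urllib.parse.quote(episode_id, safe="")  # escape any reserved characters
    return urllib.request.Request(f"{BASE_URL}/transcripts/{safe_id}", headers={"x-api-key": API_KEY})


def fetch_text(request: urllib.request.Request, timeout: float = 30.0) -> str:
    """Send the request and return the response body as text (performs a network call)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)
```

A call such as `fetch_text(search_request("huberman sleep"))` would return the raw search response, and `fetch_text(transcript_request(episode_id))` the transcript Markdown for an ID taken from it.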

FAQ

Is Spoken a transcription API?

No. Spoken returns the existing transcript for a podcast episode. Transcription APIs like Deepgram, AssemblyAI, and Whisper take an audio file you provide and produce a transcript from scratch.

Can I send my own audio to Spoken?

No. Spoken only works on published podcast episodes. For your own audio — meetings, calls, recordings — use a speech-to-text service like Deepgram or AssemblyAI.

Why is Spoken cheaper per podcast than Deepgram or AssemblyAI?

Spoken doesn't transcribe audio on demand. It serves transcripts that have already been produced and processed, so the cost is one fetch per episode rather than per minute of compute. It also includes real speaker names, which transcription APIs don't provide.

How does Spoken know the real speaker names?

Spoken analyses the transcript for name mentions in context — host introductions, guest references, and dialogue cues — and labels each speaker turn with the actual person's name. When names cannot be inferred from context, generic labels like "Host" or "Guest" are used.

Can I use Spoken alongside a transcription API?

Yes. Many agents do both: Spoken for published podcasts, Deepgram or AssemblyAI for user-uploaded audio. They solve different parts of the problem.

What format does Spoken return?

Clean Markdown, served with Content-Type: text/markdown; charset=utf-8, with speaker names in bold and a timestamp on each turn. A typical one-hour episode produces 8,000–15,000 tokens, which is sized to fit in most LLM context windows in a single call.
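If an agent needs structured turns rather than raw Markdown, a small parser is usually enough. The sketch below is hedged heavily: this page says only that names are bold and each turn has a timestamp, so the exact line layout used here (`**Speaker Name** [HH:MM:SS] text`) is an assumption, and `Turn` and `parse_turns` are hypothetical names. Check a real response and adjust the pattern before relying on it.

```python
import re
from dataclasses import dataclass

# ASSUMED turn layout, not confirmed by this page: **Speaker Name** [HH:MM:SS] text
TURN_RE = re.compile(
    r"^\*\*(?P<speaker>[^*]+)\*\*\s*\[(?P<timestamp>\d{1,2}:\d{2}(?::\d{2})?)\]\s*:?\s*(?P<text>.*)$"
)


@dataclass
class Turn:
    speaker: str
    timestamp: str
    text: str


def parse_turns(markdown: str) -> list[Turn]:
    """Split transcript Markdown into speaker turns.

    Lines that do not start a new turn are appended to the previous turn;
    anything before the first turn (a title, for example) is skipped.
    """
    turns: list[Turn] = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line:
            continue
        match = TURN_RE.match(line)
        if match:
            turns.append(Turn(match["speaker"].strip(), match["timestamp"], match["text"]))
        elif turns:
            turns[-1].text = f"{turns[-1].text} {line}".strip()
    return turns
```

Keeping the layout in a single regular expression means only `TURN_RE` needs to change if the real format differs from the assumed one.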

TL;DR: If you're working with podcasts, Spoken is the right primitive — cheaper, faster, with speaker names already attached. If you're working with your own audio, use Deepgram, AssemblyAI, or Whisper. They're complementary tools for different jobs.

Try Spoken with no signup — use API key pt_demo on any endpoint.

$0.10 per transcript. Credits never expire. Errors are never charged.