Spoken vs transcription APIs (Deepgram, AssemblyAI, Whisper)

Transcript retrieval and speech-to-text are two different categories. Use the right one for the job and pay a fraction of the cost.

Last updated June 2026

Spoken is a podcast transcript retrieval API, not a speech-to-text service. Where Deepgram, AssemblyAI, and Whisper transcribe raw audio files you provide, Spoken returns the existing transcript for any published podcast episode as clean Markdown with real speaker names. The two categories solve different problems: transcription APIs work on your own audio; transcript retrieval works on the published podcast catalog.

Two categories, two cost models

	Transcription API	Spoken
Input	An audio file you upload (.mp3, .wav)	Episode ID or search query
Output	JSON transcript with word timings	Markdown with speaker names + timestamps
Speaker detection	Diarization labels: "Speaker 1", "Speaker 2"	Real names: "Andrew Huberman", "Lex Fridman"
Cost per podcast episode	~$0.40–$1.50/hr + your pipeline overhead	$0.08–$0.15 flat, regardless of length
Time to first result	Audio length + processing minutes	Under 30 seconds
Best for	Your own audio: meetings, calls, recordings	Existing published podcasts
Examples	Deepgram, AssemblyAI, Whisper, Rev.ai, Speechmatics	Spoken

When to use Spoken

Pick Spoken when

You need any published podcast episode as Markdown for an LLM context window
You're building a podcast summarizer, RAG pipeline, or research agent
You don't want to host audio files or run a transcription pipeline
You need speaker names that are correct out of the box
You want a flat per-episode cost, not per-minute billing

Use a transcription API when

You have your own audio: meetings, interviews, internal recordings
The content isn't a public podcast
You need real-time or streaming transcription
You need word-level timestamps for editing tools
You're transcribing video calls or live events
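The two checklists above can be read as a small routing rule for an agent. The sketch below is one hypothetical way to encode it; the `AudioTask` fields and the `choose_tool` function are illustrative names, not part of any API described here. It assumes that a need for streaming or word-level timestamps overrides everything else, since those appear only in the transcription API list.

```python
from dataclasses import dataclass


@dataclass
class AudioTask:
    """Hypothetical description of a transcript request an agent receives."""
    is_published_podcast: bool
    needs_streaming: bool = False
    needs_word_timestamps: bool = False


def choose_tool(task: AudioTask) -> str:
    """Return "spoken" or "transcription_api" using the criteria listed above."""
    # Streaming and word-level timing appear only in the transcription API list,
    # so this sketch lets them win even for a published podcast (an assumption).
    if task.needs_streaming or task.needs_word_timestamps:
        return "transcription_api"
    # A published episode already has a transcript to retrieve.
    if task.is_published_podcast:
        return "spoken"
    # Own audio (meetings, calls, recordings) has nothing to retrieve.
    return "transcription_api"
```

For example, `choose_tool(AudioTask(is_published_podcast=True))` selects Spoken, while the same task with `needs_word_timestamps=True` falls through to a transcription API.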

Why this matters for AI agents

If an agent's job is "fetch the transcript of yesterday's All-In Podcast episode," the most efficient path is one call to Spoken. Routing the same task through a transcription API means locating the audio file, downloading it (often 50–100 MB), uploading it to the speech-to-text service, waiting for processing, post-processing the diarization output to attach real names, and paying 5–10x more — for an episode whose transcript already exists.

That's not a knock on transcription APIs; they are simply the wrong tool for retrieval. The agent equivalent is calling an OCR service to "read" a webpage instead of fetching the HTML.

What about cost?

A one-hour podcast typically costs the following to transcribe end-to-end:

Deepgram Nova-2 with diarization: ~$0.46/hr for transcription, plus your own audio retrieval and storage. Real speaker names still require an LLM pass on top.
AssemblyAI Universal with diarization: ~$0.17/hr base, plus the same audio handling and post-processing.
OpenAI Whisper API: ~$0.36/hr ($0.006/min), no built-in diarization — you bolt on Pyannote or a second model.
Spoken: $0.08–$0.15 per episode, real speaker names included, no audio pipeline.

For published podcasts, that's typically a 5–10x cost reduction once you account for the diarization-and-naming work the others don't do.
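To make the two cost models concrete, here is a minimal sketch that applies the list prices quoted above to episodes of different lengths. It models the transcription charge only; audio retrieval, storage, and the speaker-naming pass are left out because no figures are given for them here. The function names are illustrative.

```python
# Per-hour list prices quoted above (USD). Transcription only, no pipeline overhead.
HOURLY_RATES = {
    "Deepgram Nova-2": 0.46,
    "AssemblyAI Universal": 0.17,
    "OpenAI Whisper API": 0.006 * 60,  # quoted as $0.006/min
}
SPOKEN_FLAT_RANGE = (0.08, 0.15)  # per episode, regardless of length


def transcription_cost(rate_per_hour: float, minutes: float) -> float:
    """Cost of transcribing `minutes` of audio at a per-hour rate, rounded to cents."""
    return round(rate_per_hour * minutes / 60, 2)


def compare(minutes: float) -> dict[str, float]:
    """Map each per-hour service to its cost for one episode of the given length."""
    return {name: transcription_cost(rate, minutes) for name, rate in HOURLY_RATES.items()}


if __name__ == "__main__":
    for minutes in (60, 180):
        print(f"{minutes}-minute episode")
        for name, cost in compare(minutes).items():
            print(f"  {name}: ${cost:.2f}")
        low, high = SPOKEN_FLAT_RANGE
        print(f"  Spoken: ${low:.2f} to ${high:.2f} (flat)")
```

Per-hour pricing scales with episode length (a three-hour episode at the Deepgram rate comes to $1.38), while the flat per-episode price stays the same.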

Quick demo

# Search and fetch: two calls, no audio file involved
# URLs are quoted so the shell does not interpret "?"; replace {id} with the episode ID
curl -H "x-api-key: pt_demo" "https://spoken.md/search?q=huberman+sleep"
curl -H "x-api-key: pt_demo" "https://spoken.md/transcripts/{id}"

Response is text/markdown with real names in bold and timestamps per turn. No diarization output to parse, no audio to download.
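The same two calls can be sketched in Python with only the standard library. The endpoints, header name, and demo key come from the curl example; the helper names are hypothetical, and the shape of the search response is not documented here, so this sketch returns the raw body instead of parsing it.

```python
import urllib.parse
import urllib.request

BASE_URL = "https://spoken.md"
API_KEY = "pt_demo"  # demo key from the text


def search_request(query: str) -> urllib.request.Request:
    """Build the GET request for /search with a URL-encoded query string."""
    qs = urllib.parse.urlencode({"q": query})
    return urllib.request.Request(f"{BASE_URL}/search?{qs}", headers={"x-api-key": API_KEY})


def transcript_request(episode_id: str) -> urllib.request.Request:
    """Build the GET request for /transcripts/{id}, escaping the ID for use in a path."""
    safe_id = urllib.parse.quote(episode_id, safe="")
    return urllib.request.Request(
        f"{BASE_URL}/transcripts/{safe_id}", headers={"x-api-key": API_KEY}
    )


def send(request: urllib.request.Request, timeout: float = 30.0) -> str:
    """Send a prepared request and return the response body decoded as UTF-8."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

A call such as `send(transcript_request(episode_id))` would return the Markdown transcript as a string, where `episode_id` is whatever identifier the search step yields.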

FAQ

Is Spoken a transcription API?

No. Spoken returns the existing transcript for a podcast episode. Transcription APIs like Deepgram, AssemblyAI, and Whisper take an audio file you provide and produce a transcript from scratch.

Can I send my own audio to Spoken?

No. Spoken only works on published podcast episodes. For your own audio — meetings, calls, recordings — use a speech-to-text service like Deepgram or AssemblyAI.

Why is Spoken cheaper per podcast than Deepgram or AssemblyAI?

Spoken doesn't transcribe audio on demand. It serves transcripts that have already been produced and processed, so the cost is one fetch per episode rather than per minute of compute. It also includes real speaker names, which transcription APIs don't provide.

How does Spoken know the real speaker names?

Spoken analyses the transcript for name mentions in context — host introductions, guest references, and dialogue cues — and labels each speaker turn with the actual person's name. When names cannot be inferred from context, generic labels like "Host" or "Guest" are used.
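As a rough illustration of the idea (not Spoken's actual method, which is not specified beyond the description above), the toy sketch below maps diarization labels to names using a single cue: self-introductions such as "I'm Jane Doe". The regex, the function name, and the choice of "Host" for the first unnamed speaker are all assumptions made for this example.

```python
import re

# Toy cue: a self-introduction followed by a capitalized multi-word name.
SELF_INTRO = re.compile(
    r"\b(?:I'm|I am|[Mm]y name is)\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)"
)


def name_speakers(turns: list[tuple[str, str]]) -> dict[str, str]:
    """Map diarization labels to names found in self-introductions.

    `turns` is a list of (label, text) pairs in spoken order. Labels with no
    introduction fall back to "Host" for the first speaker and "Guest" otherwise.
    """
    names: dict[str, str] = {}
    order: list[str] = []
    for label, text in turns:
        if label not in order:
            order.append(label)  # remember first-appearance order
        match = SELF_INTRO.search(text)
        if match and label not in names:
            names[label] = match.group(1)
    return {
        label: names.get(label, "Host" if index == 0 else "Guest")
        for index, label in enumerate(order)
    }
```

A production approach would need many more cues (guest references, dialogue context) and would handle names that this simple pattern misses.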

Can I use Spoken alongside a transcription API?

Yes. Many agents do both: Spoken for published podcasts, Deepgram or AssemblyAI for user-uploaded audio. They solve different parts of the problem.

What format does Spoken return?

Clean Markdown, Content-Type: text/markdown; charset=utf-8, with speaker names in bold and timestamps per turn. A typical one-hour episode produces 8,000–15,000 tokens — sized to fit in most LLM context windows in a single call.
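Because speaker names arrive in bold, a consumer can pull them out with a small regex. The sketch below assumes each turn starts with the name written as `**Name**` at the beginning of a line; the exact line layout is not specified here, and the sample transcript is invented for illustration.

```python
import re
from collections import Counter

# Assumption: each turn begins with a bolded name written as **Name**.
BOLD_AT_LINE_START = re.compile(r"^\*\*(.+?)\*\*", re.MULTILINE)


def speaker_turns(markdown: str) -> Counter:
    """Count how many turns each bolded speaker name opens."""
    return Counter(m.group(1).strip() for m in BOLD_AT_LINE_START.finditer(markdown))


# Invented sample; a real response may place timestamps differently.
SAMPLE = """\
**Host Name** [00:00:05] Welcome to the show.

**Guest Name** [00:00:12] Thanks for having me.

**Host Name** [00:00:20] Let's get started.
"""

if __name__ == "__main__":
    for name, count in speaker_turns(SAMPLE).items():
        print(f"{name}: {count} turn(s)")
```

Bold text that appears mid-line is ignored by this pattern, which keeps emphasized words inside a turn from being counted as speakers.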

TL;DR: If you're working with podcasts, Spoken is the right primitive — cheaper, faster, with speaker names already attached. If you're working with your own audio, use Deepgram, AssemblyAI, or Whisper. They're complementary tools for different jobs.

Try Spoken with no signup — use API key pt_demo on any endpoint.

$0.10 per transcript. Credits never expire. Errors are never charged.