Transcript retrieval and speech-to-text are two different categories. Use the right one for the job and pay a fraction of the cost.
Spoken is a podcast transcript retrieval API, not a speech-to-text service. Where Deepgram, AssemblyAI, and Whisper transcribe raw audio files you provide, Spoken returns the existing transcript for any published podcast episode as clean Markdown with real speaker names. The two categories solve different problems: transcription APIs work on your own audio; transcript retrieval works on the published podcast catalog.
| | Transcription API | Spoken |
|---|---|---|
| Input | An audio file you upload (.mp3, .wav) | Episode ID or search query |
| Output | JSON transcript with word timings | Markdown with speaker names + timestamps |
| Speaker detection | Diarization labels: "Speaker 1", "Speaker 2" | Real names: "Andrew Huberman", "Lex Fridman" |
| Cost per podcast episode | ~$0.40–$1.50/hr + your pipeline overhead | $0.08–$0.15 flat, regardless of length |
| Time to first result | Audio length + processing minutes | Under 30 seconds |
| Best for | Your own audio: meetings, calls, recordings | Existing published podcasts |
| Examples | Deepgram, AssemblyAI, Whisper, Rev.ai, Speechmatics | Spoken |
If an agent's job is "fetch the transcript of yesterday's All-In Podcast episode," the most efficient path is one call to Spoken. Routing the same task through a transcription API means locating the audio file, downloading it (often 50–100 MB), uploading it to the speech-to-text service, waiting for processing, post-processing the diarization output to attach real names, and paying 5–10x more — for an episode whose transcript already exists.
That's not a knock on transcription APIs. They're simply the wrong tool for retrieval. The agent equivalent is calling an OCR service to "read" a webpage instead of fetching the HTML.
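By the figures in the table above, transcribing a one-hour podcast end-to-end costs roughly $0.40–$1.50 plus pipeline overhead through a transcription API, against $0.08–$0.15 flat through Spoken.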
For published podcasts, that's typically a 5–10x cost reduction once you account for the diarization-and-naming work the others don't do.
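The arithmetic behind that multiple can be sketched in a few lines of Python. This is only an illustration using the price ranges from the comparison table. The function name `cost_ratio` is made up for this example, and pipeline overhead is left out.

```python
def cost_ratio(stt_rate_per_hour: float, episode_hours: float, flat_fee: float) -> float:
    """Return how many times more a per-hour transcription bill is than a flat fee."""
    if flat_fee <= 0:
        raise ValueError("flat_fee must be positive")
    return (stt_rate_per_hour * episode_hours) / flat_fee


# Figures from the comparison table; pipeline overhead is not included.
low = cost_ratio(0.40, 1.0, 0.08)   # low end of both ranges
high = cost_ratio(1.50, 1.0, 0.15)  # high end of both ranges
print(f"one-hour episode: roughly {low:.0f}x to {high:.0f}x")
```

Pairing the low ends and the high ends of the two ranges gives about 5x and 10x for a one-hour episode. Because the retrieval fee is flat, the ratio grows with episode length.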
# Search and fetch — two calls, no audio file involved
curl -H "x-api-key: pt_demo" "https://spoken.md/search?q=huberman+sleep"
curl -H "x-api-key: pt_demo" "https://spoken.md/transcripts/{id}"
Response is text/markdown with real names in bold and timestamps per turn. No diarization output to parse, no audio to download.
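The same two calls can be sketched in Python using only the standard library. The host, paths, header name, and demo key come from the curl example. The helper names (`build_request`, `search_request`, `transcript_request`, `fetch_text`) are hypothetical. The shape of the search response isn't described here, so the sketch doesn't try to parse it.

```python
import urllib.parse
import urllib.request
from typing import Dict, Optional

BASE_URL = "https://spoken.md"  # host from the curl example
API_KEY = "pt_demo"             # demo key from the curl example


def build_request(path: str, params: Optional[Dict[str, str]] = None) -> urllib.request.Request:
    """Build a GET request carrying the x-api-key header, without sending it."""
    url = BASE_URL + path
    if params:
        url += "?" + urllib.parse.urlencode(params)
    return urllib.request.Request(url, headers={"x-api-key": API_KEY})


def search_request(query: str) -> urllib.request.Request:
    """Request for the search endpoint; spaces in the query become '+'."""
    return build_request("/search", {"q": query})


def transcript_request(episode_id: str) -> urllib.request.Request:
    """Request for one transcript; the ID is percent-encoded so it cannot alter the path."""
    return build_request("/transcripts/" + urllib.parse.quote(episode_id, safe=""))


def fetch_text(request: urllib.request.Request, timeout: float = 30.0) -> str:
    """Send the request and return the response body decoded as UTF-8."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_text(search_request("huberman sleep"))` performs the first call. Passing an episode ID taken from that result to `transcript_request` and then to `fetch_text` should return the Markdown transcript.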
No. Spoken returns the existing transcript for a podcast episode. Transcription APIs like Deepgram, AssemblyAI, and Whisper take an audio file you provide and produce a transcript from scratch.
No. Spoken only works on published podcast episodes. For your own audio — meetings, calls, recordings — use a speech-to-text service like Deepgram or AssemblyAI.
Spoken doesn't transcribe audio on demand. It serves transcripts that have already been produced and processed, so the cost is one fetch per episode rather than per minute of compute. It also includes real speaker names, which transcription APIs don't provide.
Spoken analyses the transcript for name mentions in context — host introductions, guest references, and dialogue cues — and labels each speaker turn with the actual person's name. When names cannot be inferred from context, generic labels like "Host" or "Guest" are used.
Yes. Many agents do both: Spoken for published podcasts, Deepgram or AssemblyAI for user-uploaded audio. They solve different parts of the problem.
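One way an agent might route between the two categories is sketched below. The `TranscriptJob` type, its field names, and the backend labels are all hypothetical and exist only to show the decision. A real agent would call Spoken or a speech-to-text client where this sketch returns a label.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TranscriptJob:
    episode_id: Optional[str] = None  # set when the target is a published podcast episode
    audio_path: Optional[str] = None  # set when the user supplied their own audio file


def choose_backend(job: TranscriptJob) -> str:
    """Pick retrieval for published episodes and speech-to-text for user audio."""
    if job.episode_id:
        return "retrieval"        # transcript already exists: fetch it
    if job.audio_path:
        return "speech_to_text"   # raw audio: it has to be transcribed
    raise ValueError("job needs either an episode_id or an audio_path")
```

Checking the episode ID first means a job that carries both fields takes the cheaper retrieval path.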
Clean Markdown, Content-Type: text/markdown; charset=utf-8, with speaker names in bold and timestamps per turn. A typical one-hour episode produces 8,000–15,000 tokens — sized to fit in most LLM context windows in a single call.
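Because speaker names arrive in bold, a consumer can pull them out with a small regular expression. The sketch below assumes each turn begins a line with `**Name**`. The exact line layout isn't specified here, so treat the pattern and the sample text as illustrative rather than as the real response format.

```python
import re
from collections import Counter

# Assumption: every speaker turn starts a line with the name in bold (**Name**).
TURN_START = re.compile(r"^\*\*(.+?)\*\*", re.MULTILINE)


def count_turns(markdown: str) -> Counter:
    """Count speaker turns per name, tolerating a trailing colon inside the bold span."""
    return Counter(
        match.group(1).strip().rstrip(":") for match in TURN_START.finditer(markdown)
    )


# Hypothetical sample, not real API output.
sample = (
    "**Lex Fridman** [00:00:05] Welcome to the show.\n"
    "**Andrew Huberman** [00:00:09] Thanks for having me.\n"
    "**Lex Fridman** [00:00:12] Let's talk about sleep.\n"
)
turns = count_turns(sample)
print(sorted(turns.items()))
```

The same pattern also picks up fallback labels such as "Host" or "Guest", since they would appear in the same position as real names.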
TL;DR: If you're working with podcasts, Spoken is the right primitive — cheaper, faster, with speaker names already attached. If you're working with your own audio, use Deepgram, AssemblyAI, or Whisper. They're complementary tools for different jobs.
Try Spoken with no signup — use API key pt_demo on any endpoint.
$0.10 per transcript. Credits never expire. Errors are never charged.