Spoken vs AssemblyAI for podcasts

AssemblyAI transcribes audio at $0.17/hr with diarization. Spoken returns published podcasts as Markdown with real speaker names for a flat $0.08–$0.15 per episode.

By Robert Tomko · Last updated May 2026

AssemblyAI is a speech-to-text API. Spoken is a podcast transcript retrieval API. AssemblyAI Universal transcribes audio at roughly $0.17 per hour with diarization included, but outputs anonymous speaker labels (A, B, C). For published podcasts, Spoken returns the existing transcript with real speaker names (Andrew Huberman, Tim Ferriss) for $0.08–$0.15 flat per episode — no audio file, no upload, no post-processing.

Side-by-side comparison

| | AssemblyAI Universal | Spoken |
|---|---|---|
| Category | Speech-to-text API | Transcript retrieval API |
| Input | Audio file URL or upload | Search query or episode ID |
| Speaker labels | "A", "B", "C" (97% diarization accuracy) | Real names: "Andrew Huberman", "Tim Ferriss" |
| Cost per podcast hour | ~$0.17/hr + audio handling + LLM pass for naming | $0.08–$0.15 per episode, names included |
| Output format | JSON with utterances + word timings | Markdown with speaker bold + timestamps |
| Bundled features | Sentiment, PII redaction, summarization, topic detection | Just the transcript, optimized for LLM context |
| Works on your own audio | Yes | No |
| Works on published podcasts | Yes, after you fetch and host the audio | Yes, in one API call |

Where AssemblyAI wins

AssemblyAI is a strong choice when you need more than a transcript. The "Speech Understanding" model bundles sentiment analysis, PII detection, summarization, and topic detection in one call — useful for compliance-heavy or analytics-heavy workloads. It also offers excellent diarization quality (a 2.9% speaker count error rate per their published benchmarks).

For raw audio you already own — meetings, calls, recordings — AssemblyAI is one of the best STT APIs available, especially if you'd otherwise be stitching together multiple analysis passes.

Where Spoken wins

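For podcasts that have already been published, the full AssemblyAI pipeline is overkill: you fetch and host the audio, transcribe it, and then run an LLM pass to map the "A/B/C" speaker labels to real names.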

Spoken collapses all of that into a single call that returns Markdown with real names already attached.

Pick the right tool

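Pick AssemblyAI for

Audio you already own (meetings, calls, recordings), and workloads that need bundled sentiment, PII redaction, summarization, or topic detection in the same call.

Pick Spoken for

Published podcasts where you want the transcript as Markdown, with real speaker names, from a single API call.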

FAQ

Is Spoken cheaper than AssemblyAI?

For published podcasts, yes — typically 3–5x cheaper once you account for audio handling and the LLM pass needed to map AssemblyAI's "Speaker A/B/C" labels to real names. For your own audio, AssemblyAI is the right choice.
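As a rough worked example of the arithmetic, the sketch below compares per-episode cost using only the prices quoted on this page. The audio-handling and LLM-naming overhead is left as a parameter you supply, because this page gives no figure for it. The function and parameter names are illustrative, not part of either API.

```python
ASSEMBLYAI_PER_HOUR = 0.17         # approximate rate quoted on this page
SPOKEN_PER_EPISODE = (0.08, 0.15)  # flat per-episode range quoted on this page


def assemblyai_cost(duration_hours: float, extra_per_episode: float = 0.0) -> float:
    """Transcription cost plus overhead for audio handling and the naming pass.

    extra_per_episode is a placeholder: plug in your own estimate.
    """
    return ASSEMBLYAI_PER_HOUR * duration_hours + extra_per_episode


def spoken_cost_range() -> tuple[float, float]:
    """Flat per-episode price range, independent of episode length."""
    return SPOKEN_PER_EPISODE


if __name__ == "__main__":
    low, high = spoken_cost_range()
    for hours in (1.0, 2.0, 3.0):
        print(
            f"{hours:.1f} h  AssemblyAI ~${assemblyai_cost(hours):.2f} before overhead"
            f"  |  Spoken ${low:.2f}-${high:.2f}"
        )
```

Because the Spoken price is flat per episode, the gap grows with episode length and with whatever overhead you add for fetching audio and naming speakers.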

Does AssemblyAI give real speaker names?

No. AssemblyAI's diarization returns labels like "A", "B", "C" with high accuracy (about 97%), but doesn't identify who the speakers are. Spoken does this mapping for podcasts as part of the response.
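To make the extra naming step concrete, here is a minimal sketch of the post-processing you would do yourself on diarized output. It assumes utterances shaped as dicts with "speaker", "start" (in milliseconds), and "text" keys, and a label-to-name mapping that you have produced by hand or with an LLM. The helper name and the Markdown layout are illustrative; they are not Spoken's exact output format.

```python
def name_speakers(utterances: list[dict], names: dict[str, str]) -> str:
    """Render diarized utterances as Markdown with bold speaker names.

    utterances: dicts with "speaker", "start" (milliseconds) and "text".
    names: mapping from diarization label to real name, e.g. {"A": "Host"}.
    Labels missing from the mapping fall back to "Speaker <label>".
    """
    lines = []
    for u in utterances:
        label = u["speaker"]
        who = names.get(label, f"Speaker {label}")
        total_seconds = u["start"] // 1000
        minutes, seconds = divmod(total_seconds, 60)
        hours, minutes = divmod(minutes, 60)
        stamp = f"{hours:02d}:{minutes:02d}:{seconds:02d}"
        lines.append(f"**{who}** [{stamp}] {u['text']}")
    return "\n\n".join(lines)


if __name__ == "__main__":
    sample = [
        {"speaker": "A", "start": 0, "text": "Welcome back."},
        {"speaker": "B", "start": 65000, "text": "Thanks for having me."},
    ]
    # Only "A" is mapped, so "B" keeps a generic label.
    print(name_speakers(sample, {"A": "Andrew Huberman"}))
```

The hard part is not this rendering step but producing the `names` mapping reliably, which is the work the answer above says Spoken does for you.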

Can I use both AssemblyAI and Spoken?

Yes. Common pattern: Spoken for podcasts in your product, AssemblyAI for user-uploaded audio. The two solve different parts of the problem.
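One way to sketch that split is a small router that picks a backend from what the request carries. The request type, field names, and backend labels below are hypothetical; the actual client calls to each service are deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TranscriptRequest:
    episode_id: Optional[str] = None  # identifies a published podcast episode
    audio_url: Optional[str] = None   # points at user-uploaded audio


def choose_backend(req: TranscriptRequest) -> str:
    """Return which service should handle the request."""
    if req.episode_id:
        return "spoken"      # published podcast: retrieve the existing transcript
    if req.audio_url:
        return "assemblyai"  # raw audio: run speech-to-text
    raise ValueError("request needs an episode_id or an audio_url")
```

Checking `episode_id` first means a request that carries both fields is treated as a published podcast; flip the order if your product should prefer the uploaded audio.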

What about AssemblyAI's bundled features like sentiment and summarization?

Those are powerful for raw audio analysis but unnecessary for podcast retrieval. Once you have the Markdown transcript from Spoken, you can run sentiment, summarization, or any other LLM-based analysis using the model of your choice — typically cheaper than the bundled options.

What format does Spoken return?

Clean Markdown, served with Content-Type: text/markdown; charset=utf-8, with speaker names in bold and a timestamp on each turn. A typical one-hour episode produces 8,000–15,000 tokens.
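If you need structured turns rather than raw text, a Markdown transcript of this kind is straightforward to split. The sketch below assumes each turn starts with a line of the form `**Speaker Name** [HH:MM:SS] text`; that exact layout is an assumption for illustration, so adjust the pattern to whatever the response actually contains.

```python
import re
from typing import NamedTuple


class Turn(NamedTuple):
    speaker: str
    timestamp: str
    text: str


# Assumed turn shape: **Speaker Name** [HH:MM:SS] text (MM:SS also accepted)
TURN_RE = re.compile(
    r"^\*\*(?P<speaker>.+?)\*\*\s*\[(?P<ts>\d{1,2}:\d{2}(?::\d{2})?)\]\s*(?P<text>.*)$"
)


def parse_turns(markdown: str) -> list[Turn]:
    """Split a transcript into (speaker, timestamp, text) turns."""
    turns: list[Turn] = []
    for raw in markdown.splitlines():
        line = raw.strip()
        match = TURN_RE.match(line)
        if match:
            turns.append(Turn(match["speaker"], match["ts"], match["text"]))
        elif turns and line:
            # Non-empty line without a speaker header: continue the previous turn.
            last = turns[-1]
            turns[-1] = last._replace(text=f"{last.text} {line}".strip())
    return turns
```

Lines that appear before the first recognized turn (a title, for example) are skipped rather than attached to any speaker.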

TL;DR: AssemblyAI is excellent for your own audio, especially with bundled analytics. For published podcasts, Spoken is faster, cheaper, and returns real speaker names with no post-processing.

Try Spoken with no signup — use API key pt_demo on any endpoint.

$0.10 per transcript. Credits never expire.