Spoken vs AssemblyAI for podcasts — flat-rate retrieval vs per-minute transcription

AssemblyAI transcribes audio at roughly $0.17/hr with diarization included. Spoken returns published podcasts as Markdown with real speaker names for a flat $0.08–$0.15 per episode.

AssemblyAI is a speech-to-text API. Spoken is a podcast transcript retrieval API. AssemblyAI Universal transcribes audio at roughly $0.17 per hour with diarization included, but outputs anonymous speaker labels (A, B, C). For published podcasts, Spoken returns the existing transcript with real speaker names (Andrew Huberman, Tim Ferriss) for $0.08–$0.15 flat per episode — no audio file, no upload, no post-processing.

Where AssemblyAI wins

AssemblyAI is a strong choice when you need more than a transcript. The "Speech Understanding" model bundles sentiment analysis, PII detection, summarization, and topic detection in one call — useful for compliance-heavy or analytics-heavy workloads. It also offers excellent diarization quality (a 2.9% speaker count error rate per their published benchmarks).

For raw audio you already own — meetings, calls, recordings — AssemblyAI is one of the best STT APIs available, especially if you'd otherwise be stitching together multiple analysis passes.

Where Spoken wins

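For podcasts that have already been published, the full AssemblyAI pipeline is overkill: you have to fetch and host the audio, transcribe it with diarization, and then run an LLM pass to map the "A", "B", "C" labels to real names.

Spoken collapses all of that into a single call that returns Markdown with real names already attached.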

Pick the right tool

	AssemblyAI Universal	Spoken
Category	Speech-to-text API	Transcript retrieval API
Input	Audio file URL or upload	Search query or episode ID
Speaker labels	"A", "B", "C" (97% diarization accuracy)	Real names: "Andrew Huberman", "Tim Ferriss"
Cost per podcast hour	~$0.17/hr + audio handling + LLM pass for naming	$0.08–$0.15 per episode, names included
Output format	JSON with utterances + word timings	Markdown with speaker bold + timestamps
Bundled features	Sentiment, PII redaction, summarization, topic detection	Just the transcript, optimized for LLM context
Works on your own audio	Yes	No
Works on published podcasts	Yes, after you fetch and host the audio	Yes, in one API call

Pick AssemblyAI for

Your own audio: meetings, calls, recordings, interviews
Bundled analytics (sentiment, PII, summarization) in the same call
Compliance use cases needing PII redaction
Building voice agents or live captioning
Real-time / streaming transcription

Pick Spoken for

Published podcast episodes as Markdown
Podcast summarizers, RAG over podcasts, research agents
Real speaker names without a second model pass
Flat per-episode pricing
One call instead of a pipeline

FAQ

Is Spoken cheaper than AssemblyAI?

For published podcasts, yes — typically 3–5x cheaper once you account for audio handling and the LLM pass needed to map AssemblyAI's "Speaker A/B/C" labels to real names. For your own audio, AssemblyAI is the right choice.
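
To make the comparison concrete, here is a rough cost sketch. The $0.17/hr and $0.08–$0.15 figures come from this page; the per-episode audio-handling and LLM-naming costs are placeholder assumptions, not published prices, so substitute your own numbers before drawing conclusions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineCosts:
    """Per-episode cost inputs for a transcribe-then-name pipeline."""
    stt_per_hour: float = 0.17    # AssemblyAI rate quoted on this page
    audio_handling: float = 0.05  # ASSUMPTION: fetch/host cost per episode
    llm_naming: float = 0.20      # ASSUMPTION: LLM pass mapping A/B/C to names


def pipeline_cost(hours: float, costs: PipelineCosts = PipelineCosts()) -> float:
    """Estimated cost of transcribing and naming one episode of the given length."""
    return hours * costs.stt_per_hour + costs.audio_handling + costs.llm_naming


def flat_cost_range() -> tuple:
    """Spoken's flat per-episode range quoted on this page: (low, high)."""
    return (0.08, 0.15)


if __name__ == "__main__":
    total = pipeline_cost(hours=1.0)
    low, high = flat_cost_range()
    print(f"pipeline: ${total:.2f} per episode, flat: ${low:.2f} to ${high:.2f}")
```

The ratio between the two depends almost entirely on the two assumed inputs, which is why they are exposed as parameters rather than hard-coded.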

Does AssemblyAI give real speaker names?

No. AssemblyAI's diarization returns labels like "A", "B", "C" with high accuracy (about 97%), but doesn't identify who the speakers are. Spoken does this mapping for podcasts as part of the response.
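
As an illustration of the naming step this answer refers to, the sketch below maps diarization labels to names and renders Markdown turns. The utterance shape (`speaker`, `start` in milliseconds, `text`), the `names` mapping, and the output layout are assumptions for illustration; in practice the mapping would come from a separate step such as an LLM pass, and the layout is not necessarily the one Spoken uses.

```python
def format_timestamp(ms: int) -> str:
    """Convert milliseconds to HH:MM:SS."""
    total_seconds = ms // 1000
    hours, rem = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def render_markdown(utterances: list, names: dict) -> str:
    """Render diarized utterances as Markdown turns with bold speaker names.

    Labels missing from `names` fall back to "Speaker <label>".
    """
    lines = []
    for utt in utterances:
        label = utt["speaker"]
        name = names.get(label, f"Speaker {label}")
        stamp = format_timestamp(utt["start"])
        lines.append(f"**{name}** [{stamp}]: {utt['text']}")
    return "\n\n".join(lines)


# Hypothetical sample data; only label "A" has been identified.
utterances = [
    {"speaker": "A", "start": 0, "text": "Welcome to the show."},
    {"speaker": "B", "start": 4200, "text": "Thanks for having me."},
]
print(render_markdown(utterances, {"A": "Host Name"}))
```

The fallback to "Speaker B" shows what a reader sees when the naming pass misses a label.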

Can I use both AssemblyAI and Spoken?

Yes. A common pattern is to use Spoken for podcasts in your product and AssemblyAI for user-uploaded audio. The two solve different parts of the problem.
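
That split can be sketched as a small router. The `Source` type and the `Backend` names are hypothetical; the actual client calls are left out because neither API's endpoints are described on this page.

```python
from dataclasses import dataclass
from enum import Enum


class Backend(Enum):
    SPOKEN = "spoken"          # published podcast episodes
    ASSEMBLYAI = "assemblyai"  # audio you own or that users upload


@dataclass(frozen=True)
class Source:
    """Hypothetical description of something to transcribe."""
    is_published_podcast: bool
    needs_realtime: bool = False


def choose_backend(source: Source) -> Backend:
    """Pick a backend following the guidance on this page."""
    # Streaming and non-podcast audio go to speech-to-text.
    if source.needs_realtime or not source.is_published_podcast:
        return Backend.ASSEMBLYAI
    return Backend.SPOKEN
```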

What about AssemblyAI's bundled features like sentiment and summarization?

Those are powerful for raw audio analysis but unnecessary for podcast retrieval. Once you have the Markdown transcript from Spoken, you can run sentiment, summarization, or any other LLM-based analysis using the model of your choice — typically cheaper than the bundled options.
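
One way to run your own analysis is to split the Markdown transcript into chunks that fit a model's context budget. The sketch assumes turns are separated by blank lines and uses a rough estimate of four characters per token; both are assumptions, and a real tokenizer will give different counts.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: about 4 characters per token (ASSUMPTION)."""
    return max(1, len(text) // 4)


def chunk_transcript(markdown: str, max_tokens: int) -> list:
    """Group blank-line-separated turns into chunks under a token budget.

    A single turn larger than the budget becomes its own chunk.
    """
    turns = [t.strip() for t in markdown.split("\n\n") if t.strip()]
    chunks = []
    current = []
    current_tokens = 0
    for turn in turns:
        turn_tokens = estimate_tokens(turn)
        # Flush the running chunk before it would exceed the budget.
        if current and current_tokens + turn_tokens > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(turn)
        current_tokens += turn_tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting on turn boundaries keeps each speaker's statement intact, which tends to matter for summarization and sentiment prompts.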

What format does Spoken return?

Clean Markdown, Content-Type: text/markdown; charset=utf-8, with speaker names in bold and timestamps per turn. A typical one-hour episode produces 8,000–15,000 tokens.
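
A minimal sketch of consuming such a response: check the Content-Type header, then count turns per speaker. It assumes each turn begins a line with the speaker name in bold; the exact position of timestamps is not specified here, so the sketch does not parse them.

```python
import re
from collections import Counter
from email.message import Message

# ASSUMPTION: a turn starts a line with the bold speaker name.
BOLD_AT_LINE_START = re.compile(r"^\*\*(.+?)\*\*", re.MULTILINE)


def parse_content_type(header: str) -> tuple:
    """Split a Content-Type header into (media type, charset)."""
    msg = Message()
    msg["Content-Type"] = header
    return msg.get_content_type(), msg.get_content_charset()


def count_turns(markdown: str) -> Counter:
    """Count turns per speaker from bold names at the start of lines."""
    return Counter(BOLD_AT_LINE_START.findall(markdown))
```

For example, `parse_content_type("text/markdown; charset=utf-8")` yields the media type and charset separately, which is handy for rejecting unexpected responses before parsing.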

TL;DR: AssemblyAI is excellent for your own audio, especially with bundled analytics. For published podcasts, Spoken is faster, cheaper, and returns real speaker names with no post-processing.

Try Spoken with no signup — use API key pt_demo on any endpoint.

$0.10 per transcript. Credits never expire.