AssemblyAI transcribes audio at $0.17/hr with diarization. Spoken returns published podcasts as Markdown with real speaker names for a flat $0.08–$0.15 per episode.
AssemblyAI is a speech-to-text API. Spoken is a podcast transcript retrieval API. AssemblyAI Universal transcribes audio at roughly $0.17 per hour with diarization included, but outputs anonymous speaker labels (A, B, C). For published podcasts, Spoken returns the existing transcript with real speaker names (Andrew Huberman, Tim Ferriss) for $0.08–$0.15 flat per episode — no audio file, no upload, no post-processing.
| | AssemblyAI Universal | Spoken |
|---|---|---|
| Category | Speech-to-text API | Transcript retrieval API |
| Input | Audio file URL or upload | Search query or episode ID |
| Speaker labels | "A", "B", "C" (97% diarization accuracy) | Real names: "Andrew Huberman", "Tim Ferriss" |
| Cost per podcast hour | ~$0.17/hr + audio handling + LLM pass for naming | $0.08–$0.15 per episode, names included |
| Output format | JSON with utterances + word timings | Markdown with speaker bold + timestamps |
| Bundled features | Sentiment, PII redaction, summarization, topic detection | Just the transcript, optimized for LLM context |
| Works on your own audio | Yes | No |
| Works on published podcasts | Yes, after you fetch and host the audio | Yes, in one API call |
AssemblyAI is a strong choice when you need more than a transcript. The "Speech Understanding" model bundles sentiment analysis, PII detection, summarization, and topic detection in one call — useful for compliance-heavy or analytics-heavy workloads. It also offers excellent diarization quality (a 2.9% speaker count error rate per their published benchmarks).
For raw audio you already own — meetings, calls, recordings — AssemblyAI is one of the best STT APIs available, especially if you'd otherwise be stitching together multiple analysis passes.
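For podcasts that have already been published, the full AssemblyAI pipeline is overkill: you fetch and host the audio, transcribe it, and then run an LLM pass to map the anonymous speaker labels to real names.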
Spoken collapses all of that into a single call that returns Markdown with real names already attached.
For published podcasts, Spoken is typically 3–5x cheaper than AssemblyAI once you account for audio handling and the LLM pass needed to map AssemblyAI's "Speaker A/B/C" labels to real names. For your own audio, AssemblyAI is the right choice.
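To check the cost comparison against your own numbers, here is a minimal sketch. The $0.17/hr rate and the $0.08–$0.15 per-episode prices come from this page. The `audio_handling` and `llm_naming` parameters are placeholders that default to zero, because this page does not put a figure on them, so the printed ratio covers transcription only until you fill them in.

```python
def assemblyai_pipeline_cost(hours, stt_rate=0.17, audio_handling=0.0, llm_naming=0.0):
    """Estimated USD cost to transcribe one episode and name its speakers.

    stt_rate is the approximate per-hour rate quoted on this page.
    audio_handling and llm_naming are per-episode overheads that you
    supply yourself (assumptions, not published prices).
    """
    return hours * stt_rate + audio_handling + llm_naming


def cost_ratio(pipeline_cost, spoken_price):
    """How many times more the pipeline costs than a flat per-episode price."""
    return pipeline_cost / spoken_price


# One-hour episode, transcription only (overheads left at zero).
base = assemblyai_pipeline_cost(1.0)
for spoken_price in (0.08, 0.15):
    ratio = cost_ratio(base, spoken_price)
    print(f"Spoken at ${spoken_price:.2f}/episode: pipeline costs {ratio:.2f}x as much")
```

Longer episodes and non-zero overheads push the ratio up, since the Spoken price is flat per episode.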
AssemblyAI does not identify speakers by name. Its diarization returns labels like "A", "B", "C" with high accuracy (about 97%), but it doesn't tell you who the speakers are. Spoken does this mapping for podcasts as part of the response.
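To make the naming step concrete, here is a rough sketch of the post-processing a diarized transcript needs before it reads like a named one. It assumes utterances shaped like dicts with `speaker`, `start` (milliseconds), and `text` keys. The sample utterances and the `names` mapping are made up for illustration; in practice you would have to produce that mapping yourself, for example with an LLM pass or from show notes.

```python
def format_timestamp(ms):
    """Convert a millisecond offset to HH:MM:SS."""
    total_seconds = ms // 1000
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


def utterances_to_markdown(utterances, names):
    """Render diarized utterances as Markdown turns with bold speaker names.

    Labels missing from `names` fall back to "Speaker <label>".
    """
    turns = []
    for u in utterances:
        speaker = names.get(u["speaker"], f"Speaker {u['speaker']}")
        turns.append(f"**{speaker}** [{format_timestamp(u['start'])}]: {u['text']}")
    return "\n\n".join(turns)


# Hypothetical diarized output and a hand-written label-to-name mapping.
utterances = [
    {"speaker": "A", "start": 5000, "text": "Welcome to the podcast."},
    {"speaker": "B", "start": 12000, "text": "Thanks for having me."},
    {"speaker": "C", "start": 3723000, "text": "One more question."},
]
names = {"A": "Andrew Huberman", "B": "Tim Ferriss"}

print(utterances_to_markdown(utterances, names))
```

Speaker "C" has no entry in the mapping, so it stays anonymous. That is the gap the naming pass has to close.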
The two can be used together. A common pattern is Spoken for podcasts in your product and AssemblyAI for user-uploaded audio; they solve different parts of the problem.
AssemblyAI's bundled features (sentiment, PII redaction, summarization, topic detection) are powerful for raw audio analysis but unnecessary for podcast retrieval. Once you have the Markdown transcript from Spoken, you can run sentiment, summarization, or any other LLM-based analysis using the model of your choice, which is typically cheaper than the bundled options.
Spoken returns clean Markdown (Content-Type: text/markdown; charset=utf-8) with speaker names in bold and a timestamp per turn. A typical one-hour episode produces 8,000–15,000 tokens.
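If you want structured turns rather than raw Markdown, a small parser is usually enough. The sketch below assumes each turn sits on one line in the form `**Name** [HH:MM:SS]: text`. That exact layout is an assumption for illustration, not Spoken's documented format, so adjust the pattern to match what the API actually returns. The sample transcript is invented.

```python
import re
from collections import Counter

# Assumed turn layout: **Speaker Name** [HH:MM:SS]: text
TURN = re.compile(
    r"^\*\*(?P<speaker>[^*]+)\*\*\s*\[(?P<ts>\d{2}:\d{2}:\d{2})\]:?\s*(?P<text>.*)$"
)


def parse_turns(markdown):
    """Return (speaker, timestamp, text) tuples for lines matching the assumed layout."""
    turns = []
    for line in markdown.splitlines():
        match = TURN.match(line.strip())
        if match:
            turns.append((match["speaker"], match["ts"], match["text"]))
    return turns


# Invented sample in the assumed layout.
sample = (
    "**Andrew Huberman** [00:00:05]: Welcome to the podcast.\n"
    "\n"
    "**Tim Ferriss** [00:00:12]: Thanks for having me.\n"
)

turns = parse_turns(sample)

# Example use: count words per speaker.
words_per_speaker = Counter()
for speaker, _, text in turns:
    words_per_speaker[speaker] += len(text.split())

print(dict(words_per_speaker))
```

Lines that do not match the pattern are skipped rather than raising, so headings or blank lines in the transcript do not break parsing.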
TL;DR: AssemblyAI is excellent for your own audio, especially with bundled analytics. For published podcasts, Spoken is faster, cheaper, and returns real speaker names with no post-processing.
Try Spoken with no signup — use API key pt_demo on any endpoint.
$0.10 per transcript. Credits never expire.