Building a retrieval-augmented generation pipeline that reasons over podcast content. How to skip the transcription step and start with Markdown that's already chunk-ready.
RAG over podcasts usually starts with an audio transcription step — Whisper, AssemblyAI, or similar — followed by diarization, speaker naming, and chunking. Spoken collapses the first three steps into one API call. You get clean Markdown with real speaker names, timestamps, and natural paragraph breaks. Drop it into your vector store with a standard text splitter and you're done.
Most podcast RAG cookbooks (Haystack, LangChain, LlamaIndex) follow the same shape:
Steps 1–4 are infrastructure work that has nothing to do with retrieval quality. They're table stakes you have to ship before you can start tuning the part that actually matters.
# Steps 1–4 collapse into one call
import requests
from langchain_text_splitters import MarkdownTextSplitter

episode_id = "..."  # an episode ID from /search

resp = requests.get(
    f"https://spoken.md/transcripts/{episode_id}",
    headers={"x-api-key": "pt_demo"},
)
resp.raise_for_status()  # fail loudly on a bad key or unknown episode
markdown = resp.text  # Real speaker names, timestamps, paragraph breaks

# Step 5: chunk
chunks = MarkdownTextSplitter(chunk_size=1000, chunk_overlap=150).split_text(markdown)

# Steps 6–7: embed, store, retrieve as normal
Speaker names come through as bold Markdown (**Andrew Huberman** (0:00)), which means chunks naturally carry attribution. When the retriever returns a passage, you know who said it without an extra metadata layer.
Below is the shape of a working podcast RAG pipeline using Spoken. Adapt it to your stack of choice.
import re

import requests
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import MarkdownTextSplitter

EPISODE_IDS = ["...", "..."]  # IDs from /search


def fetch(eid):
    """Fetch one episode transcript as Markdown."""
    r = requests.get(
        f"https://spoken.md/transcripts/{eid}",
        headers={"x-api-key": "YOUR_KEY"},
        timeout=30,
    )
    r.raise_for_status()
    return r.text


splitter = MarkdownTextSplitter(chunk_size=1200, chunk_overlap=180)
docs = []
for eid in EPISODE_IDS:
    md = fetch(eid)
    for chunk in splitter.split_text(md):
        # Use the first bolded speaker name in the chunk as metadata
        speaker_match = re.search(r"\*\*([^*]+)\*\*", chunk)
        # Fall back to a string: some Chroma versions reject None metadata values
        speaker = speaker_match.group(1) if speaker_match else "unknown"
        docs.append({"text": chunk, "metadata": {"episode": eid, "speaker": speaker}})

# Embed every chunk and index it alongside its metadata
store = Chroma.from_texts(
    [d["text"] for d in docs],
    OpenAIEmbeddings(),
    metadatas=[d["metadata"] for d in docs],
)
A back-catalogue of 500 episodes at one hour each comes to $40 at the volume rate. Re-fetches during pipeline iteration are free, so you can keep tweaking your chunker without paying for transcription again.
For chunk size, 1000–1500 characters with 150–200 characters of overlap is a reasonable starting point. Spoken's Markdown has natural paragraph breaks (at roughly the 85th percentile of inter-sentence gaps), so the standard MarkdownTextSplitter tends to land cleanly on speaker-turn boundaries.
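To see why paragraph breaks matter for chunk boundaries, here is a dependency-free sketch of paragraph-aware packing. It is not MarkdownTextSplitter's actual algorithm and it omits overlap. The function name `pack_paragraphs` and the assumption that paragraphs are separated by blank lines are illustrative.

```python
def pack_paragraphs(markdown, max_chars=1200):
    """Greedily pack blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    paragraphs = [p.strip() for p in markdown.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow: close the chunk and start a new one
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Made-up transcript: three speaker turns, one paragraph each
sample = "\n\n".join(
    [
        "**Speaker A** (0:00) " + "x" * 50,
        "**Speaker B** (0:10) " + "y" * 50,
        "**Speaker A** (0:20) " + "z" * 50,
    ]
)
for chunk in pack_paragraphs(sample, max_chars=150):
    print(len(chunk), chunk[:20])
```

Because chunks are only ever cut between paragraphs, every chunk in this sketch begins with a speaker-turn header.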
Each speaker turn carries a timestamp like **Andrew Huberman** (0:45). As long as your splitter keeps paragraphs intact, the timestamp stays at the start of each chunk and can be extracted as metadata.
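A minimal sketch of that extraction, assuming turn headers always look like the example above (`**Name** (m:ss)`, or `h:mm:ss` past the hour). The regex and the `first_turn` helper are illustrative, not part of Spoken's API.

```python
import re

# Assumed header shape: **Speaker Name** (m:ss) or **Speaker Name** (h:mm:ss)
TURN_HEADER = re.compile(r"\*\*([^*]+)\*\*\s*\((\d+(?::\d{2}){1,2})\)")


def first_turn(chunk):
    """Return (speaker, start_seconds) for the first turn header in a chunk, or None."""
    match = TURN_HEADER.search(chunk)
    if match is None:
        return None
    speaker, stamp = match.groups()
    seconds = 0
    for part in stamp.split(":"):
        # Fold h:mm:ss or m:ss into a single number of seconds
        seconds = seconds * 60 + int(part)
    return speaker.strip(), seconds


print(first_turn("**Andrew Huberman** (0:45) Welcome back to the podcast."))
# → ('Andrew Huberman', 45)
```

Storing the start time as seconds rather than as a string makes it easier to sort chunks or build deep links later. Chunks that begin mid-turn return `None`, so callers should handle that case.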
Store the episode ID and speaker name as chunk metadata. At generation time, format citations as "[Show Name, Speaker, timestamp]" — Spoken's response gives you all three pieces.
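One way to assemble that citation string, as a sketch. The metadata keys `show`, `speaker`, and `timestamp` are hypothetical names for values you would have stored at indexing time, and the show name below is a placeholder.

```python
def format_citation(metadata):
    """Build "[Show Name, Speaker, timestamp]", skipping any piece that is missing."""
    parts = [metadata.get(key) for key in ("show", "speaker", "timestamp")]
    return "[" + ", ".join(p for p in parts if p) + "]"


print(format_citation(
    {"show": "Example Show", "speaker": "Andrew Huberman", "timestamp": "0:45"}
))
# → [Example Show, Andrew Huberman, 0:45]
```

Skipping missing fields keeps citations readable for chunks where no speaker or timestamp could be extracted.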
Spoken returns transcripts in their original language. Speaker name detection works across languages where the names are mentioned in context.
To keep the index current, use the search endpoint to find new episodes by podcast or topic, then fetch the new IDs. Fetches of episodes you have already paid for are free.
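The bookkeeping for incremental updates can be as simple as a set difference. This sketch leaves the search call itself out, since its parameters are not covered here; `new_episode_ids` is a hypothetical helper, and the IDs are placeholders.

```python
def new_episode_ids(found_ids, indexed_ids):
    """Return IDs from found_ids that are not yet indexed, in order, without duplicates."""
    seen = set(indexed_ids)
    fresh = []
    for eid in found_ids:
        if eid not in seen:
            seen.add(eid)
            fresh.append(eid)
    return fresh


already_indexed = {"ep-1", "ep-2"}               # IDs already in your vector store
search_results = ["ep-2", "ep-3", "ep-4", "ep-3"]  # IDs returned by a search
print(new_episode_ids(search_results, already_indexed))
# → ['ep-3', 'ep-4']
```

Only the returned IDs need to be fetched, chunked, and embedded; everything already in the store is left untouched.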
TL;DR: For RAG over podcasts, Spoken is the cleanest input you can give your pipeline — Markdown with real speaker names, ready to chunk and embed. It collapses the first four steps of every standard cookbook into one fetch.
Try Spoken with no signup — use API key pt_demo on any endpoint.
For a 500-episode back-catalogue: $40 at the volume rate.