RAG over podcasts: a clean transcript primitive for LLM pipelines

How to build a retrieval-augmented generation (RAG) pipeline that reasons over podcast content: skip the transcription step and start with Markdown that's already chunk-ready.

Last updated June 2026

RAG over podcasts usually starts with an audio transcription step — Whisper, AssemblyAI, or similar — followed by diarization, speaker naming, and chunking. Spoken collapses the first three steps into one API call. You get clean Markdown with real speaker names, timestamps, and natural paragraph breaks. Drop it into your vector store with a standard text splitter and you're done.

The typical RAG-over-podcasts pipeline

Most podcast RAG cookbooks (Haystack, LangChain, LlamaIndex) follow the same shape:

1. Locate and download audio
2. Transcribe with Whisper or AssemblyAI
3. Run diarization (Pyannote or built-in) and align with transcription
4. Identify speakers from context (LLM pass)
5. Chunk and embed
6. Store in a vector DB
7. Retrieve + generate at query time

Steps 1–4 are infrastructure work that has nothing to do with retrieval quality. They're table stakes you have to ship before you can start tuning the part that actually matters.

The Spoken version

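# Steps 1–4 collapse into one call
import requests

episode_id = "..."  # ID from /search

resp = requests.get(
    f"https://spoken.md/transcripts/{episode_id}",
    headers={"x-api-key": "pt_demo"},
    timeout=30,
)
resp.raise_for_status()  # Fail loudly on HTTP errors instead of chunking an error page
markdown = resp.text  # Real speaker names, timestamps, paragraph breaks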

# Step 5: chunk
from langchain_text_splitters import MarkdownTextSplitter
chunks = MarkdownTextSplitter(chunk_size=1000, chunk_overlap=150).split_text(markdown)

# Steps 6–7: embed, store, retrieve as normal

Speaker names come through as bold Markdown (**Andrew Huberman** (0:00)), which means chunks naturally carry attribution. When the retriever returns a passage, you know who said it without an extra metadata layer.
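To make the attribution explicit, the transcript can be split into speaker turns before any chunking. The following is a minimal sketch, assuming every turn opens a line with the `**Name** (m:ss)` or `**Name** (h:mm:ss)` pattern shown above; the rest of the layout, the `parse_turns` helper, and the "Guest Name" speaker in the sample are illustrative assumptions, not part of Spoken's documented output.

```python
import re

# Assumed turn header: "**Speaker Name** (m:ss)" or "(h:mm:ss)" at the start of a line.
TURN_HEADER = re.compile(
    r"^\*\*(?P<speaker>[^*]+)\*\*\s*\((?P<ts>\d+(?::\d{2}){1,2})\)\s*",
    re.MULTILINE,
)

def parse_turns(markdown):
    """Split a transcript into a list of {speaker, timestamp, text} dicts."""
    turns = []
    matches = list(TURN_HEADER.finditer(markdown))
    for i, m in enumerate(matches):
        # A turn's text runs until the next header, or to the end of the transcript.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        turns.append({
            "speaker": m.group("speaker").strip(),
            "timestamp": m.group("ts"),
            "text": markdown[m.end():end].strip(),
        })
    return turns

# Hypothetical sample in the assumed format.
sample = (
    "**Andrew Huberman** (0:00) Welcome to the podcast.\n\n"
    "**Guest Name** (0:45) Thanks for having me.\n\n"
    "It is great to be here.\n"
)
turns = parse_turns(sample)
```

Each turn keeps its speaker and timestamp even when it spans several paragraphs, which is useful if you later want to chunk per turn rather than per character count.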

Why this matters for retrieval quality

Speaker context survives chunking. "Speaker 0" doesn't help an LLM reason about authority or perspective. "Andrew Huberman" does.
Markdown structure beats raw text. Standard Markdown splitters respect paragraph boundaries, which align with natural speaker turns.
Timestamps as metadata. Each turn includes a timestamp you can attach to chunks for citation-style answers ("Andrew Huberman, around 0:45, says…").
Idempotent fetches. Repeat fetches of the same episode don't charge again, so you can re-chunk during pipeline development without burning credits.
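Even with free repeat fetches, keeping a local copy avoids network round trips while you iterate on the chunker. A minimal sketch of an on-disk cache follows; `cached_fetch`, the `transcript_cache` directory, and the injected `fetch` callable are all hypothetical names chosen for illustration.

```python
import hashlib
from pathlib import Path

def cached_fetch(episode_id, fetch, cache_dir="transcript_cache"):
    """Return the transcript for episode_id, calling fetch(episode_id) only on a cache miss."""
    cache = Path(cache_dir)
    cache.mkdir(parents=True, exist_ok=True)
    # Hash the ID so any characters in it are safe to use as a filename.
    name = hashlib.sha256(episode_id.encode("utf-8")).hexdigest() + ".md"
    path = cache / name
    if path.exists():
        return path.read_text(encoding="utf-8")
    markdown = fetch(episode_id)
    path.write_text(markdown, encoding="utf-8")
    return markdown
```

Pass in whatever function performs the real HTTP request; during development, re-running the pipeline then reads every transcript from disk.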

Reference pipeline

Below is the shape of a working podcast RAG pipeline using Spoken. Adapt to your stack of choice.

from langchain_text_splitters import MarkdownTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
import re
import requests

EPISODE_IDS = ["...", "..."]  # IDs from /search

def fetch(eid):
    """Fetch one episode's transcript as Markdown."""
    r = requests.get(f"https://spoken.md/transcripts/{eid}",
                     headers={"x-api-key": "YOUR_KEY"},
                     timeout=30)
    r.raise_for_status()
    return r.text

splitter = MarkdownTextSplitter(chunk_size=1200, chunk_overlap=180)
docs = []
for eid in EPISODE_IDS:
    md = fetch(eid)
    for chunk in splitter.split_text(md):
        # Use the first bold name in the chunk as the speaker; a chunk that
        # starts mid-turn gets the next speaker's name, or "unknown" if none.
        speaker_match = re.search(r"\*\*([^*]+)\*\*", chunk)
        # Fall back to a string: some Chroma versions reject None metadata values.
        speaker = speaker_match.group(1) if speaker_match else "unknown"
        docs.append({"text": chunk, "metadata": {"episode": eid, "speaker": speaker}})

store = Chroma.from_texts(
    [d["text"] for d in docs],
    OpenAIEmbeddings(),
    metadatas=[d["metadata"] for d in docs],
)
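One caveat with the `re.search` approach above: a chunk that begins in the middle of a turn is labelled with the first name that appears later in the chunk, or with nothing at all. A possible refinement, sketched here as a heuristic rather than a definitive fix, is to carry the last seen speaker forward across consecutive chunks of the same episode. `attribute_chunks` and the "Guest Name" sample are hypothetical, and chunk overlap can still shift a label by one turn.

```python
import re

# Assumed turn header: "**Speaker Name** (m:ss)" or "(h:mm:ss)".
HEADER = re.compile(r"\*\*([^*]+)\*\*\s*\(\d+(?::\d{2}){1,2}\)")

def attribute_chunks(chunks):
    """Label each chunk (in transcript order) with the speaker talking at its start."""
    labelled = []
    current = "unknown"  # last speaker seen in earlier chunks
    for chunk in chunks:
        opening = HEADER.match(chunk.lstrip())
        # A chunk that opens with a header uses it; otherwise inherit the previous speaker.
        speaker = opening.group(1).strip() if opening else current
        labelled.append({"text": chunk, "speaker": speaker})
        headers = HEADER.findall(chunk)
        if headers:
            current = headers[-1].strip()  # whoever spoke last carries into the next chunk
    return labelled

# Hypothetical chunks from one episode, in order.
chunks = [
    "**Andrew Huberman** (0:00) Sleep matters because",
    "it sets your circadian rhythm.\n\n**Guest Name** (0:45) I agree.",
    "And here is why.",
]
speakers = [c["speaker"] for c in attribute_chunks(chunks)]
```

Here the second chunk is attributed to the speaker who is mid-sentence at its start, and the third chunk inherits the guest's name even though it contains no header.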

Cost at podcast-RAG scale

A back-catalogue of 500 episodes at one hour each:

Whisper API + Pyannote + LLM-naming pipeline: ~$0.36/hr Whisper × 500 = $180 plus diarization compute plus naming-LLM tokens — typically $250–$400 once infrastructure is included.
AssemblyAI Universal + LLM-naming: ~$0.17/hr × 500 = $85 plus the naming pass — typically $120–$180 total.
Spoken at the 2,000-pack rate: $0.08 × 500 = $40. No naming pass needed.

And re-fetches during pipeline iteration are free, so you can keep tweaking your chunker without re-paying for transcription.
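The base figures above can be reproduced with a few lines of arithmetic, which also makes it easy to plug in your own catalogue size. This sketch uses only the rates quoted in this section; it excludes diarization compute and naming-LLM tokens, and it assumes the Spoken rate applies per episode (for one-hour episodes the result is the same either way).

```python
EPISODES = 500
HOURS_PER_EPISODE = 1.0

# Per-hour transcription rates quoted above (USD). Extras such as diarization
# compute and naming-LLM tokens are not included.
RATES_PER_HOUR = {
    "Whisper API": 0.36,
    "AssemblyAI Universal": 0.17,
}
SPOKEN_PER_EPISODE = 0.08  # 2,000-pack rate quoted above; per-episode billing is an assumption

def base_costs(episodes=EPISODES, hours=HOURS_PER_EPISODE):
    """Return the base transcription cost in USD for each option."""
    costs = {
        name: round(rate * hours * episodes, 2)
        for name, rate in RATES_PER_HOUR.items()
    }
    costs["Spoken"] = round(SPOKEN_PER_EPISODE * episodes, 2)
    return costs

costs = base_costs()
for name, usd in costs.items():
    print(f"{name}: ${usd:,.2f}")
```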

FAQ

What chunk size works best for podcast RAG?

1000–1500 characters with a 150–200 character overlap is a reasonable starting point. Spoken's Markdown already has natural paragraph breaks (inserted where the gap between sentences is at around the 85th percentile), so the standard MarkdownTextSplitter tends to land cleanly on speaker-turn boundaries.

Do timestamps survive chunking?

Each speaker turn carries a timestamp like **Andrew Huberman** (0:45). As long as your splitter keeps paragraphs intact, the timestamp stays at the start of each chunk and can be extracted as metadata.
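As an illustration, the leading header can be turned into chunk metadata, including a numeric offset that is handy for sorting or deep-linking. This is a sketch under the assumption that timestamps are written as `m:ss` or `h:mm:ss`; `to_seconds`, `chunk_metadata`, and the `start_seconds` key are hypothetical names.

```python
import re

# Assumed header at the very start of a chunk: "**Speaker Name** (m:ss)" or "(h:mm:ss)".
LEADING_TURN = re.compile(r"\s*\*\*([^*]+)\*\*\s*\((\d+(?::\d{2}){1,2})\)")

def to_seconds(stamp):
    """Convert "m:ss" or "h:mm:ss" to a number of seconds."""
    seconds = 0
    for part in stamp.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds

def chunk_metadata(chunk, episode_id):
    """Build metadata from the turn header at the start of a chunk, if there is one."""
    meta = {"episode": episode_id}
    m = LEADING_TURN.match(chunk)
    if m:
        meta["speaker"] = m.group(1).strip()
        meta["timestamp"] = m.group(2)
        meta["start_seconds"] = to_seconds(m.group(2))
    return meta

meta = chunk_metadata("**Andrew Huberman** (0:45) Morning light exposure helps.", "ep-1")
```

Chunks without a leading header simply get the episode ID, so the function is safe to call on every chunk.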

How do I cite the source in generated answers?

Store the episode ID, speaker name, and timestamp as chunk metadata. At generation time, format citations as "[Show Name, Speaker, timestamp]"; Spoken's response gives you all three pieces.
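A small helper can render that citation format from chunk metadata and prepend it to each retrieved passage before the text goes into the prompt. This is a sketch only: the metadata keys `show`, `speaker`, and `timestamp` are assumptions about how you stored the fields, and "Example Show" is a placeholder value.

```python
def format_citation(meta):
    """Render "[Show Name, Speaker, timestamp]", skipping any field that is missing."""
    fields = [meta.get("show"), meta.get("speaker"), meta.get("timestamp")]
    present = [str(f) for f in fields if f]
    return "[" + ", ".join(present) + "]" if present else ""

def build_context(retrieved):
    """Prefix each retrieved (text, metadata) pair with its citation for the LLM prompt."""
    return "\n\n".join(
        f"{format_citation(meta)} {text}".strip() for text, meta in retrieved
    )

# Hypothetical retrieval result.
retrieved = [
    ("Morning light exposure helps.",
     {"show": "Example Show", "speaker": "Andrew Huberman", "timestamp": "0:45"}),
]
context = build_context(retrieved)
```

Putting the citation inline with each passage lets the model copy it verbatim into its answer instead of reconstructing it.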

What about multilingual podcasts?

Spoken returns transcripts in their original language. Speaker name detection works across languages where the names are mentioned in context.

Can I refresh transcripts as podcasts add new episodes?

Yes. Use the search endpoint to find new episodes by podcast or topic, then fetch the new IDs. Old episode fetches are free since you've already paid for them.

TL;DR: For RAG over podcasts, Spoken is the cleanest input you can give your pipeline — Markdown with real speaker names, ready to chunk and embed. It collapses the first four steps of every standard cookbook into one fetch.

Try Spoken with no signup — use API key pt_demo on any endpoint.

For a 500-episode back-catalogue: $40 at the volume rate.