**Amar** (0:01)
This podcast is sponsored by Google. Hey folks, I'm Amar, Product and Design Lead at Google DeepMind. We just launched a revamped vibe coding experience in AI Studio that lets you mix and match AI capabilities to turn your ideas into reality faster than ever. Just describe your app and Gemini will automatically wire up the right models and APIs for you. And if you need a spark, hit "I'm Feeling Lucky" and we'll help you get started. Head to ai.studio slash build to create your first app.
**Sam Charrington** (0:33)
Join developers from Cisco, Dell Technologies, Google Cloud, Oracle, Red Hat and more than 75 other supporting companies to build the open tool stack for multi-agent software and trusted agent identity on Agency. Agency, which I recently discussed on the podcast in my interview with Vijoy Pandey, is now an open-source Linux Foundation project where you can help create the protocol, specs and tools that power next-gen AI infrastructure. Visit agency.org to learn more and join the build. That's agntcy.org.
**Zain Asgar** (1:15)
If you take a look at training hardware, it's kind of gone the way of building supercomputers, right? People don't talk about building machines anymore. They're like, here's my entire rack, right? This starts looking like what Cray was doing. So in some ways, you could say we've kind of regressed back to the supercomputer era. And I don't know if I use that word positively, right? We're building these fully vertically integrated systems. I'm not sure that's the route for inference. I think inference is much better served as a large-scale workload where you can utilize a bunch of relatively commodity hardware and scale out efficiently.
**Sam Charrington** (1:59)
All right, everyone, welcome to another episode of The TWIML AI Podcast. I'm your host, Sam Charrington. Today, I'm joined by Zain Asgar. Zain is co-founder and CEO at Gimlet Labs and an adjunct professor of computer science at Stanford University. Before we get going, be sure to hit that subscribe button wherever you're listening to today's show. Zain, welcome to the podcast.
**Zain Asgar** (2:20)
Hi Sam, thanks for having me here. Super excited to be here.
**Sam Charrington** (2:23)
Excited to have you on the show and looking forward to digging into our conversation. We'll be talking about the work you're doing around heterogeneous inference for agentic systems. To get us going, I'd love to have you share a little bit about your background.
**Zain Asgar** (2:36)
As Sam mentioned, I'm co-founder and CEO of Gimlet and also adjunct faculty of computer science at Stanford. Prior to this, I was a general manager at New Relic, through the acquisition of my previous startup, Pixie. Actually, a bunch of people from Pixie are now at Gimlet as well. I was an EIR at Benchmark Capital, which is where the idea for Pixie came from.
Before that, I was at Google Research and spent a lot of time at NVIDIA. I've focused on efficient compute and being able to orchestrate and run compute efficiently on large-scale clusters.
**Sam Charrington** (3:06)
Where did the idea for Gimlet come from? What are you going after there?
**Zain Asgar** (3:11)
When we started Gimlet a couple of years ago, we had a focus on how do we actually make AI workloads at least 10 times more efficient. Part of the challenge here has been that you've seen this huge explosion in AI workloads, especially around agentic AI, where you're consuming 10x more tokens. And really, if you want to keep this somewhat sustainable, you need these big leaps in improvement.
So that was our original focus at Gimlet. When we started off, we were really thinking about how do we get models to run on things like your laptop, a Raspberry Pi, or other small-scale hardware, and how do we get the best efficiency?

**Sam Charrington**

Kind of edge devices?

**Zain Asgar**

Exactly, any kind of edge device. But one of the things we realized is that we had built our stack to work on very heterogeneous systems, right? Because a MacBook looks very different from a typical Windows laptop running an Intel or AMD CPU, which looks very different from an iPhone. And we got really good at running models efficiently. So we realized that this technology we'd built up is actually pretty applicable to data-center-scale systems, where there's a much larger-scale problem. We could have a much bigger impact if we improved that system. And so our team decided to start going after that space instead of directly targeting edge devices, partly because we think the edge market still needs a couple of years to really materialize.