Proactive Agents for the Web with Devi Parikh

The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

November 19, 2025

Today, we're joined by Devi Parikh, co-founder and co-CEO of Yutori, to discuss browser use models and a future where we interact with the web through proactive, autonomous agents.
Speakers: Sam Charrington, Devi Parikh

**Sam Charrington** (0:01)
Join developers from Cisco, Dell Technologies, Google Cloud, Oracle, Red Hat, and more than 75 other supporting companies to build the open tool stack for multi-agent software and trusted agent identity on AGNTCY. AGNTCY, which I recently discussed on the podcast in my interview with Vijoy Pandey, is now an open source Linux Foundation project where you can help create the protocol, specs, and tools that power next-gen AI infrastructure. Visit agntcy.org to learn more and join the build. That's a-g-n-t-c-y dot org.

**Devi Parikh** (0:43)
We will no longer be interacting with the Web in the same way that we do right now. We won't be clicking buttons and fiddling with forms on websites and browsers. We will be interacting with the Web one level higher in abstraction, where we're describing what needs to be done, or maybe our assistant is proactively noticing what needs to be done, and agents in the background are starting to execute these workflows on the Web on our behalf.
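
To make that shift in abstraction concrete, here is a minimal sketch of the interaction model Devi describes: instead of the user clicking through a site, the user (or a proactive watcher acting for them) states an outcome, and a background agent decomposes it into web actions. Everything here (`WebAgent`, `run_task`, the step list) is a hypothetical illustration, not Yutori's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class WebAgent:
    """Hypothetical background agent; all names here are illustrative only."""
    log: list[str] = field(default_factory=list)

    def run_task(self, goal: str) -> list[str]:
        # Stand-in for a real planner: decompose the stated goal into
        # browser-level actions and record them as they would execute.
        steps = [
            f"search the web for options matching {goal!r}",
            "open candidate pages and extract the relevant details",
            "fill in any required forms on the user's behalf",
            f"report back with the result of {goal!r}",
        ]
        self.log.extend(steps)
        return steps

# One level up the abstraction: state the goal, not the clicks.
agent = WebAgent()
for step in agent.run_task("book a table for two on Friday"):
    print(step)
```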

**Sam Charrington** (1:24)
All right, everyone, welcome to another episode of the TWIML AI Podcast. I am your host, Sam Charrington. Today, I'm joined by Devi Parikh. Devi is co-founder and co-CEO of Yutori. Before we get going, be sure to take a moment to hit that subscribe button wherever you're listening to today's show. Devi, it has been a while since we caught up last. Welcome back to the podcast.

**Devi Parikh** (1:46)
Thank you. Thank you for having me again.

**Sam Charrington** (1:49)
Yeah, five years later, not much has happened at all, right?

**Devi Parikh** (1:55)
In some ways, a lot has happened, but in some ways, I'm like, wow, it's been five years. So yeah.

**Sam Charrington** (1:59)
I know. I know. I know. So we're going to be talking a bit about AI browsers and browser use agents and what you're building at Yutori. But I'd love to have you take a few minutes and catch us all up on what you've been up to recently.

**Devi Parikh** (2:16)
Yeah, and I can go a little further back than five years, just to talk about my background. I've been working in AI for about 20 years now. My PhD thesis was in computer vision, and then over time I got interested in finding ways in which people can interact with these systems more naturally. That's how I moved towards multimodal problems at the intersection of vision and language: things like, given an image, can you describe it in a sentence? Can you answer questions about it? Can you have a conversation going back and forth about the content of an image? This was back in 2014 or so, after that initial excitement around deep learning models, when it was starting to feel like, wait, these models are actually doing some things, stuff is starting to work. But it was well before all of the current excitement around GenAI and LLMs, so these models weren't nearly as good as they are today. Still, it was kind of fun to tinker at the boundaries of what's possible.
And then I started getting interested in whether we could use AI as a tool for creative expression, and that's how I got involved with generative models for images, videos, music, and other modalities like that. I was in academia for a while, faculty at Virginia Tech and then Georgia Tech. Then I was at Meta for about eight years, first in FAIR and then in GenAI, where I was a senior director leading a lot of the multimodal research efforts. My teams were involved in models like Emu, Emu Video, and Emu Edit for image and video generation and editing, which shipped across Meta surfaces, and the multimodal capabilities in Llama 3 came from my teams as well. This was up until early last year, when my co-founders and I left Meta to start Yutori.

**Sam Charrington** (4:08)
Correct me if I'm misremembering this, but at some point you were interested in fashion. Am I remembering that correctly? Did you do some papers on fashion data sets or something?

**Devi Parikh** (4:20)
I did.
I had done a couple of projects in that space with other collaborators. I think the last time we talked, I was right at that edge of looking for the next thing, and I was starting to tinker in the space of, can we use AI as a tool for creative expression? So there were a whole bunch of weird little projects that I did at the time. Some of the fashion ones were more legit; I won't say those were weird, since I was doing them with other collaborators, but yeah.

[Transcript continues for another 46 minutes in the full episode.]
