Why Vision Language Models Ignore What They See with Munawar Hayat

The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

December 9, 2025

In this episode, we’re joined by Munawar Hayat, researcher at Qualcomm AI Research, to discuss a series of papers presented at NeurIPS 2025 focusing on multimodal and generative AI.
Speakers: Sam Charrington, Munawar Hayat
**Sam Charrington** (0:00)
Thanks so much to our friends at Qualcomm for their continued support and sponsorship of today's episode. Qualcomm AI Research is dedicated to advancing AI to make its core capabilities (perception, reasoning, and action) ubiquitous across devices. Their work makes it possible for billions of users around the world to have AI-enhanced experiences on devices powered by Qualcomm technologies. To learn more about what Qualcomm is up to on the research front, visit twimlai.com/qualcomm.

**Munawar Hayat** (0:31)
If you take these foundation models, proprietary models, and you give them a very simple image, like an image of two cardboard boxes, very plain, one bigger, one smaller, and you ask them to generate an image where the boxes are unstacked, then once they're unstacked, the physical properties of the boxes change. They're not exactly the same boxes: their shapes might be deformed, their sizes might be different, and this is problematic.

**Sam Charrington** (1:11)
All right, everyone, welcome to another episode of The TWIML AI Podcast. I am your host, Sam Charrington. Today, I'm joined by Munawar Hayat. Munawar is a researcher at Qualcomm AI Research. Before we get going, be sure to take a moment to hit that subscribe button wherever you're listening to today's show. Munawar, welcome to the podcast.

**Munawar Hayat** (1:31)
Hey, Sam, thanks a lot. Thanks for having me.

**Sam Charrington** (1:34)
I'm really looking forward to digging into our conversation. We're going to be talking about some of Qualcomm's papers on multimodal AI and visual AI from the recent NeurIPS conference. To get us started, I'd love to have you share a little bit about your background. How'd you get into the field?

**Munawar Hayat** (1:53)
I did my PhD in Australia, finishing in 2015. I got interested in computer vision after taking some undergraduate courses on image processing that fascinated me.
During my PhD, I worked on visual analysis of facial data, and since then I have contributed to several different subfields of computer vision. After my PhD, I stayed in academia in a faculty position, and then I moved to Qualcomm in 2023.

**Sam Charrington** (2:38)
What's your research focus at Qualcomm?

**Munawar Hayat** (2:40)
At Qualcomm, I have been working mostly on multimodal generative AI: vision-language models for understanding, generation, and retrieval. Those are the three areas where I have papers at NeurIPS, which I believe we're going to discuss in depth. When I joined Qualcomm, the first project I worked on was making diffusion models run efficiently on Qualcomm hardware, that is, on a mobile phone, generating images in under half a second. Then we also worked on visual understanding, that is, visual question answering: given an image and a text question, we want to respond to the question. For visual question answering, we also had a model running on Qualcomm hardware.
So basically, at Qualcomm I have been working primarily on multimodal generative AI: generating visual content, understanding visual content, and retrieving cross-modal information.
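
For context on the latency point, few-step sampling is the usual lever for pushing diffusion generation toward fractions of a second. The sketch below is illustrative only: it runs a distilled text-to-image model on a desktop GPU, not Qualcomm's on-device stack, and the model name, prompt, and parameters are assumptions rather than anything described in the episode.

```python
# Illustrative sketch of few-step diffusion sampling (not Qualcomm's on-device stack).
import torch
from diffusers import AutoPipelineForText2Image

# A distilled "turbo" checkpoint that is trained to sample in very few steps.
pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="two plain cardboard boxes stacked on a table",
    num_inference_steps=1,   # few-step sampling is what makes sub-second generation feasible
    guidance_scale=0.0,      # turbo-style models are trained to run without classifier-free guidance
).images[0]
image.save("boxes.png")
```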

**Sam Charrington** (3:53)
When you think about how the field has evolved over the past couple of years since you started there, what's most exciting for you about this moment right now?

**Munawar Hayat** (4:03)
I guess we have come a long way. For decades, we were focused on solving niche, very specific problems; think of how much time we spent just classifying different objects in images. We have moved on quite a bit since then, but there is still a lot to do, I guess, especially in visual understanding and in visual generation. In visual generation, physics-based generation is something we largely lack. And in visual understanding, once we marry a vision model with a language model, there are lots of works showing that the vision largely gets ignored and the language model takes over: the model responds from the parametric memory of the language model rather than by looking at the actual visual content.
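
To make that "vision gets ignored" point concrete, here is a minimal sketch (not from the episode) of one common way to probe whether a vision-language model is actually using the image: ask the same question with the real image and with a blank image, and compare the answers. The checkpoint, the local image file, and the blank-image ablation itself are all illustrative assumptions, not the guest's method.

```python
# Minimal sketch: blank-image ablation to probe reliance on language priors.
# The model checkpoint and the local image file are illustrative assumptions.
from PIL import Image
from transformers import pipeline

vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

question = "What color is the larger box?"
real_image = Image.open("stacked_boxes.jpg")               # hypothetical test photo
blank_image = Image.new("RGB", real_image.size, "white")   # visual evidence removed

with_vision = vqa(image=real_image, question=question)[0]["answer"]
without_vision = vqa(image=blank_image, question=question)[0]["answer"]

# If the two answers often agree across a test set, the model may be answering
# from the language model's parametric memory rather than from the image itself.
print(f"with image: {with_vision} | blank image: {without_vision}")
```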

**Sam Charrington** (5:06)
When you talk about physics-based generation, what do you mean by that?

**Munawar Hayat** (5:12)
So, physics-aware generation, meaning: we had a submission recently in which we observed that if you take these foundation models, the proprietary models, and you give them very simple images, like an image of two cardboard boxes, very plain, one bigger, one smaller, and you ask them to generate an image where the boxes are unstacked. While these models can generate very fancy images with intricate visual details, they struggle with a simple task like this, where we have very plain objects and we want to unstack them. Once the boxes are unstacked, their physical properties change: they are not exactly the same boxes, their shapes might be deformed, their sizes might be different.
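
The kind of test being described could look something like the sketch below, which is illustrative rather than the paper's actual setup: an instruction-following image editing model is asked to unstack the boxes, and the check is whether the edited boxes keep their shapes and relative sizes. The model choice, file names, and parameters are assumptions.

```python
# Illustrative sketch of an instruction-based edit like "unstack the boxes".
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

source = Image.open("stacked_boxes.jpg")   # hypothetical photo of two stacked cardboard boxes

edited = pipe(
    prompt="unstack the boxes and place them side by side",
    image=source,
    num_inference_steps=20,
    image_guidance_scale=1.5,   # how strongly the edit should stay faithful to the source image
).images[0]
edited.save("unstacked_boxes.png")

# The failure mode described above is then checked visually: do the edited boxes
# keep the same shapes and relative sizes as in the source image?
```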
