**Tim Scarfe** (0:21)
Welcome back to the Machine Learning Street Talk YouTube channel. Today, we're gonna talk about Sara Hooker's paper, The Hardware Lottery. We get stuck in basins of attraction based on the technology decisions that we made previously, and it's super expensive to jump out of those basins. The key question is: what causes inertia or friction in the marketplace of ideas? Is there a meritocracy of ideas, or do the previous decisions we've made enslave us? Sara Hooker calls this a lottery because she feels that machine learning progress is entirely beholden to the hardware and software landscape.
In today's episode, we talk about Sara's paper, the hardware lottery, but we also talk about pruning, bias mitigation, interpretability, and also the cultural divide which is present in the machine learning community.
Ideas succeed if they are compatible with the hardware and software at the time, and also the existing inventions.
If you haven't already, check out Yannic "Lightspeed" Kilcher's video about the hardware lottery. It's pretty cool, and he goes into a fair bit of detail. Now, the machine learning community is exceptional because the pace of innovation is so fast. We operate largely in the open, and it's because we don't build anything physical, which is expensive and slow. The cost of being scooped isn't so high for us.
Sara feels that hardware disadvantaged connectionist or deep learning approaches for many years. She argues that the deep learning revolution was a fluke. So is this a story unique to hardware and artificial intelligence algorithms, or is it really just a story of all innovation? Every great innovation must wait for the right stepping stone to be in place before it can really happen. The Apple Newton PDA in the 1990s never took off because the hardware hadn't caught up to the idea yet. People call the decades prior the AI winter or the lost decades. In my opinion, it was the presence of large data sets as well as the compute which triggered the charge.
In the early 2000s, deep neural networks were unreliable, finicky and poorly understood. We already had all of the knowledge on neural networks, but they were intractable on the hardware we had. So we missed out on the current charge.
People might look back on the 2000s and think that it was this decade where we wasted so much time on deep learning because we were stuck in this basin of attraction, chasing objectives, trying to improve metrics without having a serious rethink about how we should approach this problem. Arguably, the tech industry has abandoned AGI research in favor of using their petabyte storage clouds and massive compute to memorize the answers to everything instead of seeking to understand. It's not even possible to get a university PhD supervisor or funding unless you're interested in the statistical approaches to artificial intelligence.
There's a cultural divide in machine learning and many different schools of thought. Rich Sutton, who wrote The Bitter Lesson, thinks that we need to have massive amounts of compute and data. He thinks that we should stop trying to build overly simplistic models and instead rely on massive amounts of compute. He cites Deep Blue and AlphaGo as examples of brute-force search and learning rather than human strategies. Although, as Max Welling pointed out, he missed data as the fundamental ingredient in deep learning. He focuses on problems where you can generate your own data or where there's lots of data available already. So these are essentially interpolation challenges. The trouble starts when we need to extrapolate.
And this brings us on to Max Welling, who believes that we need better models with better priors. He believes that custom hardware and computation is one of the fastest ways to improve AI. He thinks deep learning is the best hammer we've produced so far, and it's hitting nails in very nicely every day. So he's a proponent of deep learning. But he thinks that we need to be building better priors into our models so that we can be more sample efficient.
We also have people like Noam Chomsky and Gary Marcus, who think we should seek to understand artificial intelligence and create handcrafted knowledge and rules. We also have folks like Judea Pearl and Chris Bishop, who advocate for a Bayesian approach and having causality as a first-class citizen in our artificial intelligence models.
A lot of the pruning literature has been misguided. A lot of it has been about how do we alter the data set to preserve top line metrics such as accuracy. The key idea is that the long tail represents the low frequency information in your distribution.
This might correspond to underrepresented groups or just esoteric and nuanced information which we didn't see many examples of during training. Neural networks are inefficient. The majority of parameters in the network are used to encode low frequency and challenging instances in a very costly way. You encode the head of the distribution, the high frequency components, with a small number of parameters. Almost all of the capacity of the network goes into effectively memorizing the tail of the distribution.