Autoresearch, Agent Loops and the Future of Work

**Nathaniel Whittemore** (0:00)
Today, we're discussing what Andrej Karpathy's weekend project, autoresearch, can tell us about the future of work. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI.
Now, today we are talking about a new project from Andrej Karpathy called autoresearch. And you might notice that we are doing an entire episode about this instead of our normal division into the headlines and the main episode. It's because I think that this topic is actually even more significant than it seems on the surface. One would be tempted to think that all of us nerds were just getting overexcited because Andrej Karpathy, who is held in such esteem, released a new GitHub repository. And while that is certainly true, there is something bigger going on here.
You might remember me talking a couple of months ago about something called Ralph Wiggum. Ralph is, in simplest terms, a software development loop that keeps running, building software in an iterative and persistent way by looping the same instructions over and over again. It's named after the Simpsons character Ralph Wiggum for his lovable and indomitable persistence despite whatever is going on around him. Now, we'll talk more about Ralph in a little bit, but the key concept to take away is this idea of an iterative loop. Karpathy's autoresearch is also, at its core, about an iterative loop, and I think that combined, what you have is arguably a new type of work primitive. Primitives are the basic building blocks of work that are so fundamental that they show up everywhere, across roles and industries, and that people reach for automatically once they have them. New ones don't come around very often, and so this idea that agentic loops might be one is, I think, worthy of some serious scrutiny. But let's talk about what Andrej actually released first, and then we will come back to that.
On Saturday, Andrej, who was on the founding team at OpenAI, who was previously the Director of AI at Tesla, who you might remember for coining the term vibe coding last February, and who has suggested that as of this February we are in a different era of agentic engineering, tweeted: "I packaged up the autoresearch project into a new self-contained minimal repo if people would like to play over the weekend. It's basically nanochat LLM training core stripped down to a single-GPU, one-file version of around 630 lines of code. Then the human iterates on the prompt (.md), and the AI agent iterates on the training code (.py). The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement. In the image," which he shared alongside it, "every dot is a complete LLM training run that lasts exactly five minutes."
"The agent works in an autonomous loop on a Git feature branch and accumulates Git commits to the training script as it finds better settings, of lower validation loss by the end, of the neural network architecture, the optimizer, all the hyperparameters, etc. You can imagine comparing the research progress of different prompts, different agents, etc. Part code, part sci-fi, and a pinch of psychosis."
As a caption to the image, he wrote: "One day, frontier AI research used to be done by meat computers in between eating, sleeping, having other fun, and synchronizing once in a while using sound wave interconnect in the ritual of a group meeting. That era is long gone. Research is now entirely the domain of autonomous swarms of AI agents running across compute cluster megastructures in the skies. The agents claim that we are now in the 10,205th generation of the code base. In any case, no one could tell if that's right or wrong as the code, quote unquote, is now a self-modifying binary that has grown beyond human comprehension. This repo is the story of how it all began."
So let's talk about what autoresearch actually is, at least in the version that was released by Andrej. Autoresearch is a system for training a small language model. Basically, the kind of model that powers all of these AI tools, but much smaller: the type of model that could one day run on, for example, an edge device like a phone. The goal is to make a model as good as possible at understanding and generating text. Normally, or classically, a human researcher would sit there tweaking the training setup, doing things like adjusting the model's architecture, changing how fast it learns, and experimenting with different optimization strategies. They'd run an experiment, check the results, decide what to try next, and repeat. That's basically the core loop of machine learning research, and it's bottlenecked by how fast a human can iterate. Autoresearch instead hands that entire loop to an AI agent.
And it does so in an intentionally simplified and tiny way. In this repo, there are just three files that matter. The first is prepare.py, which is fixed infrastructure that doesn't change. It downloads the training data, trains a tokenizer, and handles evaluation. The second is train.py. This contains the entire GPT model definition, the optimizer, and the training loop. This is the single file the AI agent is allowed to edit. Everything in it is fair game: the model architecture, the hyperparameters, the batch size, the attention parameters, the learning rate schedule, literally everything. The third file is program.md. And this is the most conceptually important one, especially in the context of this idea of these loops being larger primitives. It's a Markdown file, plain text, written in English, that contains the instructions for the AI agent. It describes how the agent should behave as a researcher, what kind of experiments to try, what to be cautious about, and when to be bold versus conservative. This is the file that the human in this equation edits.
So the way that this works is you point an AI agent, like Claude or Codex or whatever, at the repo, and tell it to read program.md and start experimenting. The agent reads the instructions, looks at the current state of train.py, decides on a modification to try, makes the edit, and kicks off a training run. Every training run has a fixed five-minute budget. When the run finishes, the system evaluates the model on a validation set and produces a single number. In this case, that's validation BPB, or ValBPB, which stands for validation bits per byte. Lower is better. The agent then makes a decision. If the new ValBPB is lower than the previous best, the change is kept: it gets committed to a Git feature branch, it becomes the new baseline, and the agent builds on top of it for the next experiment. If the ValBPB is the same or higher, the change is discarded.
The agent reverts to the previous best version and tries something different. Then the loop repeats indefinitely. Because of that five-minute constraint, you can run this for an hour and get 12 experiments. You can run it overnight and get about 100. The session that Andrej shared showed 83 experiments, of which 15 produced improvements that were kept, and which drove the ValBPB from 0.9979 down to 0.9697.
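The keep-or-discard loop described above can be sketched in a few lines of Python. This is a toy simulation under stated assumptions, not code from the repo: `run_experiment` is a hypothetical stand-in that returns a random score instead of editing train.py and training for five minutes, and the names `LoopState`, `research_loop`, and `experiments_in` are made up for illustration.

```python
import random
from dataclasses import dataclass, field

RUN_MINUTES = 5  # fixed training budget per experiment, as described in the episode


@dataclass
class LoopState:
    best_val_bpb: float                        # current baseline (lower is better)
    kept: list = field(default_factory=list)   # changes that would be committed
    discarded: int = 0                         # changes that would be reverted


def run_experiment(rng: random.Random, current_best: float) -> float:
    """Hypothetical stand-in for: edit train.py, train 5 minutes, evaluate.

    Returns a simulated validation bits-per-byte near the current best.
    """
    return current_best + rng.uniform(-0.002, 0.006)


def research_loop(start_bpb: float, n_experiments: int, seed: int = 0) -> LoopState:
    """Run the keep-or-discard loop for a fixed number of experiments."""
    rng = random.Random(seed)
    state = LoopState(best_val_bpb=start_bpb)
    for i in range(n_experiments):
        val_bpb = run_experiment(rng, state.best_val_bpb)
        if val_bpb < state.best_val_bpb:
            # Strictly better: keep the change and make it the new baseline.
            state.best_val_bpb = val_bpb
            state.kept.append(f"experiment-{i}")
        else:
            # Same or worse: discard and retry from the previous best.
            state.discarded += 1
    return state


def experiments_in(hours: float) -> int:
    """How many full five-minute runs fit in the given number of hours."""
    return int(hours * 60 // RUN_MINUTES)


if __name__ == "__main__":
    state = research_loop(start_bpb=0.9979, n_experiments=83)
    print(f"kept {len(state.kept)}, discarded {state.discarded}, "
          f"best {state.best_val_bpb:.4f}")
    print(experiments_in(1), experiments_in(8))
```

With a five-minute budget, `experiments_in(1)` gives 12 runs and `experiments_in(8)` gives 96, which lines up with the "about 100 overnight" figure if overnight means roughly eight hours. The simulated scores are random and say nothing about what a real session would find.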