Finbarr Timbers is an AI researcher who writes
1 — one of the technical AI blogs I’ve been recommending for a long time — and has a variety of experience at top AI labs, including DeepMind and Midjourney. The goal of this interview was to do a few things:
Revisit what reinforcement learning (RL) actually is, its origins, and its motivations.
Contextualize the major breakthroughs of deep RL in the last decade, from DQN for Atari to AlphaZero to ChatGPT. How could we have seen the resurgence coming? (see the timeline below for the major events we cover)
Modern uses for RL, o1, RLHF, and the future of finetuning all ML models.
Address some of the critiques like “RL doesn’t work yet.”
It was a fun one. Listen on Apple Podcasts, Spotify, YouTube, and wherever you get your podcasts. For other Interconnects interviews, go here.
Timeline of RL and what was happening at the time
In the last decade of deep RL, there have been a few phases.
Era 1: Deep RL fundamentals — when modern algorithms were designed and proven.
Era 2: Major projects — AlphaZero, OpenAI Five, and all the projects that put RL on the map.
Era 3: Slowdown — when DeepMind and OpenAI no longer ran major RL projects and the field’s cultural relevance declined.
Era 4: RLHF & widening success — RL’s new life post ChatGPT.
Covering these eras are the following events. The list is incomplete, but it is enough to inspire a conversation.
Early era: TD-Gammon, REINFORCE, etc.
2013: Deep Q Learning (Atari)
2014: Google acquires DeepMind
2016: AlphaGo defeats Lee Sedol
2017: PPO paper, AlphaZero (no human data)
2018: OpenAI Five
2019: AlphaStar, GPT-2, early robotic sim2real with RL papers (see blog post)
2020: MuZero
2021: Decision Transformer
2022: ChatGPT, sim2real continues.
2023: Scaling laws for RL (blog post), doubts about RL
2024: o1, post-training, RL’s bloom
Chapters
[00:00:00] Introduction
[00:02:14] Reinforcement Learning Fundamentals
[00:09:03] The Bitter Lesson
[00:12:07] Reward Modeling and Its Challenges in RL
[00:16:03] Historical Milestones in Deep RL
[00:21:18] OpenAI Five and Challenges in Complex RL Environments
[00:25:24] Recent-ish Developments in RL: MuZero, Decision Transformer, and RLHF
[00:30:29] OpenAI's O1 and Exploration in Language Models
[00:40:00] Tülu 3 and Challenges in RL Training for Language Models
[00:46:48] Comparing Different AI Assistants
[00:49:44] Management in AI Research
[00:55:30] Building Effective AI Teams
[01:01:55] The Need for Personal Branding
We mention
Transcript
Nathan Lambert [00:00:00]: Hey, Finbarr, welcome to the show.
Finbarr Timbers [00:01:28]: I'm really excited to chat.
Nathan Lambert [00:01:30]: Yeah, this has been a bit of a long time coming, at least whether or not it's recorded. This chat has been a long time coming and we have some more timely things to discuss. I think depending on when we did this, the actual discussion topics would be different. And this time it's kind of like since O1, people have been more bullish on reinforcement learning again, which is always fun. I think it goes through this whole cycle of debates. Last year, we probably would have done PPO versus DPO, which also would have been fun. But for whatever reason, that's kind of on the back burner. So I think largely, let's just talk about RL and we'll get to the fun recent stuff in a bit. But just to kind of set the groundwork, like how do you define reinforcement learning?
Finbarr Timbers [00:02:14]: Yeah. Yeah, totally. I think we're definitely in the we're so back era of RL. Exciting. I mean, so like, look, like the classic definition is, you know, I really agree with it, like the classic MDP, like, you know, you have states, you have rewards, you choose actions to get new states and rewards. That's, you know, that is RL. So if you're going to define it, that's the answer you have to give. I think the question is, like, what is like RL in a, and you can use that framework in like so many ways. Like you could just say, oh, supervised learning is RL using that framework. So maybe the more useful definition is sequential decision-making under uncertainty. And so when I think about what makes a reinforcement learning task or like something that, you know, is actually has an interesting level of reinforcement learning, I think that that's when you have to make that exploration exploitation trade off. And particularly when there's non-trivial exploration or, you know, discovery of new knowledge is kind of how I think about it. So if we're thinking about Atari, Atari is an interesting reinforcement learning problem because we have to learn how to play these games without any prior information. And so supervised, you know, it's not like supervised learning at all, where we have this knowledge and, you know, we're actually discovering how to play this with, from a blank slate. So that's kind of how I think about reinforcement learning. And that's, you know, we'll get into this later. That's also why I think RLHF, while it's great, I don't think it is like true reinforcement learning to gatekeep.
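For reference, the classic MDP framing Finbarr describes reduces to a simple loop over states, actions, and rewards. Here is a minimal sketch, with hypothetical `Environment` and `Agent` classes standing in for any real environment or policy:

```python
# Minimal sketch of the MDP loop: states, actions, rewards, new states.
# `Environment` and `Agent` are illustrative stand-ins, not any specific library.
import random

class Environment:
    """A toy two-state MDP purely for illustration."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 from state 0 pays off; everything else gives nothing.
        reward = 1.0 if (self.state == 0 and action == 1) else 0.0
        self.state = random.choice([0, 1])   # transition to a new state
        done = random.random() < 0.1         # episodes end eventually
        return self.state, reward, done

class Agent:
    def act(self, state):
        return random.choice([0, 1])         # placeholder policy

env, agent = Environment(), Agent()
state, total_reward, done = env.reset(), 0.0, False
while not done:
    action = agent.act(state)                # choose an action from the state
    state, reward, done = env.step(action)   # receive a new state and reward
    total_reward += reward
print(total_reward)
```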
Nathan Lambert [00:03:45]: Yeah. It's so hard to get across. The RLHF thing is like that you have this like environment that's a reward model. So you like control the reward model through your, you control the environment through your own kind of setup, which I just find to be so silly. I don't really, I don't really know what to make of this, but we'll get to this later. Like how did you get working in RL? Was this just the DeepMind arc where everyone ends up being an RL stan? I was supposed to go, I was supposed to do my DeepMind internship in London and I was very excited to kind of drink this Kool-Aid for summer, but that's when COVID hit and then all the internships were just killed because they had, yeah, it was pretty funny. This is DeepMind lore is that the 2020 interns, instead of going remote, they nixed the whole program due to COVID. And then I think they had labor law concerns. So they gave all of us double salary to come back in 2021. So they're like, sorry for canceling your internship. Here's 40 grand. That's an aside.
Finbarr Timbers [00:04:49]: We had someone, I forgot that that happened because, you know, we were in, I was in the Edmonton office, in Alberta here. And we had an intern who started on whatever, like March 9th or whatever the day was where we were all sent home. And, you know, we didn't go back to the office for like two years or whatever. And so we had a guy, that was his first day and he was from the Czech Republic. And so, you know, he had flown all the way over here. He got to Edmonton, right, which is whatever, you know, 5,000, you know, kilometers away. And then he was like, oh, welcome, go home and, you know, stay in your apartment.
Nathan Lambert [00:05:33]: So he at least got a handshake from Rich Sutton. Was Rich Sutton in your area of influence at DeepMind at the time? Was he at DeepMind when you were there in Edmonton?
Finbarr Timbers [00:05:41]: I was in Rich Sutton's area of influence. I don't think he was in my, although I recommended books to him. We had a longstanding fiction book recommendation. But yeah, yeah, no, I worked with him. I know him. I went to his, he hosts a garden party every year and I went to that.
Nathan Lambert [00:05:58]: Okay, well, it's pretty funny. We'll come back to this, but like, how does the Edmonton office of RL kind of relate to the bigger RL culture of DeepMind at the time? Because we're going to kind of go into the details of the timeline. I think we can place where you were there versus RL history.
Finbarr Timbers [00:06:16]: So my background, just to give you some context here, is very odd. So my undergrad was in the math and economics program at the University of Alberta here. And so it was basically a math degree, you know, like a pure math, like I was planning to go to grad school, become a mathematician, like that was the education. And you know, also taking some econ courses because I thought, okay, I want to be an academic economist. Like how can, you know, how can I do that? It turns out the best way to become an academic economist is to do a math degree, which is an odd quirk and maybe a condemnation of the economics profession. So I did that, then I did my master's at the London School of Economics in econometrics. So like statistics applied to economics. And I got really disillusioned with economics when I was there and I thought, okay, you know, this is garbage. I'm going to go and do, go into machine learning, which seemed, you know, this was in 2014. So it's kind of the heyday, like, you know, AlexNet had just happened, DQN had just happened, and it was really exciting to see everything happen in machine learning. So I thought, okay, I'm going to go into this. And so I had this game theory background from working in economics, right, where you do like game theories, you know, one of the fundamental parts of economics, particularly once you get into the kind of graduate, like doctoral level stuff. And I was, that was when they opened up the, the DeepMind office in Edmonton. So they had a large game theory team, which was formerly the computer poker research group at the University of Alberta. They kind of, they hired, you know, Rich Sutton's lab, and they hired the computer poker research group. And you know, they brought them on board. And so then I was part of the team that, you know, they were hiring to help with the poker research. So then that was kind of how I got into RL. And it was really interesting because the Edmonton, the University of Alberta generally has this really different approach to machine learning and to reinforcement learning than a lot of the broader community. In particular, they're very skeptical about deep learning, which is interesting because you'll go to talks there, and you'll have all these people who are kind of like, I mean, I haven't been to talks in a few years since, truthfully, since before COVID. So maybe it's changed. But you go to these talks, and they'd kind of be really skeptical about deep learning and think, okay, like clearly deep learning works, but, you know, it's this really unsatisfying answer. And they'd really be probing that and trying to find, you know, better solutions than just saying, okay, let's, you know, use more data, let's use more compute, like let's make the network bigger. And it was just a really interesting perspective to see because, you know, generally the perspective of the broader community was totally different. It was like, oh, yeah, neural networks go brr, let's make it bigger, let's use more compute. So it was just really interesting.
Nathan Lambert [00:09:03]: This is funny because The Bitter Lesson also came from Rich in, oh, was it like 2019 is when this blog post was written? Yeah. That is pretty much disowning most of that sentiment. Well, I don't think it is.
Finbarr Timbers [00:09:13]: I think a lot of people misunderstand The Bitter Lesson. The point of The Bitter Lesson isn't, let's just, you know, blindly use more compute. There's a bit of a straw man, that's not what you said. But the straw man that I see other people, you know, who are, say, on Twitter, is that, you know, oh, you know, Rich Sutton believes in scale, like, you know, just make the networks bigger. And I don't think that's true. I think it's just you should, it's more of a response to stuff like Deep Blue, the IBM chess playing agent, or to the general history of expert systems, where you have a bunch of smart humans baking their biases into the system. That's what I think The Bitter Lesson is a response to; it's saying that we should just, we should try to find methods that, you know, don't rely on human knowledge and that discover things. So I don't think, yeah. So I mean, I think, you know, there's a scale of compute, but it's not a pure scale thing.
Nathan Lambert [00:10:07]: There's a phrase that I was going to highlight in it, which is the scaling learning and search. And I think the GPT's self-supervised stuff is very much on those kind of scaling learning side.
Finbarr Timbers [00:10:17]: Yeah.
Nathan Lambert [00:10:18]: And I don't really think many people, I don't know if I have a good worldview about this around kind of what the scaling search part would mean. I think some people now are going to kind of overly attribute that to O1. I have a whole blog draft written about how you can explain O1 without actually using inference time search. You just do a lot of RL and kind of let it be weird. So I don't really know if, I think it'd be great to have an answer from Rich himself on like, what does the scaling search part mean in context of, I mean, it's five years old now. So like, what does he think of that? Because the compute part is obvious, but I don't think we've done the other half.
Finbarr Timbers [00:10:59]: I work out of the Alberta Machine Intelligence Institute, AMII, occasionally. And so yeah, maybe, which is where he, the research institute that he's a PI at, and so maybe I'll try to corner him at the coffee craft and get his thoughts on it. But I think that NLP is a really great example. I think you're totally right. The GPT line of research is a perfect example of The Bitter Lesson, because if you look at the pre, not that I claim to be an NLP expert, but if you look at the pre-deep learning approaches to NLP, it's stuff like these huge N-gram models, which really break down, and, you know, they have a lot more structure to them. And then, yeah, like you go to the GPT setup, like it's just where there's a lot less structure and you're learning it all. And so, yeah, it's absolutely this, like there's no baked in biases. It's all like learned. And I think it's, yeah, it's a great example of that.
Nathan Lambert [00:11:55]: Can the reward function be a bottleneck for scaling search? Like is the reward function in applying RL, which is often manually tuned by humans to do certain things? Like, is that going to be a bottleneck in scaling search?
Finbarr Timbers [00:12:07]: Yeah. I mean, I think that the reward modeling generally, we're terrible at it and we do a really bad job of it. Like that's where, like, if you remember, I think it was Llama 3 or maybe Llama 2, one of the Llama releases, it was kind of bad. Like the initial one that they, the initial Llama 2, it was Llama 2, and I think a lot of that was because they just overfit to their reward model because the reward model wasn't super robust. And so I think that's just like a general, yeah, we don't do a good job modeling reward. The standard thing to do is train these Bradley Terry models for preferences. And it's just not, it's just hard to do that in a really robust way. So yeah, I think that absolutely, I think that is the major bottleneck. And if we could have better reward models, I think that would massively accelerate research in, not just in the reasoning style, but like generally in generative AI more broadly.
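The "Bradley Terry models for preferences" Finbarr mentions boil down to a simple pairwise loss over chosen and rejected responses. Here is a minimal sketch, where `reward_model`, `chosen`, and `rejected` are hypothetical placeholders (a toy linear scorer over made-up features, not any lab's actual setup):

```python
# Minimal sketch of the Bradley-Terry preference loss used to train reward models.
# P(chosen > rejected) = sigmoid(r_chosen - r_rejected); we maximize its log-likelihood.
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_model, chosen, rejected):
    r_chosen = reward_model(chosen)      # scalar score for the preferred response
    r_rejected = reward_model(rejected)  # scalar score for the rejected response
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy usage: a linear "reward model" over 8-dimensional response features.
reward_model = torch.nn.Linear(8, 1)
chosen, rejected = torch.randn(4, 8), torch.randn(4, 8)
loss = bradley_terry_loss(reward_model, chosen, rejected)
loss.backward()
```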
Nathan Lambert [00:13:04]: Yeah, I almost think it's getting trained into like verifiers is how people are thinking of it rather than reward models. So like a verifier for if a question is right. But then eventually if you have a verifier for everything, it's not different at all than a reward model. Yeah.
Finbarr Timbers [00:13:18]: I mean, I think that's what's so exciting about the, sorry, how do I pronounce it, Tulu? Yeah.
Nathan Lambert [00:13:25]: Tulu. Tulu. Tulu is a very not well-known species of hybrid camel. So I think it's like a Bactrian camel and the other type of camel. If you crossbreed them, you get a Tulu camel, but it's hard to search on many language models because Tulu is also a dialect, I think of some Indian language. So it's really hard, like it's really hard to search.
Finbarr Timbers [00:13:50]: Well, that's, that's very neat. I think that's what's so exciting about the, what was it, the RL with reinforcement learning
Nathan Lambert [00:14:00]: with verifiable rewards. Yeah.
Finbarr Timbers [00:14:02]: I think that's so exciting because it's like you have this great, like you've a reward function that's not just like fitting another deep net. Like it's something that actually, yeah, like has really robust rewards. I really liked that approach. I think stuff like that, like, yeah, the verifier, I think that that's a really good approach that's going to be used.
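Concretely, a "verifiable reward" of the kind discussed here is just a programmatic check rather than another learned network. A minimal sketch for a math prompt with a known answer (the answer-extraction convention below is illustrative, not the actual Tülu implementation):

```python
# Minimal sketch of a verifiable reward: 1.0 if the model's final answer matches
# the ground truth, else 0.0. The "the answer is ..." convention is made up for illustration.
import re

def math_reward(completion: str, ground_truth: str) -> float:
    match = re.search(r"answer is\s*(-?\d+(?:\.\d+)?)", completion, re.IGNORECASE)
    if match is None:
        return 0.0                        # no parseable answer, no reward
    return 1.0 if match.group(1) == ground_truth else 0.0

print(math_reward("Let me check my work... so the answer is 42.", "42"))  # 1.0
print(math_reward("I think the answer is 41.", "42"))                      # 0.0
```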
Nathan Lambert [00:14:20]: There's a lot of work on this. I think we'll come back to this because we'll do a whole O1 section. I kind of want to go through the RL history and see what we kind of uncover because it kind of ends with O1 is the most recent major event because I've kind of split it up into somewhat of eras where I have like era one, which is before, which is like, I call it deep RL fundamentals, but it's pretty much everything until the big projects started coming. So TD-Gammon is the one that people start with, which is, it's just funny how academia ends up like this. I have no semblance of what TD-Gammon is actually a breakthrough for. It's just like the seminal citation. And then we started having actual like deep RL algorithms and REINFORCE came along. And then DQN is the thing that you kind of mentioned, it's where I think the switch kind of flipped for deep RL, which I don't really, it's just so funny how these timelines come about. And I think one of my overarching things is like, it's so weird how deep RL is culturally associated with grand breakthroughs, but there's kind of the undercurrent of stuff just continuing to get better and better in the background. So it's, it's almost like makes the community have a hard time. So I can just keep going through. So 2013 was Deep Q-learning for Atari, the first paper. And then I think they had another one in 2014, a bigger paper, but then like 2016 is when AlphaGo actually happens. So this is before the PPO algorithms. So AlphaGo as a project was very early. So I kind of see it as being before most of the modern deep learning work really started, which I think sequentially in my brain, I had misunderstood.
Finbarr Timbers [00:16:03]: Well, it's interesting because like, yeah, yeah, no, you're totally right. Like what were we doing in 2016 on the, like ImageNet was still a really difficult problem at that point. Right. And yeah, like they're doing this crazy, yeah, like actually learning like a really difficult value function. But what's really interesting though, is if you go back and you look at, if you read the TD-Gammon paper, which I actually, I did, or I read the section of Sutton and Barto that deals with TD-Gammon and it's basically like AlphaZero, like there's so many things they do that are exactly what AlphaZero and MuZero would do later. Like it's really remarkable how much of a modern paper it is. The only difference is that, you know, the network, it's like a three-layer MLP and there's nothing fancy, but it's really remarkable how prescient it was.
Nathan Lambert [00:16:56]: It almost seems like AlphaGo as a success was this like DeepMind pioneering their infrastructure. So if you worked in RL at DeepMind, you would see that they have like all this async learner, multiple actors, distributed setups where they could just kind of scale things up and just keep learning more. And like, that seems like what AlphaGo was, it's just, you have bigger, if you scale deep learning so that the representations can work on the value network, and then you just need the infrastructure to get through enough compute with it.
Finbarr Timbers [00:17:28]: Yeah, well, I think that's kind of interesting, actually. I think a lot of the benefits from DeepMind infrastructure is just like general kind of software infrastructure. So yeah, like the scale stuff you're talking about. But it's also that like when you use, you know, they have this framework called X Manager, which is how you launch experiments. And there's an open source version of it, so you can check it out. And it makes it very easy to, yeah, like launch these multi-process jobs where you can say, you know, okay, we have a bunch of actors, we have a bunch, we have a replay buffer and we have a learner and we have like an eval job. And you have these, you know, these four different job types and they can all communicate with each other and they make that very easy. Whereas as soon as you get outside the ivory tower of Google and you try to do that yourself, it's just really painful and hard to do. And so yeah, like having more. It was like six years earlier.
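The actor, replay buffer, and learner jobs Finbarr describes can be approximated outside Google with ordinary queues. A deliberately tiny single-machine sketch, with threads standing in for the separate distributed jobs a framework like X Manager would launch (this is not X Manager's actual API):

```python
# Tiny single-process sketch of the actor / replay buffer / learner pattern.
import queue
import random
import threading
import time

replay_buffer = queue.Queue(maxsize=1000)

def actor(actor_id: int, steps: int = 50):
    for _ in range(steps):
        transition = (random.random(), random.randint(0, 1), random.random())  # (state, action, reward)
        replay_buffer.put(transition)   # ship experience to the learner
        time.sleep(0.001)               # pretend to interact with an environment

def learner(updates: int = 20, batch_size: int = 8):
    for step in range(updates):
        batch = [replay_buffer.get() for _ in range(batch_size)]
        avg_reward = sum(t[2] for t in batch) / batch_size
        print(f"update {step}: avg reward in batch = {avg_reward:.3f}")  # stand-in for a gradient step

actors = [threading.Thread(target=actor, args=(i,)) for i in range(4)]
learner_thread = threading.Thread(target=learner)
for t in actors:
    t.start()
learner_thread.start()
for t in actors:
    t.join()
learner_thread.join()
```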
Nathan Lambert [00:18:20]: Open source has it a bit now, but it wasn't until the low 2020s. So like I know someone I work with now, like CleanRL, they have async multi-learner PPO and stuff, but that's just so much later.
Finbarr Timbers [00:18:35]: Yeah, no, exactly. And I mean, like this is stuff, you know, it was a lot harder than when they did it for AlphaGo because, you know, I think that was just after the acquisition. I forget when the acquisition happened. But yeah, like it's just Google's infrastructure makes it really easy to do this stuff and then to launch it at massive scale. So yeah, like that's this huge advantage. I mean, the other thing is that they have all this logging, I mean, a lot of this came later, but like, yeah, the Google infrastructure is an incredible advantage that they have. And I desperately wish that they would release more of it to the public. You know, even if you could, if it was like a paid service on GCP, I would love that because it's something I desperately miss.
Nathan Lambert [00:19:13]: I'm surprised that they don't have a big, like looking at this RL stuff, it makes me feel more sad that they don't have a bigger advantage in language models. It's like their RL infrastructure was six years ahead. How did they blow that on language models? They just didn't have it?
Finbarr Timbers [00:19:30]: Well, I think that the thing you have to look at though is the gradient, like how quickly are they improving? And if you look at how quickly, like we had what Bard a year ago, like what Gemini is relatively new. I think they're rapidly improving and I think that they're going to be, you know, the people to beat. I think it's that, yeah, it wasn't super high status, it wasn't a huge priority for them to make these large models until, you know, ChatGPT came out and then all of a sudden it was and they turned the ship. It's this massive ship that takes forever to turn and they finally turned it and they're all, you know, full steam ahead. Like I'm very curious to see what the next version of Gemini looks like, especially now that they've got, you know, Noam and some of the other people coming back on board too. Yeah.
Nathan Lambert [00:20:17]: It's been leaked that Gemini 2 is supposed to come out next week. So when this is actually released, it might be over almost the same day. Like someone posted on LinkedIn saying Gemini 2, some like salesperson in GCP was like Gemini 2 is coming in this week of December. I was like, oh boy. I digress. Continuing the timeline, 2017 is another good one, which is the PPO paper, which feels very late in my mind frame, because that's like the seminal algorithm for most people in like modern deep RL. And the same year as the PPO paper was AlphaZero, which I think AlphaZero is largely a demonstration of The Bitter Lesson again, which is removing human data, which is kind of a, just a constraint in AlphaGo. And then the performance is even better. I don't know if they use, they probably didn't use less compute for AlphaZero, but I would be interested to see if you could do less compute when you remove human data. You probably can't. It probably just gets to a better final spot.
Finbarr Timbers [00:21:18]: I thought they did. I thought, well, I mean, you know, I need to go back and look at the papers, but for AlphaGo, they trained it for weeks. I don't know how many machines they were using, but AlphaZero was, you know, they used more machines, but it was only trained for like 20 hours. It was this very, for Go at least. So I think that they did use less compute, but maybe I'm, maybe I'm misremembering that.
Nathan Lambert [00:21:41]: I mean, even the time shrink is representative, which is like in a year, it shrinks by a factor seven with algorithmic and compute improvements. And then 2018 was OpenAI Five, which is like OpenAI's big foray into doing similar things. So like, this is the only point where they kind of, I mean, they have other interesting works, but this is like their biggest major deep RL project, which kind of looks like an outlier at this point. I guess there's more DeepMind projects, but it's like so interesting that OpenAI did this and then like immediately turned ship when they, like the next year, I think was when GPT, the next, I don't know if it was GPT-2 or 3 came out. It's like this OpenAI Five, which is just very large scale, multiplayer PPO, if I recall correctly. Yeah.
Finbarr Timbers [00:22:27]: Well, cause GPT-2 was, yeah, February of 2019. So it's, yeah, it was right after that. And I mean, you know, I think the other thing is that with OpenAI Five, it, it wasn't that great, right? Like it, like, you know, like, like it's a really difficult problem. I don't, I certainly couldn't do better. But the performance really, like it was good, but it could still be beat by humans. And so I think that that's something where I'm curious if they started to see the writing on the wall, that there are these really difficult problems in reinforcement learning that we don't have the tools to solve right now. And I think with Dota specifically, they started running into the problems of exploration of this incredibly difficult exploration task. And then all of the game theoretic issues that we see arise in games of imperfect information, right? Like games like Stratego and, and poker and, you know, this is something that I worked on for five years at, at DeepMind is, is doing, you know, games with imperfect information. And the math is just incredibly difficult. It is so much more difficult in my opinion, than the games of perfect information like chess and Go. It makes sense to me that they'd start to hit a wall and we see the same thing with AlphaStar. Like AlphaStar was this incredible effort, but it really, you know, I, like, I don't think it was anywhere near as optimal as AlphaZero was in Go. And so it's, yeah, it opened my eyes on it and I just thought, oh, man.
Nathan Lambert [00:23:57]: I feel like these are kind of the turning points in where RL was going and kind of how it was perceived. So 2018 is OpenAI Five, 2019 is AlphaStar, which was the next one, which I don't even know what the algorithmic changes were other than probably some things to handle the uncertainty that you talked about. And then also I have marked out 2019 as when some of these kind of seminal sim-to-real RL papers start, which is something that like RL actually works for. So like RL for locomotion is kind of something like almost industry standard. I don't know exactly to the extent that it is, but some robotics companies, some robotics companies rely very heavily on sim-to-real for locomotion of quadrupeds and stuff like this. So there's a whole lot of work from 2019 to 2022 and today using this. And the first major paper was in 2019, I can read the title, it's like Learning Agile and Dynamic Motor Skills for Legged Robots, I think. I mean, it has Vladlen Koltun, Marco Hutter on it, and Marco Hutter has a lot of these. I mean, those are very big names in the robotic learning world. And then I think the kind of big, the momentum for big RL game projects has slowed. I really wonder if we revisited RL in games today, if it could be bigger, because you have like the representation capacity of the transformer for better value networks. But I don't know if that was the bottleneck. I don't know if it was just like a basic, if you can't learn the right value function or not. It's kind of hard to debug in that way.
Finbarr Timbers [00:25:24]: I don't even think it's that. I think that the problem is exploration. And it's like, how do you discover a good, not even good, like a policy that does reasonable things in, or interesting things in a game like Dota or StarCraft? Like you can't epsilon-greedy your way there, I mean, you can, I guess, if you use enough compute, but it takes a lot of compute to do it. So I think that that's where, yeah, these large language models are interesting, because if we can take the, like they could probably come up with an interesting default policy that we could then, you know, epsilon-greedy, you know, into some exploration. But it's, yeah, like how do we get this like base interesting policy that we can do stuff with? That's this really difficult problem that we haven't historically had a great answer for yet. Maybe now with LLMs, we can.
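For reference, the epsilon-greedy exploration Finbarr says won't cut it in games like Dota is about as simple as exploration strategies get. A minimal sketch over a toy table of action-value estimates (the action names and values are made up):

```python
# Minimal sketch of epsilon-greedy action selection: with probability epsilon,
# explore a random action; otherwise exploit the current best estimate.
import random

def epsilon_greedy(q_values: dict, epsilon: float = 0.1):
    if random.random() < epsilon:
        return random.choice(list(q_values))      # explore
    return max(q_values, key=q_values.get)        # exploit

q = {"attack": 0.7, "defend": 0.2, "farm": 0.5}    # toy action-value estimates
actions = [epsilon_greedy(q, epsilon=0.1) for _ in range(1000)]
print(actions.count("attack") / len(actions))      # mostly "attack", occasionally random
```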
Nathan Lambert [00:26:13]: Yeah, we'll come back to exploration with O1, because I think it handles it in kind of a strange way. It's interesting from exploration, and we should talk about this. We're into like the, what I call the era of like the slow years, which is 2020 is MuZero, which is obviously a huge accomplishment. I was working in model-based RL at the time. So MuZero was like, oh, we're validated. Let's go. Model-based RL. It didn't change that much. And then 2021 Decision Transformer, mostly just like to represent that the RL community was trying to embrace transformers. I don't necessarily think that they had the answer. Decision Transformer and similar things were mostly for offline RL. And I don't think those things have dramatically shifted the needle. I would guess there might be some applications of it, but it's mostly like good academic work and hasn't landed in the real world.
Finbarr Timbers [00:27:04]: Well, in many ways, RL seems perfectly designed for transformers because you have a sequential, you have a sequence of historical states and actions. And so it's like, yeah, okay, let's use a, let's, you know, attend over all of the historical states. So I think it seems really good. It just, I don't know. It's probably the way to go. Like if you were to train like AlphaZero today, like I think you would probably use a transformer. Same thing with MuZero. If you were to train that today, you're probably going to use a transformer that attends over, you know, a really big window. It's just not an order of magnitude difference.
Nathan Lambert [00:27:44]: It's not on the same plot that scaling laws are. Like there are scaling laws for RL and there are some papers for this, but I don't think that they had the kind of same like open domain impact or like, I think it's like this kind of few-shot transfer thing that's very important where scaling in RL, you get better results on your tasks, but there hasn't been the same thing for like NLP where you have this kind of new task generalization and few-shot transfer that has made things like transformers and ChatGPT take off. Whereas RL, it's like the very, you're stuck on tracks. Yeah, absolutely.
Finbarr Timbers [00:28:18]: I totally agree with that.
Nathan Lambert [00:28:25]: And then kind of from here is when it starts getting interesting again. So 2022 is RLHF, ChatGPT at the end of the year. I also note that in 2022, there were more of these big papers on sim-to-real locomotion. I think the sim-to-real locomotion is just going. It hasn't really slowed down since 2019. And even today there's still solid work on it. And this is kind of like, I don't even really know what to say about RLHF at this point. I think we have to kind of incorporate it all into this two-year arc. So 2022, big boon for RL because RLHF is validated at a big scale. 2023 is like a big storm where there's people that are embracing it wholeheartedly. There are mega doubters. I have some quotes that are like, there are prominent people who are like, oh, they don't think RL will grow in three years post RLHF. I think that was like a paraphrase of an Andrew Ng quote, so it's not a direct quote. There's lots of things like this. I mean, I remember, I think, I don't know if I want to name names, but there are companies that were very much, we don't need RLHF in the first half of 2023 that have come around. And even I was like, I'll put my chips on this because it's the only thing I could do as an RL person for consistency. But it's like, I had no idea if this was going to work. So that's kind of, I mean, you can retell this 2023 story in many ways, but I think in 2024, it's been pretty clear, which is everyone talks about post-training a lot more. Just like AI feedback, which isn't just human feedback, but using RL at scale for post-training, that was all looking good. And then we kind of have this OpenAI O1 monster, which adds a whole new dimension to using RL for language models. I don't know if we're necessarily ready to call it exploration, but it seems like it almost looks like exploration just because they got some much more open-ended RL behavior. So the traces look like something exploring. I think we need more replications in open source to actually call it like this is an exploration breakthrough for RL or not. But there are many other angles.
Finbarr Timbers [00:30:29]: It is exploration. Because I think it is discovering new knowledge that's not just learning from your dataset. So I think it absolutely is. What I think is really interesting though, is RLHF, you don't need RL. Like the RL bit of it, I think of it as like RL inspired fine tuning, where like none of it actually, like you don't need to read any of Sutton and Barto, right? Like all you're saying is, you know, we have some, you know, scalar model that says, you know, when things are good, when things are bad, it makes you do the good things more. Like it's just none of it is actually important to the kind of Sutton and Barto, like, let's study...
Nathan Lambert [00:31:00]: study. I do think the loss function ends up giving you a little bit of a boost. So you get like a low percentage point boost by using some value function loss function rather than like DPO or something, which is really odd, but that's also not the core of RL. I agree.
Finbarr Timbers [00:31:17]: I think PPO is better than DPO, like, you know, or some, I don't think PPO is actually particularly important. I think, you know, some sort of reward model fine tuning, which I like, you know, PPO is probably one of the better ways to do it. You know, that's not important. Like, I think it's like, yeah, doing a reward model is better, but you can still get ChatGPT with DPO or something like that. Whereas I think O1, I don't see how you get O1 or something like that without RL.
Nathan Lambert [00:31:43]: Yeah. So kind of set some groundwork here before we go into this. O1 is said to be large scale reinforcement learning. I am coming around to think that it is just really large scale RL. They're not really doing any search because doing search would be too hard to deploy to users.
Nathan Lambert [00:32:00]: I think their masterstroke plot is beautiful, but it's like kind of from sampling. I don't know how much they can actually control the test time compute, but they can see a correlation that is obvious where the model, if it spends more on compute, it gets better answers. And I think if you do enough sampling, you can pretty easily see this. And now we're starting, now we have the DeepSeek one.
Finbarr Timbers [00:32:24]: But yeah, that's a good, well, I think, you know, when you say it's not doing search, like I think it's important to say, like, what do you mean by that? Because I think generally search is, you know, how do they use more compute to get a better answer? I don't think it's doing like MCTS or some sort of, you know, search like where you have some explicit like search like that. But I think it like what they're doing where you're spending more compute is a type of search. And in many ways, you know, chain of thought is search. All of this is search. And yeah. Yeah.
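One concrete version of "spending more compute is a type of search" without any explicit tree is repeated sampling with a scorer: draw N candidate answers and keep the best one. A minimal sketch, where `generate` and `score` are hypothetical stand-ins for a language model call and a reward model or verifier:

```python
# Minimal sketch of best-of-N sampling: more samples (more test-time compute)
# means a better chance that the highest-scoring candidate is actually good.
import random

def generate(prompt: str) -> str:
    # Pretend LLM: returns a candidate answer of varying quality.
    return f"candidate answer (quality={random.random():.2f})"

def score(prompt: str, candidate: str) -> float:
    # Pretend reward model / verifier: here we just read the fake quality back out.
    return float(candidate.split("quality=")[1].rstrip(")"))

def best_of_n(prompt: str, n: int) -> str:
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))

for n in (1, 4, 16, 64):                    # scaling test-time compute
    print(n, best_of_n("hard math question", n))
```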
Nathan Lambert [00:32:55]: I was mostly meaning Monte Carlo tree search or like some sort of tree because trees essentially don't, it's hard to scale up the infrastructure for serving them. I think if you're going to serve it on day one to tons of people on ChatGPT, you need a nice way to just store the entire KV cache of the model and just kind of keep going. If you have to like swap things in and out for tons of users, there's no way they could serve this, especially at long context at inference time. It's just like got to be simple.
Finbarr Timbers [00:33:27]: I don't, so I don't necessarily think that's true. I think you could, I think you could basically do it just as like a in-memory structure on top of the KV cache. I don't think you need to do anything that different. I think the real reason, I think it's kind of a bitter lesson thing where I think it's that they fundamentally don't want to be baking in all of like, there's this talk by Noam Brown. Noam is one of the best game, the top game theory researchers in the world. And he was, you know, there's kind of two main poker research groups, there's the University of Alberta and there's Carnegie Mellon. He was part of the Carnegie Mellon group. So we, you know, competed with them in poker for a long time. And so he's one, and he's now at OpenAI and he's this, you know, very successful, great researcher on search. And, you know, he had this talk where he was saying how they chose to do this kind of like in context based search. I think a lot of it is like the bitter lesson type stuff, whereas as soon as you start baking all this more complicated infrastructure, it just gets, you know, hard, you can't optimize it end to end. You have all these hyperparameters you need to choose, whereas if you just have the network do it and it's all, you know, like you backprop through all of it, it gives you more freedom. That's my suspicion of what they're doing, the motivation there.
Nathan Lambert [00:34:37]: Yeah, I generally agree. I think we're kind of debating the edges of the terms. And what I still don't know is how you get stable training for this. So how do you go from a language model to something that kind of talks to itself and doesn't derail? Having talked to enough language models, it's normally kind of binary or not, where once it goes kind of off track, it is just lost. So I think that's the beauty of it, is that it simultaneously seems very odd. And if you've watched enough RL agents, it kind of looks like it has the CoastRunners loopy loopy and then finds its way out, which is like, I don't know if that's a niche behavior that they somehow converged on, which I don't think it is. But it's like how to get recipes that elicit that regularly is just very impressive to me. And I think that people are going to have changed intuitions about language models and what RL is as a loss function once we figure out how to do this more.
Finbarr Timbers [00:35:41]: Well, that's a really interesting point, actually, because, yeah, I find like when you talk to base models that have been released to the public, it's like, yeah, they often even a lot of the really good ones still have these kind of these degenerate failure cases, the repetition, the weird, you're just saying, you know, weird stuff in these loops. And, yeah, you just don't see that in any of the mainstream, like the frontier labs. So, yeah, I think that they probably have a bunch of post-training secret sauce. But on the reinforcement learning side, I mean, I think that's where PPO really stands out. Like you talked earlier, you know, in that timeline, you have PPO on the list. I don't remember PPO being a big deal at the time. You know, I wasn't super. That was when I was just starting.
Nathan Lambert [00:36:24]: I was at Berkeley, which is why I'm just like, oh, my God. But like, I don't know. It wasn't like the same big deal that AlphaGo would have been. It wasn't mainstream.
Finbarr Timbers [00:36:35]: Now, I think the benefit of PPO is that you have this constraint, right, where you're constraining the updates. And that's incredibly useful with something like a language model where you have a known good policy. But, you know, that's not useful, I would argue, or at least not nearly as useful with something like, you know, AlphaZero or these sort of self-play regimes. So, you know, I'm curious if that's how they're able to do it, just with, you know, really smart constraints and hyperparameter tuning. That'd be my bias. But, you know, maybe they are, or maybe, you know, TRPO and some of the second order stuff is finally living up to its promise.
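The constraint being discussed here is PPO's clipped surrogate objective, which caps how far a single update can move the policy away from the one that generated the data. A minimal sketch of the per-sample loss (made-up tensors for illustration):

```python
# Minimal sketch of PPO's clipped surrogate loss: the probability ratio between the
# new and old policy is clipped to [1 - eps, 1 + eps], limiting the size of each update.
import torch

def ppo_clip_loss(new_logprobs, old_logprobs, advantages, eps: float = 0.2):
    ratio = torch.exp(new_logprobs - old_logprobs)            # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()              # maximize the surrogate

# Toy usage with made-up log-probabilities and advantages.
new_lp = torch.tensor([-1.0, -0.5, -2.0], requires_grad=True)
old_lp = torch.tensor([-1.1, -0.7, -1.5])
adv = torch.tensor([0.8, -0.3, 1.2])
loss = ppo_clip_loss(new_lp, old_lp, adv)
loss.backward()
```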
Nathan Lambert [00:37:10]: OK, well, all of these, I think we said a lot of words. So whether you're talking about REINFORCE, TRPO, and PPO, I think the more interesting two for what you said is kind of TRPO and PPO, which is that they might be better suited for fine tuning due to the step size of the gradient constraint. I think that's kind of what makes them interesting, like TRPO and PPO differ on how they constrain the step size. One's like a trust region; TRPO stands for trust region policy optimization. So it has a slightly different implementation. And I think PPO is simpler, but I don't remember exactly what the constraint is. But then you also have REINFORCE, which doesn't have that same constraint behavior. So on one hand, it's a nice story to tell that you get this nice behavior for fine tuning language models because you have this kind of constraint and you already have a good policy. But if we could also just use REINFORCE and do the same, I don't know if it actually ends up mattering if you're willing to do the in-depth hyperparameter search and kind of really get the recipe. It's a cool thing to think about, and I don't think we're going to have a final answer. We can go back to the kind of Tulu stuff that we were talking about, which I think is related to O1, which is just training in a way that is interesting and I think nice for O1. There is other work on this. We'll add links, like VinePPO and Quiet-STaR, which are either like more complicated implementations or very focused on reasoning. And what this Tulu 3 stuff that we introduced is different for is that it's for kind of a general purpose model and we're applying it without really getting degradation in the model. So we're doing RL for mostly math and we get small bumps and the model is generally better. I have two interesting anecdotes for you. One, we did this on a broken 70B model, not fully broken, but we essentially did like way too high of a learning rate for 70B SFT. So all the evals start breaking. So when you're training these models, if you see knowledge evals like MMLU, PopQA, and if you did GPQA, probably the same, if you see these start breaking, normally your hyperparameters are wrong because they should be really stable in post-training, or if you have special stuff, they might go up slightly. So we had an SFT model where these were totally breaking and then we applied RL on it and it like fixed the model, which I find to be so funny. So like all the evals that were really far down, like all of them bumped back up to what were normal numbers. And we don't really know what that means, but it's clear that like this, like I don't know if you could just do this with SFT, but it's clear that this like RL value function thing was at least a nice regularization for the model's parameters, which was kind of shocking to us. That's an interesting one. Yeah.
Finbarr Timbers [00:40:00]: The second one is really interesting because.
Finbarr Timbers [00:40:04]: Sorry, you go.
Nathan Lambert [00:40:05]: I was just going to move on to the other one. We can kind of reflect on this, which is if we leave this RL training running for way too long. So generally what happens in any over-optimization is with these, like our first few steps of the checkpoint within a day, doing this RLVR after lots of training already is where we get a small bump. And then if we leave it running for days and days, you keep optimizing for math and the specific things we have in it. So those scores are fine, but things start to degenerate on other evals outside the domain we're training on. We also got the behavior where the model was like doing the "wait, let me check my work" in its chain of thought reasoning. So it's like a very O1 behavior, but the training is not that complicated. So it's like in some ways I'm like, OK, if we get the right seed, if we change the RL a bit, it's not that crazy to think about this happening for like every output. You just have to figure out how to get to that sort of area of the parameter space across your entire response region and not just math.
Finbarr Timbers [00:41:05]: Yeah, that's interesting. Well, what's really interesting is that you said that RL solved the problem for you, because I find that generally when you're trying to apply RL to a real world problem, it's so hard to get it working. So that's really neat because that's kind of the promise of RL, right? Is that by the time you finally get it working, it can often work around all these broken issues with your setup. Like there's the famous cheetah experiment where they had the simulated cheetah and they said, OK, your goal is to move as quickly as you can, like literally like maximize velocity. And it learned to like kind of like skid on its back along the ground really quickly.
Nathan Lambert [00:41:47]: I was on a paper on hyperparameter tuning for model-based RL where it like cartwheeled endlessly. So it just figured out a way to cartwheel and keep accelerating itself into the distance, which is a funny gif.
Finbarr Timbers [00:41:58]: Yeah, so that's really cool. That's the promise.
Nathan Lambert [00:42:03]: The lore behind this is that John Schulman gave us some advice and he was like, just do RL on the outputs. So if you go to the acknowledgements, John is like the first person in the acknowledgements. So it's like, it didn't take us that many months to get this to actually work. I think John told us to do it. We set it up and the numbers went up and we're like, oh, huh. So if you think about OpenAI having had done this for over a year and explored all sorts of RL things, they probably just found a vein to keep digging in. Yeah, there's other things that are different. They might need to use a base model. They need really big verifiers to expand this to more domains. One of the differences between DeepSeek R1 Lite and OpenAI O1 is that DeepSeek R1 is very domain limited. It really only wants to talk about math. It says it talks about coding, but it still refuses a lot of my coding queries. So it's like really, really narrow, probably even more so than OpenAI O1 mini. But like this O1 preview, you really can ask it anything, which is just a totally different level of scaled training than what DeepSeek's version was. DeepSeek's version is impressive for how soon it came out and I've played with it. But it doesn't feel quite as simultaneously magical and unhinged as OpenAI's version.
Finbarr Timbers [00:43:27]: Well, so that's kind of the interesting thing is that when you talk to, I talk to VCs, you know, just semi, just like regularly, just on like a casual basis. And a lot of them lately, the last two or three months or so, have been really excited about data annotation and labeling companies, like basically Scale AI competitors. And a lot of this is because... They're like a year late, what are they doing?
Finbarr Timbers [00:44:00]: Classic VC, right? But, you know, a lot of them are really, they're really excited because they're all of the big labs are spending these huge amounts of money on these services. And I think a lot of it is just getting this data that you can then do like something like the Tulu, like verifiable reward thing. And it's like if you have enough data and you're paying all these, whatever, like, you know, PhD biologists to come up with interesting problems, like I think, yeah, you can do so much for it. I think that's the biggest advantage that, you know, the big labs have is they spend so much money creating these data sets that it's so hard for anyone else to compete with that.
Nathan Lambert [00:44:36]: Having done Tulu, I can give an example for what these would look like for O1. A lot of people would say that O1 wouldn't need human data because it would be like too narrow.
Nathan Lambert [00:44:44]: I think they're not going to give demonstrations for O1, so you're not going to have a human sit there and write out like a meandering chain of thought. What you need is a lot of prompts with true answers. So where we were limited in Tulu is we have the math training set, the GSM8K training set, and we made a new training set based on IFEval, which is like the precise instructions, like write me a recipe and start every line with the letter A. So you could check that with Python as well. So we have a set of prompts that are like that. But if you want to scale O1, I'm guessing you need a million prompts that have a verified answer or a really good verifier. So like, and you only get a really good verifier by kind of training it on some sort of data.
Nathan Lambert [00:45:25]: So I can see buying a lot of prompts in that domain actually being extremely helpful to scaling up RL type approaches.
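The "check that with Python" part is literal. A toy version of the kind of precise-instruction verifier Nathan describes for a prompt like "write me a recipe and start every line with the letter A" (the actual Tülu 3 checkers differ; this is only illustrative):

```python
# Toy verifier for a precise-instruction prompt:
# reward 1.0 only if every non-empty line starts with the requested letter.
def every_line_starts_with(completion: str, letter: str = "A") -> float:
    lines = [line for line in completion.splitlines() if line.strip()]
    if not lines:
        return 0.0
    ok = all(line.lstrip().upper().startswith(letter.upper()) for line in lines)
    return 1.0 if ok else 0.0

print(every_line_starts_with("Add flour.\nAdd water.\nAdd salt."))  # 1.0
print(every_line_starts_with("Add flour.\nMix well."))              # 0.0
```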
Finbarr Timbers [00:45:34]: Yeah. That makes sense. And again, like if you have an army of PhDs, you know, writing out, yeah, finding these facts and writing these prompts, like it, yeah, that makes sense.
Nathan Lambert [00:45:47]: In O1 it's all probably like code and, I mean, they do programmatic filtering too. I think we need to set up automatic code verifiers. It's been clear for a while that everyone assumes that like Claude is doing that for code and other people are doing this for code.
Finbarr Timbers [00:46:02]: That's like, well, it's not that surprising. The other thing is that when you go to the demos that they have, they have all these and the promo material they put out, it's all these different domains, right? It's, it's all, they have this genetics one, they have this quantum physics. So I think that's, I don't know. Yeah. That, that's really interesting because especially on the different, like, I don't know, I haven't benchmarked it, but if you're like the difficult biology problems, like you can probably pick up a lot of textbooks, like just, you know first or second year textbooks. And there's probably a lot of interesting stuff. Whereas when it comes to coding, that's, you know, we're, I think we're a lot deeper in it just because there's so much, that's the domain that these, the people who make these models are familiar with.
Nathan Lambert [00:46:43]: Do you actually use O1 for anything?
Finbarr Timbers [00:46:48]: I don't like ChatGPT, actually. I use Claude for everything. Every time I try to use ChatGPT, generally, I find it really tough to work with. It does the whole, you know, it puts a comment saying, you know, you'll say, Hey, ChatGPT, like write some code for me. And it'll like say, Oh yeah, here's a function. And then there's just a comment. And the comment says, you know, write the rest of the function. And it's like, well, that's what I want you to do. Whereas Claude is just magical and does everything you want to do. So yeah, I actually don't use ChatGPT regularly. I just use Claude for everything. Yeah.
Nathan Lambert [00:47:21]: I use Claude for co-pilot like that as well. I think I use ChatGPT for your pure like information munching. So if I have a table or something I need to change formats or just like even like rounding a table, it's so silly. It's like rounding a markdown table and converting it to LaTeX. Like ChatGPT feels like a highly capable, flexible computer. And then I use O1 as like a Hail Mary, which is like, wow, I'm totally screwed. This isn't going to work. Let's just chuck it to O1 and tell it to think for a long time and see if it comes up with something that actually works, which is a really hard thing to market as a product. Like, how do I do that? I also think the thing that's overblown with O1 is like, if you ask Claude a very hard question, I was doing some examples for O1 where I was comparing O1 to R1, DeepSeek's.
Nathan Lambert [00:48:09]: And then I was also comparing to Claude. So if you pass in a really hard question to Claude, it also thinks for a while before generating the first token because Claude has these hidden tokens in the system prompt, which is like think for a bit. So I also think that it's like, OK, this is kind of, I don't think it's writing the weird like switch to French reasoning trace and come back to English. But the basic thing is just get your model to spend more on compute for the right user queries. And in many ways, the storytelling behind it is too complicated.
Finbarr Timbers [00:48:40]: Well, the other thing is that, you know, you said that, you know, John Schulman was the guy who put you onto this. Like, isn't he at Anthropic right now? Or am I making it up?
Nathan Lambert [00:48:52]: Yeah, he's moved to Anthropic. I talked to him after he moved to Anthropic for this. I don't really know what he's working on there, but I think he just wants to do research.
Nathan Lambert [00:49:02]: My read is that John. Yeah, my read is that he fell into ChatGPT without knowing it was going to be so big, did about 18 months at the helm, knowing it was not really what he wanted to be doing. And then it was like, I've done my time. I'm going to go do something else.
Finbarr Timbers [00:49:22]: I can think of a few researchers who've had that happen to them. It's kind of funny.
Nathan Lambert [00:49:28]: Because John is very just sincere and he is like very researcher when you talk to him. I don't think he wants to be running ChatGPT. I would also not want to be running ChatGPT. That sounds extremely stressful.
Finbarr Timbers [00:49:44]: Well, that's a funny, you know, it's the whole, like, that's a common thing that happens, like I saw it all the time in my time in industry research labs, where you see there's a researcher who does a really good job and they get promoted and they get promoted again and then they get promoted a third time and now they've got a team of people and they're not doing research, right? They're managing a team of other researchers. And, you know, or I mean, there's also academia, where it happens all the time. You become a PI and you're telling other people what to do. But it's just funny because, yeah, I don't know. That's a totally different job, you know, and I think it's much less interesting and much less fun.
Nathan Lambert [00:50:19]: OK, let's keep going into this, which is the extra advice section. I have reflections on this from the Tulu project. Before we start, I'm guessing you're an IC. I don't think you're a manager from how you describe this.
Finbarr Timbers [00:50:33]: No. Yeah.
Nathan Lambert [00:50:34]: OK, so the whole building thing. You're presumably building some type of models, but I think building language models for kind of multitask abilities is a very weird management project where I don't think that it's like you can let one person just run. And I think you generally need a lot of motivation over a long time scale, which is not surprising because it's very detail oriented. And then you need two types of people. You need one person that's going to get to the bottom of every single thing that is relevant to their sub area. And then you need the type of people that can hold everything in their head and like run these projects and say no to people that are going off on goose chases and be like, you need to come back for a while and actually contribute to this model. Because like even doing Tulu is such a big project where we have these multiple stages of training. We have people that are focusing on SFT data, people focusing on math data. We have people doing explorations on like can rejection sampling work? What training algorithm should we do?
Nathan Lambert [00:51:31]: So it's just a lot of noise. And then you have to be able to understand what is actually working or not and fold that into like the direct recipe. And there's a lot of these waves and it's extremely complicated. I mean, there's just a brutal chart of our Slack for the month that we were pushing Tulu out, which is I have sent twice as many Slack messages as anyone else in the company in the last month, just because I'm like in the weeds of every single thread on this silly post-training process, being like making sure we don't do this stupid thing wrong. And I was definitely operating as the person doing these details. But I think it's really important to getting a good language model done is having these people that kind of have oversight of these really deep individual contributors. Otherwise, you're just going to drop things. I've seen it in other projects where it's just like, oh, long context is pushed to the next project. But like that happened for our post-training. But it's like that just happens again and again and again if you don't have this kind of structure, which I think is very different than research. It's like research is like five to seven people. This Tulu project is like seven people that are pretty much core, and I can only imagine at these labs, like Llama supposedly has 200 people on their post-training team. Managing that hierarchy and how you shuttle information up the chain is, I don't know how they handle it. Like that's why you hear about politics of big companies. But I think it's so much of a management problem to just build good language models. And I do think there's going to be...
Finbarr Timbers [00:53:01]: I think that is generally a big problem with large-scale reinforcement learning. We saw this at DeepMind on a project where I led the engineering effort for a while; we were trying to combine DeepStack, which was the state-of-the-art poker agent, with AlphaZero. We were trying to make an AlphaZero variant that could play poker, basically. And it's exactly what you say, it's really difficult. And funnily enough, one of the highest value-add people on this project, which was called Player of Games, was a program manager. This guy, Alden, basically did the standard project management stuff that I think is kind of looked down upon in the tech industry. But he had a burndown chart and a checklist, and he would just go down it and say, OK, what's the status of this thing? What's the status of this thing? And it was exactly what you say. He was the guy making sure that all of what we'd now call the post-training stuff was done in the right order, that it was done at all, that all of the prerequisites were being done. That's such a value add and a productivity multiplier if you have someone who really keeps that stuff straight. And when you don't, everything collapses. I kind of suspect that with some of the labs we're seeing struggle to ship stuff, that's the problem: project management just isn't really taken seriously, and there's a ton of impact there, particularly in research.
Nathan Lambert [00:54:26]: I think at all these labs, the individual improvements are probably astronomical, and then putting the pieces together into one model is really hard. The other thing we've seen with Tulu is that we have this SFT dataset, which multiple people did all these ablations on, and we were trying to apply it to Olmo, which doesn't have code and isn't multilingual relative to Llama, and the dataset was really built for Llama. So I tried removing all the multilingual data, and the performance got worse. I would have expected that we don't need it, that it's not a target for Olmo, but the numbers go down. There are just so many weird second-order effects in this stuff.
Finbarr Timbers [00:55:05]: And I don't know.
Nathan Lambert [00:55:07]: Yeah, I'm excited to see more coherent theories emerge from the industry on this over the next few years, just on how to set up a foundation modeling team. I don't know if people think there will still be pre-training. The other question is whether it's only going to be three companies pre-training, in which case I don't know if we'll actually get this information on how to set up these teams.
Finbarr Timbers [00:55:30]: Well, I think generally, though, this is a problem we see with industrial research, where the process I've seen happen repeatedly is Goodharting, right? You set up a benchmark, you have a bunch of people go and try to beat it, and they come up with increasingly convoluted ways of beating it, which isn't what you really want. This happened in NLP for a long time. There was this one paper that was really great, where they took an LSTM that had been used as a baseline since something like 2015, looked at all of the research on the benchmark in the years since, through 2019 and 2020, and compared it against the same LSTM with a really thorough hyperparameter search. And it turns out that, properly tuned, the LSTM was as good as all the improvements since then. I think it's stuff like that: people didn't do a good job of benchmarking and combining improvements before we got into the large language model era, and now that's kind of impossible because you can't ablate all of these different experiments. So, yeah, it's a really big problem, and I don't think we're doing a good job of it as an industry.
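As a side note, a thorough baseline sweep of the kind Finbarr describes is cheap to express. The sketch below is a generic random hyperparameter search; the search space and the `train_and_eval` stub are made-up placeholders, not the setup from the paper he mentions.

```python
# A minimal sketch of "tune the old baseline properly": random search over an
# LSTM-style configuration. train_and_eval is a stand-in for actually training
# the model and returning a validation metric (lower is better).
import random

SEARCH_SPACE = {
    "hidden_size": [256, 512, 1024],
    "num_layers": [1, 2, 3],
    "dropout": [0.0, 0.2, 0.4, 0.6],
    "lr": [3e-4, 1e-3, 3e-3],
    "weight_decay": [0.0, 1e-5, 1e-4],
}

def train_and_eval(config: dict) -> float:
    # Placeholder: pretend this trains the LSTM and returns validation perplexity.
    return random.uniform(60, 120)

def random_search(n_trials: int = 100) -> tuple[dict, float]:
    best_cfg, best_score = None, float("inf")
    for _ in range(n_trials):
        cfg = {k: random.choice(v) for k, v in SEARCH_SPACE.items()}
        score = train_and_eval(cfg)
        if score < best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

print(random_search(20))
```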
Nathan Lambert [00:56:44]: Yeah, I mean, this is a classic story. We don't need to rehash all of this. I think even AI2 has some of this: when we have an eval we're targeting, our scores on it will be really good, but we don't really know how to do something like character training. It's something I might spend a few months on next year: how do we make Olmo have a coherent character like Claude does? This is a really deep cut, but buried in the giant Lex Fridman interview with the Anthropic people, Amanda Askell essentially revealed that they do constitutional AI for character as well, rather than just for moral principles, which is refining synthetic data to help teach the model to behave in a certain way and have a certain style. I think that's important; it's why we like Claude. It has a personality, but then they also have to maintain it. That's something there is no academic research on.
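A minimal sketch of what constitutional AI for character might look like as a data pipeline is below. The character principles, the example prompt, and the `generate` helper are all hypothetical stand-ins for a real model call, not AI2's or Anthropic's actual setup; the point is just the draft, critique, revise loop that yields synthetic SFT data in a target style.

```python
# Sketch: produce style-consistent synthetic SFT data via critique and revision.
CHARACTER_PRINCIPLES = [
    "Respond warmly but concisely.",
    "Admit uncertainty instead of guessing.",
    "Keep a consistent, lightly playful tone.",
]

def generate(prompt: str) -> str:
    # Stand-in for a real model call (e.g. any instruction-tuned model you serve).
    return f"(model output for: {prompt[:40]}...)"

def character_revision(user_prompt: str) -> dict:
    draft = generate(user_prompt)
    critique = generate(
        "Critique the assistant reply below against these character principles:\n"
        + "\n".join(f"- {p}" for p in CHARACTER_PRINCIPLES)
        + f"\n\nUser: {user_prompt}\nAssistant: {draft}\nCritique:"
    )
    revision = generate(
        "Rewrite the reply so it follows the principles, using this critique:\n"
        f"{critique}\n\nUser: {user_prompt}\nRevised assistant reply:"
    )
    # The (prompt, revision) pair becomes one synthetic SFT example in the target style.
    return {"prompt": user_prompt, "completion": revision}

print(character_revision("How do I wax my skis?"))
```

The revised pairs would then be mixed into the ordinary SFT data so the style is learned alongside everything else.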
Finbarr Timbers [00:57:40]: Well, that's what's really interesting about Golden Gate Claude and the whole sparse autoencoder line of research: it seems perfectly designed for this. If you want to make a state-of-the-art character AI, I think you could do that by taking whatever the best open foundation model is, training a bunch of sparse autoencoders on it, and then tuning it to the various characters you wanted. So I think they're really pushing the boundaries of that work, but it's kind of flying under the radar.
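For concreteness, here is a toy sketch of the kind of feature steering Golden Gate Claude demonstrated: train a sparse autoencoder on residual-stream activations, pick a feature associated with the persona you want, and add its decoder direction back into the residual stream at inference. The dimensions, feature index, and steering strength below are made-up assumptions on random weights, not Anthropic's actual setup.

```python
import torch
import torch.nn as nn

d_model, d_sae = 512, 4096   # residual width and SAE dictionary size (assumed)

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model, bias=False)

    def forward(self, x):
        f = torch.relu(self.enc(x))   # sparse feature activations
        return self.dec(f), f

sae = SparseAutoencoder()             # would normally be trained on real activations
persona_feature = 1234                # index of the "character" feature (hypothetical)
steering_strength = 4.0               # how hard to push the feature

def steer(residual: torch.Tensor) -> torch.Tensor:
    # Add the chosen feature's decoder direction to the residual stream.
    direction = sae.dec.weight[:, persona_feature]   # shape: (d_model,)
    return residual + steering_strength * direction

# Example: pretend this is a residual-stream activation grabbed by a forward hook.
resid = torch.randn(1, 8, d_model)    # (batch, seq, d_model)
print(steer(resid).shape)             # torch.Size([1, 8, 512])
```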
Nathan Lambert [00:58:10]: Yeah, they should have a paid tier where we can have Golden Gate Claude for a day. They'd probably make a lot of money. People would pay for that. It's ridiculous. OK, to wrap up, I wanted to shout out that you have a fun list of advice buried in your website, which I'm the target audience for as another tech person who does outdoor sports. Some of this is beating a dead horse in a way, but there are two I want to highlight. Number 13 is to encourage people to do things, which I think is very good. And the other one is number 14, which, to paraphrase part of it, is that you have to create a public surface area for people to be aware of you. Part of why I do these interviews is keeping in touch with people who have figured these things out and understand what truths there are to building AI and AI careers in this modern space, and those two kind of encapsulated it for me. You have to have agency and you have to be a team player, but you've got to fend for yourself in a way. And you've been writing, and you have an audience to do that now.
Finbarr Timbers [00:59:22]: Well, I'm glad you find the advice useful. A lot of it is kind of esoteric, like how to wax your skis properly, so I'm glad there's something that provides value. I mean, a lot of that is stuff that came out of my career at DeepMind, where, as you start to advance, pretty quickly technical skill is not the bottleneck. The bottleneck is your ability to influence others and how you get along with others. And that's the encouragement piece: if you want to have an impact on the culture of an organization, it's really tough to go to someone and say, hey, stop doing this, and be negative. No one ever responds well to that. But if you say, oh man, Nathan, I love your podcast, it's so great, please keep doing it, and I loved that episode you did with Finbarr, please do more like that, people just respond better to positive encouragement. And people often don't get anywhere near as much positive feedback as you'd think, particularly at these large companies. I found that, other than from my manager, it was very rare to actually get it, and DeepMind was quite good for this compared to other places I've worked. It's just rare for people to get positive feedback, so it really stands out, and it's a way you can influence others and steer them in directions you think are fruitful. And then, in terms of the surface area thing, there are a lot of people I can think of at DeepMind, or for that matter from grad school or undergrad, who are way smarter than me. But the thing is, they don't do anything public. The way they get a job is they apply through the online application portal and they just do that job, right? So if you want to see how brilliant they are, you have to go find them somewhere. Whereas if you can start writing things, people find out about you. I've been fortunate in that a bunch of the articles I've written have gone to the front page of Hacker News, and just because of that, people know that I exist. When I worked at DeepMind, I was maybe just an average employee, but it turns out very few people write publicly, so you have this competitive advantage. People find out about you, they want to do things with you, and some really interesting opportunities have come up thanks to that. It's been probably the single best thing I've done for my career.
Nathan Lambert [01:01:55]: Yeah, I mean, the funny example is Logan Kilpatrick. He was at OpenAI and now he's back at Google, and I think they're fighting over him entirely because he's the most prominent voice in the AI developer ecosystem. And you have to look at that and not be extremely cynical if you are not doing anything like it. I think it's easy to be cynical and say, oh, that's not actual value, which is...
Finbarr Timbers [01:02:17]: But it just is value. Because, again, what he's done is he's made this brand as the guy who is on the side of the customer. He has this persona of someone who is willing to fight the powers that be inside the company to make it better and easier to use, and Logan will fix your problems. That was my experience: I had some problems using the API and I was complaining about them, and he tried to fix them and worked on it. So he's in this position where, by giving him presumably a lot of money, Google is able to say, we care about our customers. That's a really great place to be. In the same sense, you are in a place where you've established yourself as an expert in language models and an expert in reinforcement learning, so if there's a company that really cares about that stuff and really wants to take your reasoning seriously, they can go and say, hey Nathan, here's a dump truck full of money; if you come in, we can put your head on our slide deck and show it to investors. Right? It's just a really brilliant persona that he's come up with, and that you've come up with, and that all of us who write publicly are working on.
Nathan Lambert [01:03:35]: Yeah. One of the more interesting things I've tried to figure out, going back to the encouragement one, is how to encourage people to do this and how to start building momentum for them. For some people it's easy: I have a lot of surface area, so I can hand talks to people on my team that I know are good and say, look, go give this talk, it's good for you. And then it's about how to keep building that kind of momentum. At the same time, it's so fun to watch Google, Anthropic, and OpenAI, because they all have extremely strong internal cultures right now, I believe. I don't talk to that many people at them, but from the messaging it seems obvious that they have really strong internal cultures, and that culture is a great motivator for people. It's almost a way to get people to encourage themselves, even if it comes with silly sayings about wanting to learn. All of these companies are riding that wave. And a lot of running a midsize company is probably cultivating that on your own without being at the absolute forefront of AI, and it'll go very far.
Finbarr Timbers [01:04:38]: Well, it's the classic thing: the culture is who you reward. It's who you say did a great job, who gets promoted, and you can really see that start to happen. I saw it in various jobs I've had, where you start to reward people for coming up with a new framework for training models, and then all of a sudden everyone has their own framework and they're working hard on it, and you have too many frameworks that need to be collapsed. Conversely, if you reward people for shipping or for getting good results, then everyone orients themselves around that. I think that's a lot of the success of somewhere like OpenAI compared to Google. Google certainly had a technical team on par or maybe stronger, that's arguable, but they didn't have that same culture of shipping aggressively and building products, whereas OpenAI did. And that was this huge advantage.
Nathan Lambert [01:05:39]: Yeah, I mean, part of it right now is that I had a major release last week and I'm trying to get another major release out the door. I think one of the things is just to meet your deadlines. Once you start meeting your deadlines, it actually shifts culture a lot. Yes, it sucks to do a 10-day death march to get a model out the door and write a 50-page paper or whatever, but when you can say, look, we set this date a long way out and we did it... I mean, we're both endurance athletes, but you have to do the things you say you'll do. Making releases almost comes down to doing the training you said you were going to do, in some ways. And it adds up once people do a lot of them. But it's hard to get the culture of doing it if no one has done it before.
Finbarr Timbers [01:06:22]: Well, that's what I love about outdoor endurance sports: you kind of just have to finish. I like to go backcountry skiing, and I do these hut trips where you ski uphill to a hut in a mountain pass in the middle of nowhere and then ski out of it. But you have to get to the hut before it gets dark, or you're going to be sleeping in the forest in the Canadian winter, which isn't pleasant. With a lot of this stuff it's the same: you need to get it done. Particularly when I was at some of these larger institutions, there's not a lot of pressure to release projects, so it's very natural to say, oh, we can make it better, let's wait until it gets better. I was on one project that took us several years, and we kept thinking we had six months to go. Then, six months later, we'd say, oh, well, we still have six months to go. And that happened for years.
Nathan Lambert [01:07:15]: A six month deadline is not a deadline. It only becomes a deadline when it's a few weeks out.
Finbarr Timbers [01:07:21]: And if you have a deadline and you say, we are pushing something out at this deadline, then you're going to put it out. Maybe you're going to have to sacrifice quality or sacrifice scope or something, but you are going to put something out. So you have to figure out, OK, where are we going to cut? What are we going to let through? And that exercise is incredibly important. I think that's an advantage someone like Apple has, and I think they're going to start shipping really well, because every year they have to put out a new product. There's going to be this incentive to get your stuff into the new iPhone release, and then when it comes time for your performance review, you can say you got it in. It's just this massive incentive, and I think that'll bode really well for them. Yeah, that's fun.
Nathan Lambert [01:08:04]: This is really good. I'm going to go back to trying to ship this model. I have a bucket load of Slack messages and some evals to run, but it was fun to chat. Yeah, thanks for coming on. I'll click stop.
Finbarr also keeps a longer blog archive at https://finbarr.ca/blog/.