
Interviewing Ross Taylor on the state of AI: Chinese open models, scaling reasoning, useful tools, and what comes next

Interconnects Interview #14. Ross's second time on the show.

I’m excited to welcome Ross Taylor back on the podcast (and sorry for the lack of episodes in general – I have a lot going on!). The first time Ross came on we focused on reasoning – before inference-time scaling and that sort of RL was popular, agents, Galactica, and more from his Llama days. Since then, and especially after DeepSeek R1, Ross and I have talked asynchronously about the happenings of AI, so it’s exciting to do it face to face.

In this episode we cover some of everything:

  • Recent AI news (Chinese models and OpenAI’s coming releases)

  • “Do and don’t” of LLM training organizations

  • Reasoning research and academic blind spots

  • Research people aren’t paying enough attention to

  • Non language modeling news & other topics


Listen on Apple Podcasts, Spotify, YouTube, and wherever you get your podcasts. For other Interconnects interviews, go here.

The show outline is a mix of questions and edited assertions that Ross sent me as potential topics.

00:00 Recent AI news

Related reading is on Kimi’s K2 model, thoughts on OpenAI’s forthcoming open release.

  • What did you think of Z.ai’s GLM 4.5 model (including MIT licensed base model) with very strong scores? And Kimi?

  • What will OpenAI’s open model actually be?

  • What do you make of the state of the ecosystem?

12:10 “Do and don’t” of LLM training organizations

Related reading is on managing training organizations or the Llama 4 release.

This is one of my favorite topics – I think a lot of great stuff will be written on it in the future. For now, Ross asserts…

  • Most major LLM efforts are not talent-bound, but politics-bound. Recent failures like Llama 4 are org failures not talent failures.

  • Most labs are chaotic, changing direction every week. Very different picture from the narrative presented online.

  • Most labs resemble investment banks or accountancy firms in that they hire smart young people as “soldiers” and deliberately burn them out with extremely long hours.

36:40 Reasoning research and academic blind spots

Related reading is on two papers that raise questions about the Qwen base models for RL (or a summary blog post I wrote).

I start with: What do you think of o3, and search as something to train with RL?

And Ross asserts…

  • Most open reasoning research since R1 has been unhelpful - there isn't enough compute to see what matters (the underlying model and iterations).

  • Best stuff has been simple tweaks to GRPO like overlong filtering and removing KL divergence.

  • Far too much focus on MATH and code - AIME has only tens of samples, so it is very noisy.

  • People are generally building the wrong kind of environments - like puzzles, games etc - instead of thinking about what kind of new capabilities they’d like to incentivise emerging.

50:20 Research people aren’t paying enough attention to

The research area I hear the most about right now is “rubrics” – a per-prompt specialized LLM-as-a-judge to replace reward models. SemiAnalysis reported OpenAI scaling this approach and lots of great research is coming out around it.
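To make the rubric idea concrete, here is a minimal sketch of a per-prompt rubric reward. The `call_judge` helper and the example criteria are hypothetical stand-ins for whatever judge model you use, not any lab's actual pipeline:

```python
# Hypothetical sketch of a per-prompt rubric reward: each training prompt
# carries its own checklist, and an LLM judge scores a completion against it.
# `call_judge` is a placeholder for whatever judge model or API you use.

def rubric_reward(prompt: str, completion: str, rubric: list[str], call_judge) -> float:
    """Return the fraction of rubric criteria the judge marks as satisfied."""
    satisfied = 0
    for criterion in rubric:
        question = (
            f"Prompt: {prompt}\n"
            f"Response: {completion}\n"
            f"Criterion: {criterion}\n"
            "Does the response satisfy the criterion? Answer YES or NO."
        )
        if call_judge(question).strip().upper().startswith("YES"):
            satisfied += 1
    return satisfied / len(rubric)

# Example rubric for a single research-style prompt (entirely made up):
example_rubric = [
    "Cites at least two primary sources",
    "States the key uncertainty explicitly",
    "Stays under 300 words",
]
```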

I start with: What do you think of the state of RL scaling and generalization? What of models losing

Ross asserts…

  • Rubrics are underhyped on social media - they were the driving force behind projects like DeepResearch - and GenRMs are interesting but perhaps slightly overhyped.

  • There is an evals crisis - there are not enough high quality evals, particularly for frontier tasks like automating research and real life work. This is an impediment to anyone building agents or ASI.

01:02:46 Extra stuff!

I ask Ross: What AI are you using today? Why?

To conclude, Ross wanted to discuss how AlphaEvolve has been underhyped on social media, and how it means the future isn't just RL. It shows there are other effective ways to use inference compute.


Transcript

Created with AI, pardon the minor typos, not quite enough time this week but I’m hiring someone to help with this soon!

Nathan Lambert: Hey, Ross. How's it going? Welcome back to Interconnects. Why did I take a many-month break off podcasting? I've been too busy to do all this stuff myself.

Ross Taylor: Yeah, I was trying to think of all the things that happened the last time we did a podcast like a year ago. And I think in AI time, that's like two hundred years.

Nathan Lambert: Yeah. So I was looking

Ross Taylor: at it on age. So

Nathan Lambert: We talked about reasoning and all of the, like, I don't think o1 had happened yet, which was pretty funny, I think. For a brief intro, Ross was a co-founder of Papers with Code, and that brought him to Meta. And then at Meta, he was a lead on Galactica, which was a kind of language model ahead of its time relative to ChatGPT. So if people don't know about Galactica, there's a great paper worth reading. And then he was doing a bunch of stuff on reasoning with Llama related to a lot of the techniques that we'll talk about in this.

And now he's doing a startup. I don't know if he wants to talk about this, but generally, we talk a lot about various things. This got started through o1 and trying to figure out some of this scaling RL stuff. We started talking a lot, but then we also just resonate on a lot of topics on training language models and other fun stuff, and trying to be one of the few people not in these big labs that tries to talk about this and think about what the heck's going on. So we're gonna kind of roll through a long list of a lot of things that Ross sent me he wanted to talk about, but these are really just a compilation of the things that we've talked about, and we'll just kind of flesh them out outside of the Signal chat.

So, Ross, if you wanna introduce yourself more, you can, or we'll just kind of start talking about news because I think a lot of people already know you.

Ross Taylor: Yeah. Let's get into the news. I think there's lots of fun things to talk about.

Nathan Lambert: What did you think of the last two weeks of Chinese models? I think we had Z.ai's GLM 4.5 today. Kimi K2 last week. I think Qwen is on a roll. It's like, what, the summer is supposed to be chill, but this is crazy.

It's like, I haven't even used all of these. It's like the pace is just incredible. And all the open models have actually good licenses now. But is this gonna actually hurt anyone in The US, or where do you see this going in six months?

Ross Taylor: Yeah. So yesterday was, like, the one day I actually tried to turn off Twitter. And, yeah, so when you told me in the morning about the new, like, Tsinghua model, I was like, okay, I had to read up on that. So then maybe that gives an idea that, you know, you take your eye off Twitter for one second, then you're, like, I don't know, two months behind on open source.

Maybe that's an exaggeration. Yeah. I think the general theme is it's just been absolutely relentless. Right? So, again, thinking about the last time I spoke to you on the podcast a year ago, you know, Llama 3, I think, was a fairly, like, established standard.

You know, like, I think there were things happening in the background, you paid attention to things, but now it's just absolutely relentless. I think the thing about particularly Chinese business cultures is, as soon as they find something successful, they're just very good at concentrating resources and going after that. So I think we see a highly competitive space. I think the context is very interesting on, like, different dimensions. I mean, there's the geopolitical dimension, which I think you've hinted at in some of your blogs, and, like, what does this mean now if, like, the open source standard is Chinese, right?

What does that mean if we think about these models not as, like, just things that power products, but as, like, infrastructure? Then it seems like China has a great advantage if they want to, like, be the standard, you know, for the whole, you know, global south. Right? But there's also

Nathan Lambert: Yeah. There's a few things that we're gonna come back to in this conversation that are so interesting. We're gonna roll into our what-the-heck-does-it-take-to-train-these-models discussion. And we're gonna talk about, like, how crazy and political and hard it is in The US. But we have all these orgs popping up in China, which is like, is this just, like, partially a US problem?

But then we also have OpenAI that's supposedly gonna release a model. Like, it's like there's multiple things, I think. Yeah. Anyway, training dynamics later.

But it's like, why is, like, is China just really well suited to training these language models, when we talk about politics later? Like, is it that easy in some ways?

Ross Taylor: Yeah. I don't wanna make, like, generalizations, because I think, if anything, what we've seen is actually that a lot of these new Chinese orgs are actually good at some innovations, right? I mean, this week we had GSPO, which was a nice innovation. But I think the general sense is that, you know, once something is established as a successful thing and, like, replicating it is essentially just an engineering problem, then traditionally, like, Chinese-based orgs, the culture is just very well set up to do well in that. The other dimension I think is now, especially after DeepSeek.

Right? So the Chinese government traditionally has been very good at, like, recognizing what's successful and channeling resources in, especially with things like, you know, public-private collaborations. Right? So, like, I think the conversation I saw on Twitter this morning is like, okay, so Tsinghua has their own, like, state-of-the-art LLM?

Like, why doesn't MIT have their own,

Nathan Lambert: like They're kinda deprived. But

Ross Taylor: Yeah. Yeah. I mean, so I think The US will wake up to this. But

Nathan Lambert: My understanding is that Tsinghua, or Z.ai's, I think, Zhipu, I don't know how to pronounce it, is a startup that, like, spun out of Tsinghua. So it's like, I don't know if that's, like, the best comparison. And also, like, Alibaba's the clear winner here because they have Qwen, but they've also invested in Moonshot, which is Kimi, and then I think also this Z.ai. So it's like, yeah. I'm more interested in the question of, like, the question of why they're all open is way more important relative to the talent, because there's universities that have model orgs spinning out of them in The US, surely, but it's not all of them, and it's really not all of them in China.

I think MIT may do it. There's, like, some other small numbers that we're dealing with in that case. But it also is a thing that, I obviously agree that The US should have more compute deployed for academics, and a lot of the universities are just spinning it up. It just takes a long time. So I think there's a lot of mixed things there that are easy to draw conclusions from.

And potentially, like, there's a good tweet in it, but I don't think it'll be 100% true, which makes for a very viral tweet when it, like, feels true.

Ross Taylor: I think there's definitely a naivety about, like, how things are actually working, right? And there's asymmetric information, you don't know what's going on on the inside. And I think the other thing is, like, I mean, maybe this is a separate topic, but it's like, I think there's a tendency to treat open source models as, like, a homogeneous category, right? But then there are actually very different use cases, right? So, if I want to do, like, I don't know, a new reasoning paper, I'm gonna use a Qwen model, right?

But then if I'm doing distillation, I'm gonna use DeepSeek or Kimi. I think that fits into the OpenAI question, because in my mind, still, I mean, we'll see. I'm sure it'll be, like, a great model, but I don't quite see how it would fit into the ecosystem. Right? Because is it gonna be something that people are gonna build research on?

Like, if it's a post trained model, probably not. Right? And then so then you're thinking

Nathan Lambert: Yeah. But their tweet was about safety, so I doubt they're releasing a base model if they, like, delayed it for safety. And I do think that they actually did delay it for safety. It's very in OpenAI's culture. But I don't think it's gonna change the ecosystem; it's, like, an interesting one-off.

Because I also don't expect them to release a model that's based on their GPT-3.5 or later architecture. I bet they, like, took an off-the-shelf architecture, which is actually probably based off of Qwen or Llama. So, like, a lot of the recent OLMo models are, like, very Qwen-y. And then, like, you choose the sizes based on what fits on the cluster, and I think Qwen is very deep rather than wide. And, like, OLMo 2 is very similar to that.

So it's like, I bet that the OpenAI model is probably also gonna fit that mold, which is pretty funny.

Ross Taylor: I think so. Yeah. I guess one way to think about it is they're just trying to distill their internal stack into the open space, right? As opposed to making the architectural choices clear in public. But then, yeah, back to it, do you think, I mean, maybe it's a question for you, Nathan, do you think the OpenAI open model is, like, more comparable in its use case to, like, a Kimi or a DeepSeek?

Or is it on the Qwen level? Or is it actually something completely different where it's supposed to be, I don't know, like, an on-device model? Smaller.

Nathan Lambert: But I expect it to be smaller. They joked about on-device, which I don't know is the right framing.

Ross Taylor: Yeah.

Nathan Lambert: I'm also just realizing now, if, like, RL is their great strength, part of the challenge of shipping an RL model in the open source is that you need your training infrastructure to match the inference infrastructure. So it's like, unless they train this on, like, the exact vLLM that people have access to and some weird, like, open source environments, they're not gonna be able to dump this and say, oh, you could do search and code execution in your open model stack. So there's so many weird things like that. I don't know exactly how Qwen and DeepSeek have gone about it. My impression is that they're actually not as useful in terms of tool use because it's so hard.

Like, I think that tool use is naturally a closed-model-reinforcing thing because it just benefits from having these tools match up.

Ross Taylor: Yes. I've, like, seen the Qwen models are pretty good at things like function calling and stuff, but I think it's a more recent thing. So Kimi, at least in the benchmarks, was, like, pretty good at these kind of, like, agentic tool use benchmarks. And then, I mean, it's a separate discussion, but they had this nice training innovation, right, where they just call these MCP servers, which is a nice synthetic data strategy. But yeah, it depends, right?

Because you're just seeing mostly headline evals, which you shouldn't really trust anyway. So

Nathan Lambert: I think of Claude 4 as the one that kind of ended the eval chasing. That release was, like, on paper, so lame, but, like, it delivered for everybody, which is very bold, honestly, for Anthropic, because there's a lot of money on the line. They're constantly fundraising. If one, like, funder gets spooked because they're like, oh, your numbers are bad, like, it's a lot of CEO calls they've gotta make up.

Ross Taylor: I think I was thinking about this a few months ago. It might have changed now given the pace of AI, but I'm thinking, like, what's the timeline for the impact of a release? Right? So day one is just, like, to be honest, like, bullshit benchmarks. Oh, I've got, like, this amount on MMLU Pro.

But then the next tier is like the day after where people have got all these like weird bespoke evals on Twitter. And then there's like, I know

Nathan Lambert: The pelicans and the rotating hexagons and balls.

Ross Taylor: And then you're getting more confidence, because you're like, unless they're very smart, which I think some of them are, by the way, they probably haven't optimized for, like, the day two benchmarks. Right? But that's when you're beginning to believe, oh, actually, like, maybe this stuff actually generalizes. Right? And then it's, like, the week or two weeks out where you have, like, the real, okay.

I've actually tried it myself quite a lot now. It's actually, like, yeah, very good. So, yeah, that's the timeline.

Nathan Lambert: Yeah. Refute my claim. Chinese providers are still optimizing for benchmarks more than OpenAI, Google, and

Ross Taylor: Yep. I mean, probably said it.

Nathan Lambert: It's like, it feels so obvious to me. I think that China has closed the gap to a remarkable degree, but I don't think they've caught up fully. I think that's hard. It's just that it's just hard to get all of that data and pipelines in place. A lot of it is actually, I think, user data and, like, know your user, and then hill climb on that.

So, like, all these APIs not working is a huge issue for them.

Ross Taylor: Yeah. And I think they've been helped by the fact that the reason they haven't been quite exposed is, imagine you're an academic, right, writing a reasoning paper. You can do stuff where data is available, like math and code. Right? So you're already in the kind of space they've optimized for anyway.

So therefore, yeah, even the stuff which kind of reinforces Qwen use is not really, necessarily, like, testing the true bounds of the generalization of that model. Right? Because we already know that the Qwen models are, like, heavily mid-trained on math and code. So, like, we're not, yeah, really exposing it maybe to some of the tails, which are more interesting.

Nathan Lambert: Yeah. Okay. This is a good preview for the episode. I think the main things are definitely gonna be the training organizations and then, like, so-called academic reasoning research and how to bridge that. I think we can start with the org chart question, essentially.

It's like, how do you make a good org? Alright. Or there's two things. One, how do you make a good org chart for training language models? And two, how do you make an effective culture?

I think this is quickly becoming one of my favorite little niche interests because there's just so much intrigue on the people side of it, like, at the individual level. There's just so much money on the line to break everything. So you sent me some hot takes if you wanna read them, but the floor is yours for what does and does not work.

Ross Taylor: Yeah. So I think, I mean, if anyone's been on, like, social media, then you've seen these, like, kind of, like, NFL draft style tweets of someone being recruited by an org. First of all, researchers have always moved between orgs. This is not a new thing.

And a lot of the moves that were hyped were just regular moves. But I think there's also just a general tendency to see, like, the bottleneck in a lot of LLM projects, at least on Twitter, as being, like, kind of skill issues. And at least in my, like, n equals 1 experience, that hasn't been the case. And I think there's a number of ways to, like, make this case, but I'd start by saying, like, machine learning is just a heavily empirical science. Right?

So, like, what does genius even mean in that context, or what does talent actually mean? And there's certainly some skills which are useful, like, how do you form, like, the right, you know, minimal viable experiment? Like, how do you iterate fast when, you know, a research direction is gonna hit dead ends? But a lot of it, to be honest, comes down to, like, hard work, good infrastructure, and, like, resources. So in that context, like, to be honest, most of these orgs, you know, even before, like, you know, certain public, like, failings, had, like, very good people.

Right? So I don't think the difference in talent between different orgs is that big, to be honest. Like, you know, smart people figure things out eventually. So then more often than not, like, the difference between, let's say, a good model versus a bad model is actually just reflecting some inefficiency in the ability to channel resources to your talent. So I think that's the fundamental point.

Now you could say on the flip side, okay, Ross, well, if that's true, why is Zuck paying people these massive amounts of money? And I think that's a separate question. But yeah, more often

Nathan Lambert: No, seriously, what do you think?

Ross Taylor: Yeah, so I'm kind of torn on this, because on the one hand, I think the new group will probably make very good models. Like, I think, yes, they're, like, very smart people. And I think having a new org as well, I think, is the right way to do it. So I think in the leadership's mind, it's probably just a case of, like, look, we tried this multiple times.

We're very serious about this. We have resources. So let's just do, like, the maximum conviction play. And I think that's, like, broadly what you would do because it's still, like I mean, it's big expenses, but it's still not, like, you know, massive, massive spend. Right?

I mean, so I think that's gonna work. But on the other hand, I do feel kinda sorry, and this isn't just a Meta point, but in general, like, I think it's a shame that a lot of the organizations don't have good mechanisms to identify the talent that's already in their orgs, like, doing the hard work, and they need to do things afresh. I think that's the kind of tragedy of it. But yeah. So it's kind of, that conflict is in my mind. I think they'll make great models.

I think it's the right approach to do things afresh. But at the same time, it's a shame that all the people who ground out the previous generations of models are sometimes just treated a bit like an asset. Right? So you've used them, you kind of, like, worked them hard, and now you move on to, like, a new group of people. Right?

So I think this is just a minor point, but, yeah.

Nathan Lambert: You put this in your provocations. You sent me that language modeling labs are sort of like banks where people are slotted in to burn out and burn through. I mean, I know a lot of the work that needs to be done is somewhat mundane data work, and it can be parallelized if you're like, our users are asking this type of question, let's create new prompts and manage human workers and create synthetic data pipelines for this one thing. And it works a lot of the time.

What is it, like, Dwarkesh has these podcasts with Sholto and Trenton. And, it's the one that works: they've both moved jobs, which reinforces your point. But they're like, oh, you just need to convince someone at a frontier lab that this problem is important. So it's like, people talk about this. So you just have to do this.

Do you see a lot of people being dispatched to solve these specific things, versus the dynamic where the individuals are kind of given free rein and it's fun on the ground and you choose the things you want to add to your beautiful final model? So you can present a positive and a negative. It might vary across labs, but I guess your provocation is that there's a bunch of places where it kind of is a meat grinder and you just put people in and chew through them.

Ross Taylor: I think so. I think the model for, unfortunately, like, a lot of successful tech companies is just, you get very young, like, motivated, definitely a base level of smart, people who are just willing to work very long hours on, like, a strong mission. Right? So that was, like, the original, like, Elon way to run a company. Right?

But I think that's the model for a lot of frontier labs. Right? You have your soldiers who are the ones who, like, yeah, traditionally, like, on the surface, they look like, you know, quants at, like, kind of hedge funds, like, ten years ago, who are just gonna, like, work incredibly long hours on something that they think is impactful.

You have a culture, like, of friendly competition where everyone wants to be the best. Right?

Nathan Lambert: I will say, I know a bunch of people at OpenAI, and they do work crazy hours. Yeah. That's funny. Like, I also work a lot, but I do a lot of things that aren't grinding data to go into the model. I do things that I think are at least partially fine.

Ross Taylor: Yeah. And then I think the decisions are generally made by, like, people who are a little more experienced or at least have, like, some successes to their name. But, yeah, you need to have soldiers in this kind of climate. Right? It's just highly competitive.

And I think that's a shame, I think, at least, because even for myself now trying to build a startup, I'm trying to think, obviously, we need to work hard, but is there an alternative where you kind of invest in your employees as opposed to, you know, burning them out and then moving on to a new

Nathan Lambert: Oh, this is

Ross Taylor: really that's, like, what I'm trying to work out for myself.

Nathan Lambert: I feel like a lot of people are just kind of more cynical now in tech, myself included. Because it's like, I got a great cold email from someone fresh out of undergrad, and I was like, I'm pretty sure in two to three years, this person's gonna be super legit. And it's like, I would tell a coworker about it, like, what do we do to, like, how can we capture that? And they're like, oh, yeah.

Well, if we do it anyways, they'll go to OpenAI in two years. We don't get any of the upside. So I think some of that is just kind of cynicism. And investing in people is still the right thing to do, because you'll end up keeping the ones that are a bit more grounded, even if it's really hard.

I mean, I've lost people that are extremely talented that I would have wanted to keep. So I don't know how to balance that cynicism versus the reality of building teams in the long term. I would guess smaller teams might be a bit easier to maintain, whereas if you're at a tech company, the churn is kind of impossible to prevent, because there's so many levels in moving up. I think a lot of the rumors at Meta, like, especially around Llama 4, are just, I mean, Dylan Patel of SemiAnalysis.

We could find the quotes from that. He was essentially like, they were doing the most cowboy crazy model training ever, like changing the pre-training mix halfway through. And that points to, like, middle management being like, I need to use my data so I get my promotion. But most labs, I don't think, are doing that type of shit on their leading models. And I don't think Meta is normally doing that.

I think that was a pressure cooker side effect.

Ross Taylor: I think actually, in a weird way, all of these labs, at least from what I've heard, are, like, deeply chaotic places. Like, they change direction every week, right? I mean, that's just the nature of the field we're in. But then maybe, definitely, certain labs are good at projecting, at least externally, that they have their shit together. They have AGI internally, all this kind of bullshit.

The truth is it's like a shit show everywhere. It's just that if you're gonna be a shit show, you at least wanna be a functional shit show, and you wanna make good models. Right? I think there's definitely plays to be made about, do you take the view that you want to invest in your talent more as opposed to just grind them out? I think if you're a startup, you don't have a choice because you can't grind out your employees if you don't have 10 of them.

But then there's also, in my mind, like, yeah, especially in, I would say, like, kinda lab culture, I think people tend to overvalue just, like, raw talent, again, especially in an empirical science. So again, if you take the view that an empirical science is mostly about, like, experimental velocity, then you don't just, like, value infrastructure in that world. You also just value, like, okay, I wanna hire someone who's just, like, very collaborative and is, like, very willing to help other people out.

Right? It sounds like a bullshit point in a field that, like, lionizes, like, individual intelligence, but I just feel like if you're making a marginal hiring choice, like, how does someone, like, add to the existing group? Right? Are they gonna, yeah.

And I think these things are actually undervalued just because, I think, in the minds of people now, it's just about, like, okay, find the smartest people who are gonna, like, you know, be super cracked, to fit this kind of narrative and stuff. So, yeah, I think there's new plays to be made on talent. But it's difficult.

There's nuance because, like, don't get me wrong, there are people who are, like, especially productive. Like, I've seen it in person. It's not like, you know, everyone's, like, equal. That's definitely not the case. But it's just that I feel, yeah.

Individual talent isn't everything.

Nathan Lambert: The differentiation right now is honestly just people who are willing to put in more highly focused hours turning the crank. I think every organization has this baseline of the cost of being there in terms of meetings, whatever your life is, maybe you have to live somewhere where you have to commute or something. But then it's just like, in terms of AI, unfortunately, it seems like the people that do more and more just have a bigger fraction of time actually spent doing stuff as well, which just favors young people that don't have a lot of responsibilities. Yeah. It just is kinda like that.

Ross Taylor: But this is maybe a transition onto another point, but, like, maybe I'd make a more controversial point, which is that even the things in ML which seem more in the realm of, like, doing novel research, you can also pitch those as a form of just, like, persistence as opposed to, like, inspiration. Right? So take, like, you know, this time last year, we were both speculating about, like, what o1 was and Strawberry was. Right? And the speculation, like, tends to make you think it's, like, some amazing new thing.

But, actually, when you looked at it, I mean, it was what you were basically doing and what I was doing, like, two years ago. Right? Essentially, just, like, RL with verifiable rewards. But with, like, probably, a, very good base models, right, because they were in a good position to do it, and, b, enough ablations to find, like, a mix that worked. Right?

So I know that's, like, oversimplifying after the fact, but, like, just take the view that they had to do the work to make the recipe good. Like, it just comes down to experimental velocity. Right? And then also having the right infrastructure and a good enough base model. So then in that world, like, what is talent?

Right? Is talent, like, the person who says, oh, we should make the models think more, or is talent the person who's actually on the ground, like, doing the ablations to find out which recipe works? Right? Because I can also make models think more by doing best-of-n, but, obviously, that's not a very good way to do it. Right?

So Yeah.

Nathan Lambert: I mean, I think I analogize a lot myself with my athletics career, like, rowing in college. I think so much of it is the same. It's like, I wasn't the most gifted athlete, but if you put in the hours and, like, understand where you're spending your effort, it works out for people. So a lot of it's, like, the super talented person that's doing this, like, complex, like, n equals 1 research project, versus, like, okay, the other person just hill climbing, and the data will win out.

I think the question that I want to ask you on this topic is, given that these orgs are so chaotic, what does this mean about the ceiling on progress? So one of the most coveted questions is, what is the trend line? I think there's obviously going to be new paradigms. I think inference-time scaling was actually a quite obvious one if you were to look at first principles of what compute and intelligence are. But even if we don't have a new paradigm, like, what is the ceiling?

If there's so much chaos, I'm biased to think that the ceiling is not that close.

Ross Taylor: I think it's interesting. Right? Because I think even in climates which are, like, organizationally chaotic, you're still gonna have things which kind of, you know, lift all boats. Right? So, like, a good example recently was, like, you know, these, like, gold medal results at IMO.

Right? So I think it was, like, three different labs all had different approaches and found they crossed the threshold. Right? So if you were to zoom out, or one way to do this is, imagine you're looking, like, twenty years into the future back at this time. Right?

Like, would you look at the individual methods that these researchers did, or would you just say, oh, they just reached, like, a critical threshold of compute where things start to work? Right? So I think compute is, like, unfortunately, that's, like, the big, like, kind of exponential that's, like, underlying all of this. And then if you zoom into, like, a shorter, like, time horizon, then you're getting to things more like, okay, what's the current challenge?

Like, what's the bottleneck? Right? So maybe the bottleneck to, I don't know, agentic models is, like, RL environments. Right? Or maybe the bottleneck to reasoning, you know, even better, is, like, longer context windows, and those are, like, the shorter term things.

But, yeah, fundamentally, so long as, like, compute continues to increase, like, I think the trends look good. And then all this kind of, like, organizational stuff is just, like, short term noise, which slows down progress a little bit, but it's not that meaningful in the long term. But, unfortunately, it's still meaningful for people in their careers, because, like, okay, one to two years of, like, organizational chaos or whatever could matter. But on bigger timelines, it doesn't really matter.

Nathan Lambert: Yeah. I mean, I agree. It seems like the question is what happens when the fundraising starts to slow down.

Ross Taylor: Yeah.

Nathan Lambert: It's like, we're on a trend line of compute rollout. And then if Sam Altman can't raise again, that is a very big sign. That's like the end of the quote unquote bubble, and OpenAI is not gonna go away because of that. But it's just, if OpenAI can't get the next cluster that Google is using, that's where, it's like, we can't make arguments about whether it was some miracle until Sam Altman can't raise anymore. Because otherwise, Google and OpenAI are gonna be doing effectively the same thing.

Ross Taylor: I mean, I'm quite optimistic, because I think it's just, like, you only have a bust if this, like, AI, like, ceases to be useful, or at least, like, it doesn't live up to certain promises. But, you know, even if there's no algorithmic progress, I still think AI is gonna continue to be increasingly useful. I don't think there's any fundamental barriers. It's just a question of, like, how quickly, like, you get that. Right? I think that argument would have been slightly different two years ago, because if the reasoning paradigm didn't come through, then I think it would have been trickier to justify some of the expenses, because then you'd be looking at, like, benchmarks on reasoning, thinking, oh, shit.

Like, to push this forward, I need, like, this amount of data annotation or this amount of new data.

Nathan Lambert: You look at GPT 4.5 as the example.

Ross Taylor: Yeah. Exactly. That's a really good example. So that's almost like a counterfactual universe where, you know, reasoning didn't happen, and we're all looking at this and saying, okay, it's good at, you know, creative writing, but then, okay, it's not really doing the things we were

Nathan Lambert: It's not that good at writing.

Ross Taylor: That's tangibly better. I'm sure it's a really good model, by the way. I didn't play with it enough to find out myself.

Nathan Lambert: I've been using it a lot. Like, I used GPT-4.5 for a long time, especially until Claude 4, kind of, like, it's just nicer, especially when GPT-4.1 was so sycophantic. I was like, I can't use that. But GPT-4.5 was still, it's interesting in a way, like, how, like, different models didn't really have it for normal stuff. If you're just asking about any random thing that a language model will know about, you can

Ross Taylor: just talk about dogma. It had a good vibe.

Nathan Lambert: So I think it's quite a good model, but it's also just such an interesting release in the history of where AI was going.

Ross Taylor: Yeah. So I'm gonna flip it around. I have a question for you, Nathan. So, like, let's say we're here in a year's time. Like, you know, what does the key benchmark look like for LLMs that everyone's focused on?

Nathan Lambert: Oh, it's fully gonna be some, like, agentic thing. I don't know if it'll be as stupid as how much money does it make on the stock market when it's doing it, but I had written this post on what comes next. And I think one of the most poignant things I was looking at in this is just, scaling is not really the path that models are taking anymore. And it's like, all the marketing is shifting to agents. And I think some of that is just 'cause it's not easy to scale parameter size anymore.

Scaling RL is happening but it's not gonna make these huge gains. Every RL curve is this log plot, and we've taken the first log of the performance, which is, like, 90%. So it's just hard. But the agent things are working so well. So we have Claude Code show up, and there's gonna be that in all sorts of domains and more people working to evaluate it. So I think it's an interesting marketing problem at the same time, where all the labs need to refigure out how they communicate that their model is so good.

Like, Claude 4 didn't do it. They didn't land that, but it was good. So it was okay. Exactly. Yeah.

But everyone needs to, like, switch this narrative. It's just like all the model sizes, GPT-4, 4o, and then it's like 4.1, mini, and nano, and then Gemini Pro and Flash, and the Claude Opus and Sonnet. Like, all these things have the same size classes. Like, if they really 10x'ed the size, they would give it a new name. And I think that'll come eventually, but in a few years.

But I think it's all on this agentic side; it's a big shift in what the language modeling companies need to think about. And, like, the prioritization of the company is also different, where it's like, the modeling has always been central for us. And I'm still, like, modeling pilled. So I still think that that is the most important thing to the company. I think I've said that's the most important thing at Ai2 too.

And, like, these open models, just because, you know, Ai2 can offload who is building products and agents on OLMo to the rest of the academic community. And, like, OpenAI can parallelize this to many teams building products. But these sub-teams that are building products are gonna hold more weight than they used to. Yeah. There's gonna be more interesting kind of, like, how these companies manage it, and how they do comms will change.

So, yeah. I mean, I think Claude Code is great. I think that it's hard to integrate in some things. Like, how do I get Claude Code running on our cluster at Ai2, where we have all of our data and models and then launch evals from our, like, file system on the GPU machines? It's like, I don't think I can just slot Claude Code in on that.

Maybe I'm doing something wrong.

Ross Taylor: I think it's a

Nathan Lambert: stuff like that too.

Ross Taylor: Yeah. I agree with your, yeah, your answer. Like, I think the way I see it, so there were several years when I was doing, like, Papers with Code, which tried to, like, focus heavily on the kind of evals before they were even, like, a big thing, you know, trying to index all these leaderboards and stuff. And I think now is an interesting situation, because I feel like if you make good evals now, you possibly have, like, more leverage than you've ever had, like, in the field of ML, which is a weird thing, because traditionally, evals were quite an unsexy thing to do. It was a thing that researchers didn't wanna do because they'd rather be training models.

But now the ability to define, like, a, yeah, metric, like you said, which maybe gives some leverage to products, but then also just, like, a capability that you'd like to see, whether it's like, okay, the model is good at, like, trading stocks, right, or is, you know, good at, you know, doing scientific research. Right? I feel like there's incredible leverage for, like, a small group of people in, like, even, like, universities to say, okay, this is the new, like, north star that we should try to achieve for agents, and then, like, take control that way.

Nathan Lambert: It can happen. I mean, we released, like, an IFEval replacement, which is IFBench, which is just more constraints, like, a different prompt sourcing. It's just like a harder IFEval. And then I was like, okay, we need to make the goal having, like, two of the frontier labs adopt it, because it doesn't work well otherwise.

Yeah. And it's like, I message people, and, like, I messaged someone at OpenAI, and they're like, oh, yeah, did that last week. And I was like

Ross Taylor: Yeah.

Nathan Lambert: Yeah. Like, who else is making research that, like, actually has a shot of getting into the OpenAI internal platform?

Ross Taylor: Yeah. Yeah. Exactly. Right. Yeah.

Yeah. Yeah. So it's incredible, yeah, leverage. And then I think the other interesting thing, though, is, like, the friction to actually making and using good evals is just gonna increase, like, quite a lot. Right?

So even, like, some of the stuff recently, like, the ML scientist kind of ones, like MLE-bench and PaperBench, some of OpenAI's ones, you know, in the case of some of these benchmarks, the RL agent needs to have a GPU available to do ML research, and you need to, like, spin up, you know, lots of servers for the RL. I think long gone are the old days where you just have, like, two CSVs or whatever, which is, like, a train and a test split. Right? So that's on, like, the user side. But then even on the eval creator side, there's a big difference, especially as the models become more capable.

A bad eval just means that you're going to get incredibly egregious reward hacking, and you're not going to learn anything useful. And a good eval is, like, quite a new capability.

Nathan Lambert: I have a related question on this.

Ross Taylor: Yeah.

Nathan Lambert: So I see three eras in evals, in some ways, based on what people are doing with models. At pre-training, the best evals are testing knowledge and these very broad things, and they're hard to game. It's just kind of like FLOPs. Yeah. At post-training, a lot of evals are actually formatting and extraction.

I think formatting became even clearer to people when these RL environments became the hot new thing. Yeah. And I actually think that post-training might be, like, the ugly duckling in the middle, where then if you go into agents, all the agentic tasks are gonna be evals of actually doing something, and you can't, like, format-lie your way through that. So it might be that, like, post-training evals are actually the hardest ones to get right.

Ross Taylor: Yeah. And I think you're gonna see more cases of, like, people claiming, like, good results. But then when you look at the actual traces, it's, like, absolute, like, insane reward hacking. So the meme right now, I'm not sure if you've seen it, is, like, the KernelBench evals. Have you seen these?

Nathan Lambert: Oh, So

Ross Taylor: You have, like, these amazing speedups, which, like, yeah, first of all, aren't even, like, plausible given, like, basic information on the configuration of the hardware. And, yeah, it just goes to show, even, like, that's not a problem with KernelBench. I would say it's more a problem with the people, like, publishing the papers without looking at the results.

But just to get, you know, an eval in the right place for, like, a task like that is actually a lot of work. And even with the progress in models, like, I don't think you're just gonna be able to automate the construction of a good eval like that, at least in the next year. I might be wrong. That'll certainly help. So I think, yeah, that's an area which just has a lot of leverage right now.

I mean, I think if you were to ask me off the top of my head, like, what is the central eval right now, it'd probably be something like SWE-bench Verified. But even that is, like, in my opinion, quite saturated. So there's, like, a big blue sky where someone can, like, define what the next, like, task is for ML. And actually that doesn't need, like, a big cluster to define. So I think that's quite an exciting thing.

Nathan Lambert: Yeah. And when you think about the amount of money that'll be steered by these things, it's so crazy to have the uncertainty there, and, like, who will come up with that as well. I think that's part of what makes it fun. The pace is obviously fun, but having more people actually contribute is good. Yep.

We should talk about reasoning things.

Ross Taylor: Reasoning. Yeah.

Nathan Lambert: Where do we start? I don't think I've ever done, like, that mean of a rant about the academic community chasing these things. As a pretext, I understand why individual academics are doing this, which is: a new algorithm, something that is in well-established systems, showing remarkable scores. But a lot of these papers are just kind of extracting things that are hard to document from a model, or something else, or formatting

Ross Taylor: Yeah.

Nathan Lambert: Yeah. Or something like that. So, I mean, I was on one of these papers, which was hilarious. It's just, like, you figure out that if you train Qwen on random rewards, the evaluation will go up. You have to go through the logic of how this can happen.

Because if there's no reward, the advantage is zero and the gradients are all literally zero. And then it turns out that the algorithm amplifies the most common sequences. It's actually something, if you read a lot of the reasoning literature, people talk about, like, we wanna make sure our algorithm doesn't squash these uncommon sequences. And then, like, the real hammer is, oh, if you do random rewards, like, literally the model just kind of has mode collapse onto the things that it was trained on. It's like, yeah.
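A minimal sketch of the GRPO-style group-normalized advantage makes the zero-gradient point concrete; this is illustrative code, not the exact implementation from the papers being discussed:

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-6):
    """GRPO-style group-normalized advantages for one prompt's rollouts."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# If every rollout gets the same reward (e.g. no reward signal at all),
# the advantages, and hence the policy-gradient term, are exactly zero.
print(grpo_advantages([0.0, 0.0, 0.0, 0.0]))  # [0. 0. 0. 0.]

# With random rewards the advantages are nonzero noise, and implementation
# details like clipping can end up nudging probability mass toward behaviors
# the model already prefers, which is the effect described above.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # roughly [ 1. -1.  1. -1.]
```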

And that can make scores go up. So it's like, if you have a model that two-thirds of the time has a certain behavior in its reasoning and that behavior is good on the benchmark, if you just fiddle the weights a little bit, sometimes it does that more. It points to a somewhat structural failure. I would also say it's a good example for why people should be using truly open models for research purposes and why they're so good for innovation. Because it's like, all these great human hours. If we knew what goes into Qwen's data and someone just filtered it and it was like, oh, look, I found the GPQA prompts in it.

It's like, yeah, data contamination happens. It's not necessarily that; the Qwen case is borderline. I don't know exactly how to characterize it, 'cause the Qwen models are fantastic, but there's so much research that is showing that they are very likely to be doing some dubious things in terms of benchmarks. It's hard for people that aren't super in the weeds to hold both of these in their brains.

So I don't know. What do you think of the last six months? Have we actually made any progress? Has the academic community made any progress?

Ross Taylor: I think there's been, like, little progress. In the literal sense, little progress. There has been some progress. Yeah. I think you can answer it in different ways.

So I think after DeepSeek came out, there were at least two approaches in open source more generally, which was either you go down kind of distillation routes to make, you know, interesting small models, or you go down the RL routes. I think the initial thing that was kind of undervalued, at least from a practical engineering perspective, is that if you're dealing with smaller model sizes, it's just, like, way more efficient to do distillation than RL. But obviously, from an academic perspective, you wanna do the RL. Then on

Nathan Lambert: You mean not just in compute, but also in performance. It's like, it's hard to do RL on the small models.

Ross Taylor: I think that point's been made twice now. So there was the original DeepSeek paper, and then more recently there was a Qwen paper as well, which I think showed that the RL took, like, 17 times more compute. So one way to think about that is that RL really is, like, a brute force lever to do data generation. But then, okay, assuming that RL is still good and that you wanna be able to do research in academia, I think the difficulty is, like, it's just a classic problem where, if you don't have enough compute, you don't know whether the structure you're imposing is gonna generalize.

And my worry is that a lot of these kinds of results are at, like, relatively low compute budgets, both in terms of, like, the underlying base model, which determines, like, how well the RL kind of approach learns, but then also the number of steps. So it's just quite hard to see, unless there's, like, a massive gain, what's truly important. So, like, the most useful things are actually, in my opinion, quite boring things. Like, you know, there was the DAPO paper, where it's like, okay, you know, the filtering for overlong sequences, we shouldn't, like, bias on that.

I think there's been some interesting stuff showing that maybe even kind of simple approaches in GRPO might work, way down to the clipping. So Reco was doing a lot of good work using, like, kind of reinforcement learning with leave-one-out. But even there, it's kind of a guess, because you just don't know, you know, is that algorithm gonna generalize to, like, agentic traces? Right?

So it's not clear. I think the recent stuff this week was actually quite good, the GSPO stuff. I think that was, like, well, they actually found something. And then if you saw their graphs

Nathan Lambert: Explain it to people. I think a lot of people have heard of the other ones by now. But, like, GSPO, I think, is group sequence policy optimization, yeah, with Qwen Coder. Why are you positive on it relative to all the others? I think the ideas are well motivated. But, like, why is GSPO actually getting hyped more?

But, like, why why is GSTO actually getting hyped more?

Ross Taylor: So I hope I don't botch this, because it's the morning. But, essentially, with GRPO, you assign, like, a reward to the whole sequence, like, an advantage. But then you have this kind of importance weight, which is, like, your kind of, like, policy likelihood versus the old one. Because when you do RL, you typically, like, sample lots of rollouts, but then you do several mini batches. So that means in practice, you go a little bit off policy.

So to fix that, you just have this, like, kind of, like, term, which is like an importance weight. But while the reward is applied to every token, like, kind of uniformly, the importance weight is for each individual token from a single sequence. So, like, one way to think about that is, if you had more sequences, your importance weight would be, like, a more accurate way to reduce the bias. But, actually, if you're just doing it on a single sequence, it's actually introducing a lot of variance.

So, essentially, the short answer of what they do is, instead of just looking at, like, a token likelihood, they look at the likelihood of the whole sequence. So now the clipping is not, like, on an individual token basis, but, actually, it's looking at, like, let's say, you know, one of the sequences in your group and saying, okay, this is, like, less likely, we'll just, like, not look at that sequence. And the TLDR is, at least from the results they show, it seems to be a lot more, like, sample efficient.

I mean, it's not just, like, you know, point five percentage points or something like that. But I think the reason I trust it more is it's very simple, and it seems to be, like, quite directionally well motivated just, yeah, from a basic, like, understanding of importance sampling. If it were more complex, I'd be a lot more skeptical, but it's fairly simple and seems to work here.
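A rough sketch of the distinction Ross describes, token-level versus sequence-level importance ratios, assuming the usual PPO-style ratio of new to old log-probabilities; this is illustrative only, not the exact GSPO objective:

```python
import torch

def token_level_ratios(logp_new: torch.Tensor, logp_old: torch.Tensor) -> torch.Tensor:
    """GRPO-style: one importance ratio (and clip decision) per token."""
    return torch.exp(logp_new - logp_old)             # shape [seq_len]

def sequence_level_ratio(logp_new: torch.Tensor, logp_old: torch.Tensor) -> torch.Tensor:
    """GSPO-style: a single length-normalized ratio for the whole sequence,
    so a rollout is kept or clipped as a unit."""
    return torch.exp((logp_new - logp_old).mean())    # scalar

# Per-token log-probs of one rollout under the current policy and under the
# (slightly stale) policy that generated it.
logp_old = torch.log(torch.tensor([0.50, 0.40, 0.60]))
logp_new = torch.log(torch.tensor([0.55, 0.35, 0.65]))

print(token_level_ratios(logp_new, logp_old))   # a ratio per token
print(sequence_level_ratio(logp_new, logp_old)) # one ratio for the sequence
```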

Nathan Lambert: Yeah. I mean, I'm still fairly skeptical. I think with all of this, you can think about the academic research as being relatively wide in what people are trying right now and the labs being relatively narrow. And when you're further along in your modeling journey, you're dealing with different parts of the state space, and then these algorithmic tweaks just, like, help your model on whatever blocker it was, or your implementation. I thought GSPO, the sequence thing, was so funny, because when you read the GRPO paper, you're like, oh, the reward is just per sequence.

Like, all the tokens in that sequence will get the same loss function. Yeah. But the standard implementation is to break it down per token. And then this GSPO is essentially to take that standard implementation and change the weightings on every token back to the sequence level. And I was like, is this really gonna be a major thing? Like, it's a cool idea.

And I think, especially for junior researchers, one of the good things about this era is that you can really learn the math by studying all these algorithms and thinking about how they get implemented. I hadn't done that in a few years, and that was me writing this, like, RLHF book chapter on policy gradients. It's like, oh, boy. Like, why the heck is it length biased if you're doing a per-token loss instead of, like, in GRPO? Why is that? Like, what?

And it's the fact that you have these normalizing factors, and if you have these per-token probabilities, those are gonna be roughly similar, but a longer sequence has a bigger denominator when you're length normalizing it. And then if that's your loss, then you have a smaller loss, or, like, a smaller gradient, or something like this. And for students to be able to do this in their brain, it is really good for thinking about the interface between algorithms and systems. But I'm team GSPO-doesn't-really-matter. Qwen had collected some really beautiful results on their infrastructure.

Because there's even, like, worries on, like, the difference between the gradient ecosystem and the inference ecosystem in the open. And I think everybody kinda has some of these worries. There's just too many things to nail down, so I think nailing down exactly, like, what your own system is doing tends to create better models.
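To spell out that length-bias point in numbers, here is a toy illustration of sequence-length normalization, assuming the common setup where a rollout's per-token terms are summed and divided by its length; it is not any particular library's exact loss:

```python
# Toy illustration of the length bias: with a loss that divides a rollout's
# summed per-token terms by its own length, every token in a longer response
# carries a smaller weight in the gradient.

advantage = 1.0                      # same positive advantage for both responses
len_short, len_long = 10, 100        # a 10-token and a 100-token response

per_token_weight_short = advantage / len_short   # 0.1 per token
per_token_weight_long = advantage / len_long     # 0.01 per token

print(per_token_weight_short, per_token_weight_long)
# A good long answer is reinforced 10x less per token than a good short one
# (and a bad long answer is penalized less), which is the bias in question.
```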

Ross Taylor: You know, it's interesting. I think as AI became hyped after Chattypete and you just see, like, more people, like, reading papers, which, overall, I think is a great thing, by the way, I think you have more people, like, just reading papers kind of in the wrong way. I think for me, it's just, like, the the basic logic is, like, how much like, what's the reported gain of the paper and, like, how much complexity does it introduce? Right? So if you get, like, a gain, but it's just, like, shitloads of complexity, it's probably not gonna stand the test of time.

Whereas if it's something relatively simple but seems to get a good gain, that's the thing that's going to last. So I don't want to be harsh towards academia exactly.

Nathan Lambert: The o1 lesson. Yeah. Yeah. The simple thing winning in o1.

In RL research, I've heard it described as: if you see something that only beats the baseline by a few percent, it's not going to work. But if it's 2x, that's the real innovation, because whether or not they tuned their baselines, they're still crushing it.

Ross Taylor: Yeah. Exactly.

Nathan Lambert: So I think that's a good heuristic for people right now.

Ross Taylor: And I think researchers are their own worst enemy, because they want to see their own methods work. But at the same time, the weird thing in ML is that neural networks want to learn. Right? So if you push something hard enough, it will work. It's just a question of: is it a good use of your time?

So it's like, okay, what's the right thing to scale? Right? And that's why, when you read papers, at least what I say to younger researchers is: always think in terms of how much complexity, how much gain, and do you trust the gain? Right?

And then based on those three factors, you ask: is this worth reading? At least if you're new to reading papers, sometimes it's easy to see, oh, okay, this new technique looks super cool and it has this kind of gain, and sometimes it's

Nathan Lambert: Yeah. Papers aren't supposed to be about storytelling, or they're not presented as if they're about storytelling. But researchers present the results of their methods in a way that tells a story. So when you're making a paper, you think about the story the reader will take away.

But also subconsciously, researchers are manipulating their own results to make that story manifest. And I think these algorithms are a perfect example of it.

Ross Taylor: Yeah.

Nathan Lambert: So when you think about, what's the word for this, not quite a mental model or a mind map, but the cognitive behavior of the other side, it's much clearer. You have to read a lot of papers and be chill to do that.

Ross Taylor: And in the reasoning space, the other point, and everyone knows this now, is that I understand people have focused on math and code because that's where the data availability is. But if a paper comes out and it's on the AIME benchmarks and GPQA, it's just a lot less interesting now than it was even in February.

Nathan Lambert: I think code can be much better, but it's hard to benchmark. Describing what a good coding model is would take me an extremely long document.

Ross Taylor: But Yeah.

Nathan Lambert: That's not what the academic papers are doing. I don't know how to do that. I would love to have more, but we'd need a whole team to hill climb on that.

Ross Taylor: Yeah. And even the established ones, right? No, I mean, they're good benchmarks, but SWE-bench is, like, some ridiculous number.

Right? I don't want to misquote it, but it's mostly just issues from Django. Right? I don't want that to be a burn towards SWE-bench, because I think it's a great benchmark. But

Nathan Lambert: They already won. They already won. They can take a subtle dig. Like, they won. Yeah.

Ross Taylor: But yeah, it shows there's still a lot of nuance to making a good coding benchmark or whatever. So it's difficult, because I'm in this position where, on the one hand, I look at the papers just hill climbing on math and code and find it fundamentally uninteresting, but at the same time I sympathize, in the sense of, okay, what else is there to do? There aren't a lot of great reasoning datasets in the open. And those that are open, I don't even think they're going to be good for testing RL necessarily.

They just test something more knowledge based, like medicine or something like that. So it's a difficult situation.

Nathan Lambert: This could be a good time to transition, like

Ross Taylor: Yeah.

Nathan Lambert: What is the status of RL scaling and generalizing? What is the status of RL outside of math and code? I think my prompt is something like: what do you think about o3, models with this crazy search behavior and this multi-hop execution?

Ross Taylor: Yes. Okay. So first of all, I think it was greatly overstated, this argument that, oh, it doesn't generalize beyond math and code. What happened in practice, at least from what I know, is that OpenAI originally was very focused on math, logic, and puzzles and stuff. And then eventually they kind of had to broaden out because it was kind of too rational and focused on those kinds of benchmarks.

But I don't think it was ever in question that it was generalizing to other benchmarks. You could see that very early on. The way I think about it is: we kind of started with math and code because it was easy to verify. And then through applying RL, the model learned certain strategies, like, okay, I shouldn't just answer early.

I should check my work, or I should consider alternatives. And at a very high level, if you have a model that thinks for longer, checks its work more, and considers more things, that's going to be useful for things beyond math. Right? And that's reflected in the benchmarks. That being said, if you want to get to superintelligence or whatever outside of math and code, then yeah, you probably do want specific benchmarks for that.

And I think the question is less "does it generalize beyond math and code" and more "how good does the performance get?" Right? And that gets you into more interesting questions like, okay, if you don't have a numerical answer or whatever, how do you verify things?

Right? So rubrics are all the rage right now, but then there are also other directions, stuff like

Nathan Lambert: Rubric things are so funny and need to be reinvented. Rubric is such a funny name, because it's just question-specific LLM-as-a-judge. It's, like, the most basic unit of evaluation or feedback.

Yeah. Yeah.

Ross Taylor: So I think that was actually something that wasn't covered much in the open. The reason it became popular was that, essentially, deep research was the trigger. Yes. The rumor, at least, was that OpenAI didn't actually need that many examples in order to do quite well on these tasks.

So it wasn't tens of thousands of rubrics. I think it was probably a thousand to two thousand well-crafted rubrics for questions. But yeah, it clearly worked very well to teach a model how to browse the internet, synthesize knowledge.

There's obviously, like, infrastructural detail as well.

Nathan Lambert: What would a rubric look like for deep research in this case? Like

Ross Taylor: Okay. So

Nathan Lambert: For a general question, you can think of, like, "write me an essay on this," and then the rubric will be: it should probably be free of typos, have a clear argument, and a good conclusion. It'll be different checklists. But I think the deep research thing is a bit more complicated, so you might have to immediately draw on an example.

Ross Taylor: Yeah. So there are different themes you could have. It could be the general style of the answer. It could be, okay, let's say we want a review of the latest and greatest RL algorithms for reasoning or whatever.

Right? So first you might have something more high level, like: it should compare at least a couple of methods, or maybe it should have a table comparing what the underlying algorithm is based on. Is it policy gradient, is it PPO-based or REINFORCE-based? But then you might have more detailed criteria where you just have a strong conviction of what a good answer looks like.

Like, okay, right now it should probably mention GSPO. Right? That might change. So it's essentially just a list of criteria. But what you're actually after is a nice continuous reward the model can gradually learn from, as opposed to something sharper.

Because unlike math, where you can have a zero-or-one reward, for something like "what makes a good literature review on RL?" the reward structure just doesn't look like that.
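As a sketch of what a rubric-as-reward setup could look like in code, assuming a hypothetical `judge` callable that scores one criterion at a time; the criteria and names here are illustrative, not from any lab's actual pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RubricItem:
    criterion: str          # e.g. "compares at least two RL algorithms"
    weight: float = 1.0

def rubric_reward(prompt: str, response: str, rubric: List[RubricItem],
                  judge: Callable[[str], float]) -> float:
    """Score a response against a per-question rubric with an LLM judge.

    `judge` is assumed to take a grading prompt and return a score in [0, 1]
    for a single criterion; the weighted average gives a continuous reward
    instead of a sharp 0/1 signal.
    """
    total, weight_sum = 0.0, 0.0
    for item in rubric:
        grading_prompt = (
            f"Question: {prompt}\n"
            f"Response: {response}\n"
            f"Criterion: {item.criterion}\n"
            "Return a score between 0 and 1 for how well the response meets this criterion."
        )
        total += item.weight * judge(grading_prompt)
        weight_sum += item.weight
    return total / weight_sum

# Hypothetical rubric for a "literature review on RL for reasoning" prompt.
rubric = [
    RubricItem("compares at least two policy-gradient methods"),
    RubricItem("includes a table of algorithms and what they are based on"),
    RubricItem("mentions recent sequence-level variants such as GSPO", weight=0.5),
]
```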

Nathan Lambert: So how do you think about grader functions and stuff? I've thought about this for code, where you can do the percentage of unit tests that pass. But a lot of times your model will just get the easy unit tests then. Do you think reward shaping is kind of here to stay, or will it be washed away in the ever-growing sea of compute?
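A minimal sketch of the unit-test-fraction reward Nathan describes, with the caveat he raises built in (the fractional version can be gamed by only passing the easy tests); the test-runner interface is hypothetical.

```python
from typing import Callable, List

def unit_test_reward(candidate_code: str,
                     tests: List[Callable[[str], bool]],
                     all_or_nothing: bool = False) -> float:
    """Shaped reward: fraction of unit tests passed by the candidate code.

    With all_or_nothing=True you get the sharper 0/1 signal; the fractional
    version is easier to learn from but rewards passing only the easy tests.
    """
    passed = sum(1 for test in tests if test(candidate_code))
    if all_or_nothing:
        return float(passed == len(tests))
    return passed / len(tests)
```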

Ross Taylor: I think it'll be washed away, but in the meantime there's still a lot of value in making very good handcrafted evals. And I hate the word taste, but I think there is still taste involved, at least to begin with. And a lot of these things are quite codependent, because to make a good rubric for a deep research task, you probably need something that has the ability to do deep research, right? If we were to ask what makes a good literature review on RL right now, the answer probably wouldn't be in the weights of a language model. It would have to go out and search for things.

Right? So, yeah, in the long term

Nathan Lambert: You can tell it. You probably need to use search to answer this question. Yeah.

Ross Taylor: Yeah. If it hasn't done its search, it's probably doing the wrong thing. So, yeah, in the long term it gets washed out, because there's nothing a neural network can't eventually do better than a human. But in the short term, there are still a lot of nooks and crannies a model wouldn't do very well on. So

Nathan Lambert: Can you create a generative reward model by training on a bunch of rubric-ized data? Probably. So you like

Ross Taylor: Yeah. I think verification is model

Nathan Lambert: is gonna go away. The

Ross Taylor: is also something that benefits from thinking time. I think most people are aware of this now, but the question is how you actually execute it. Right? So I think a generative reward model for something like math and code, where it's a one or a zero that it's trying to figure out by thinking, is less interesting than, okay

How do I go about answering this question from first principles? Right? In general, the simplest way I think about it is: if you're moving to a world with long agentic traces, your quote-unquote reward model just needs to answer a simple question, which is, is the agent making progress towards its goal? Right? But that's a very deep question.

So if it's a Pokémon eval, maybe it uses its knowledge of Pokémon to figure out, oh, the agent in this trajectory has gotten caught in a loop or something. Right? And it should be going this way towards, I don't know, Lavender Town instead of that way. Right?

I think that benefits from some thinking time, but the devil's in the details, because if you're not careful, you're just going to spend an inordinate amount of compute trying to get a reward. So
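A rough sketch of the "is the agent making progress?" verifier idea, assuming a hypothetical `generative_judge` callable that is allowed to think before emitting a final score line; the compute caveat above applies to every call.

```python
def progress_reward(goal: str, trajectory: str, generative_judge) -> float:
    """Hypothetical generative-reward-model call for a long agentic trace.

    `generative_judge` is assumed to be a reasoning model whose reply ends
    with a parsable score; we only keep that final number as the reward.
    """
    prompt = (
        f"Goal: {goal}\n"
        f"Trajectory so far:\n{trajectory}\n\n"
        "Think step by step about whether the agent is making progress "
        "toward its goal (for example, is it stuck in a loop?), then end "
        "with a line 'SCORE: x' where x is between 0 and 1."
    )
    reply = generative_judge(prompt)
    for line in reversed(reply.splitlines()):
        if line.startswith("SCORE:"):
            return float(line.split(":", 1)[1])
    return 0.0  # conservative default if the judge's output is malformed
```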

Nathan Lambert: I do think there's going to be a lot more that we learn there. It feels obviously salient. I describe it as: verification kind of changes the slope of inference-time scaling. And that's really, really valuable if you're spending a lot on inference, but we don't really know how to do this. Parallel compute is another factor that kind of changes the shape of that curve.

I guess it's all really a slope of a scaling law, or an offset or something, but it's hard to say things that are particularly true in terms of what we're hearing, things you could tell somebody and say, yeah, that's probably what they're doing, other than this rubric stuff. It's just getting RL to be able to be pointed at more problems. It's not that surprising.

Ross Taylor: Yeah. No, I think rubric mania is in full force right now. But the longer-term question, which has been raised in several places, is: what happens when verification becomes fundamentally harder? Right?

So I'm quite interested in the scientific discovery side, but if it's something like biology, you actually need to do a physical experiment in order to verify. It's not something where you can just easily run things. And if you want to simulate the underlying system, you'll be bottlenecked by the quality of the simulation, and it turns out to be quite hard to simulate some physical processes. Actually, the other point I'd make is that in ML, I think people overvalue the power of thinking in something like science.

They think of an Einstein or whatever, and they think a lot less about what the data-generating mechanism is, what the instrument is. Right? There's no Kepler without the telescope. There's no progress in biology without X-ray crystallography. There are maybe no new theories on dark matter without better telescopes, right?

And I know it sounds like a weird thing to say in the context of RL, but if you're thinking about the very hard things to solve in the real world, you're just going to be bottlenecked by, oh, I actually need to build a better instrument to get data. So that sounds like a digression, but I'm just saying that very long term, you're going to hit those bottlenecks for verification. In the short term, we can still solve very interesting things like the Riemann hypothesis and stuff, hopefully, but that will probably take quite a while as well. So

Nathan Lambert: Yeah. I don't have anything particularly eloquent to say on the discovery point. I think the things that help people train language models right now will hold them back a bit later. I guess what's going to happen is this RL will be done in training and then you kind of punt it off to the rest of post-training.

So I think models need to be able to get really weird, but not weird in a way where they're just numerically lost. I've been reading a lot of reasoning traces these days, and the Qwen and DeepSeek reasoning traces really just seem numerically lost for a while, and then they pop out and get the answer right. They'll be doing the "wait, wait, wait" thing, but it'll be half English, half Chinese, and then it'll just end up getting the answer right.

And I'm like, I don't know how that happened, but that does not feel like a mechanism for discovery. There's some kind of fundamental research in making this reasoning process a bit more real in order to get there.

Ross Taylor: My other bear case against reasoning models, and this is mainly a devil's advocate point because I still fundamentally believe in them, is: since World War Two, there are a lot more scientific workers in the world.

Right? A lot more scientists. But would you say there's more progress? If anything, it feels like a lot of science has slowed. Right?

Is there more progress in fundamental physics now, or was there in 1945? Maybe that's just because the low-hanging fruit is gone in these kinds of fields. But then that's also a bear case that the bottleneck in a lot of places is not raw intelligence. It's actually: maybe we need to increase the speed of physical processes, or be better at building instruments for measuring, or maybe we need more funding from the government to build a bigger particle collider. Right?

I mean, I'm exaggerating, because I think AGI mostly means regular activities, automating law and finance and these kinds of things. And I think that's a lot easier to do. But I push back on this mindset of, okay, we've solved reasoning, so now superintelligence is going to come next year or whatever. That's, from what I can see... yeah.

Nathan Lambert: I'm very bullish on AI being used and bearish on whatever superintelligence tales, like that we're just too compute-constrained for some takeoff. I think AI is going to be very good for financialization and digitalization and seamlessly globalizing the internet and making information transfer and acquisition effectively free

Ross Taylor: Yeah.

Nathan Lambert: Which is really good. And I think, historically, the US is actually very well positioned to capture this by just making products that run on top of cheap AI models. Yeah. There's a lot to unfold to get there.

Ross Taylor: Yep. Yep.

Nathan Lambert: I wanted to ask you what AI you actually use. I don't know if I've ever asked you, but it's normally revealing.

Ross Taylor: Okay. So what we're using right now: for base models, we're doing experiments mostly on Qwen, Qwen 3, but also still some on Qwen 2 just because we know the quirks of that model a bit more. A lot of people do that. Yeah. Then we do some distillation experiments, and we're mostly still using DeepSeek R1.

But we did use Kimi recently. In a weird way, for the benchmarks we were looking at, we didn't see massive gains, which is a bit unusual, but that's the kind of stack we're using. Then from a personal productivity perspective, Claude Code is very, very good. My main worry with Claude Code, and I think there's a paper on this, is that people confuse agents making you more productive with agents preventing you from exerting mental effort. So sometimes I'll have a day with Claude Code where I use very little mental effort and it feels amazing, but I'm pretty sure I've done less work.

And I think that will change because, obviously, the models get better, but I'm trying to teach myself to be a bit careful because sometimes I need to

Nathan Lambert: It does seem like that's the equilibrium. I'm happy with it. It's like, I don't want to have to grind out some plotting code.

Ross Taylor: I'm just gonna let it

Nathan Lambert: I'm just gonna watch some sports highlights and let it do it for me. That's fine.

Ross Taylor: Yeah. But in general, I mean, there's lots of positive feedback on Claude Code, and it's a very impressive product for me.

Nathan Lambert: What is the niche of your use case, or is it a bunch of things? Do you have something you think you could, yeah, endorse?

Ross Taylor: Wow. I can't get

Nathan Lambert: Do you do it in code tasks? Like, are you using it in your startup's code base?

Ross Taylor: It tends to be better with brand-new code bases. Right? I use it mostly for tasks that are quite horizontally scalable. So I'll have some basic specification where I'll provide it some example code, like, okay, here's what a good implementation looks like, but I need this done. Sorry, I'm being very vague because I don't want to talk about specifics, but

Nathan Lambert: Yeah.

Ross Taylor: It tends to be better for that. Where it becomes really bad, and it's obvious to say, is if the file size becomes too long; then it begins to struggle and gets into these weird line-search kinds of modes. So there's a bit of work where you have to structure the code base for it to be efficient. But in general, it's quite helpful.

Nathan Lambert: It's such a success that pretty much everybody who tries it, at least for small code projects, is like, yeah, it works. It's

Ross Taylor: like Yeah.

Nathan Lambert: It's almost the first time since ChatGPT that there's been that good of a reaction.

Ross Taylor: Right? Yeah. Because I think

Nathan Lambert: Is it like the GPT-3.5 level? Like, Claude 4 is like GPT-3.5, the original ChatGPT, and then a couple of iterations from now is going to be incredible.

Ross Taylor: Yeah. I guess the thing to think about is that, obviously, the people who really appreciate Claude Code are developers. Right? But it doesn't have the mass appeal of ChatGPT, which could, I don't know, generate poetry or whatever, which was the killer use case at the time. It sounds crazy now.

Nathan Lambert: But people will pay for Claude Code. People won't pay for ChatGPT.

Ross Taylor: Exactly. Right? So it's probably a lot better business model. But yeah, that's a good question. I would say it's probably one of the most impactful products since ChatGPT, but I wouldn't call it a ChatGPT moment because it hasn't got

Nathan Lambert: Yeah.

Ross Taylor: Mainstream appeal yet. And the question is, what does that agent look like? I'm still shocked that Apple hasn't done anything yet, because for me that would be the killer thing. We'll see if they get their shit together. But yeah, I'd imagine it'd be some kind of on-device agentic model would be my guess.

But, yeah. We'll see.

Nathan Lambert: Yeah. That's fun. Did you also want to mention AlphaEvolve? Oh yeah, I've been so burnt by Google's hypey projects, like their chip design and stuff.

And I know RL, I mean, this is the AlphaGo story: if you have a really high-performance simulator that's well matched to a task and you can scale RL, especially if you scale to many actors in parallel and can get a lot of samples, it tends to create very high-quality performance. So RL is somewhat repeatable in that regard. It might not work in every domain. Maybe my last interview was with Eugene Vinitsky, one of my friends from Berkeley, and they were at Apple and did this really parallel RL for a self-driving simulator, which was really awesome.

And I guess AlphaEvolve is somewhat removed from that, but is it actually extracting from the same vein of simulators?

Ross Taylor: Yeah, I think AlphaEvolve is very cool. In my mind, it's very interesting because it feels like going full circle: in the nineties, the hot things, well, the cool things which didn't quite work, were genetic algorithms and then neural networks. Right? And it feels like we often see a new lease of life for older algorithms once other components get in place. So in the case of AlphaEvolve, you're exploiting the very strong knowledge of a neural network, but then you also have this almost neurosymbolic element.

Don't read too much into that, Gary Marcus, but it's in the form of a database where you store past programs. And just having that kind of prior in the prompt is a very good way to exploit the internal creativity of a language model as opposed to

Nathan Lambert: Like, what does AlphaEvolve actually do? I think a lot of people just won't actually know what it is doing. I don't think I have a good knowledge of it, which is why I'd love

Ross Taylor: Say it's a kernel optimization task, so you're making a good kernel for, I don't know, some common ML architectures. You start with a reference implementation, and then in essence, it's a bit like in-context learning: you take that and say, okay, propose a change, then you benchmark it and get a score. And then you have a database where you store that program and its score.

And then when you sample a new round, you have an algorithm, you can look it up, it's based on island-based algorithms, where you sample roughly in proportion to score, but you also want to explore a bit. And that's your new prior. So you're successively iterating and evolving a program.

Nathan Lambert: And this is just handed off to the language model to

Ross Taylor: And you just do it in parallel. Yeah. So

Nathan Lambert: Wait, what is the language model actually inferencing? Is it inferencing new programs?

Ross Taylor: Yes, which then need to be executed. So imagine you're constructing your prompt. Right?

So you fetch a past implementation from your database, it goes in, probably with the score as well, saying, you know, the implementation above got this result, please propose a new one. I'm simplifying, but this is the essence of it. Then it writes a new program, gets a new score, and that goes back into the database.

So basically, anything you can neatly pose as a kind of optimization task tends to work very well. On the debate of AlphaEvolve versus RL, first of all, I think they can be complementary. But
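For readers who want the loop Ross describes in code form, here is a heavily compressed sketch; `llm_propose` and `benchmark` are hypothetical stand-ins, and the island-based sampling is simplified to score-weighted sampling.

```python
import math
import random

def evolve(reference_code: str, llm_propose, benchmark,
           rounds: int = 100, temperature: float = 1.0):
    """Very compressed sketch of an AlphaEvolve-style loop.

    `llm_propose(prompt)` returns a new program as text and `benchmark(code)`
    returns a score; both are hypothetical stand-ins. The database of
    (program, score) pairs is the evolving prior fed back into the prompt.
    """
    database = [(reference_code, benchmark(reference_code))]
    for _ in range(rounds):
        # Sample a parent roughly in proportion to score, with some exploration
        # (a stand-in for the island-based sampling in the real system).
        weights = [math.exp(score / temperature) for _, score in database]
        parent, parent_score = random.choices(database, weights=weights, k=1)[0]

        prompt = (
            f"The implementation below scored {parent_score:.3f} on the benchmark.\n"
            f"{parent}\n"
            "Propose an improved version."
        )
        child = llm_propose(prompt)
        database.append((child, benchmark(child)))

    return max(database, key=lambda item: item[1])
```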

Nathan Lambert: It should be said the language model is trained with RL, I bet.

Ross Taylor: Yeah, that too. An interesting thing, by the way: the bulk of the AlphaEvolve approach wasn't the strongest Gemini model. It was actually a weaker model with faster inference. So that's another interesting point, which is kind of anti model-scaling-pilled, if that... yeah.

There's a nice balance to be found there. But in general, I actually see it as a broader trend of how you use compute, whether parallel or sequential. I would say the AlphaEvolve approach is highly parallel, but they're not going fully sequential yet. Right? But you can actually use both. Whereas with the RL approach, you're usually starting from scratch.

Right? But then you could also think of ways you might want to exploit good priors in the context. I mean, KernelBench sort of does that anyway. It just doesn't evolve the reference implementation like AlphaEvolve does.

So I think it's definitely something to watch. I think AlphaEvolve is underhyped, but I think the way it will end up

Nathan Lambert: It seems like a sign of things to come, that you can figure out parallel compute in the right way. It might not be that the biggest model benefits the most from parallel compute.

Ross Taylor: Yeah.

Nathan Lambert: I mean, there are a lot of ways you could think about this, like you would just need more guesses, where the guesses are a hundred times cheaper and half as good. It's like

Ross Taylor: Yeah. Maybe this is a bullshitty philosophical point, but over the past 5,000 years, humans have made a lot of progress, yet their brains fundamentally haven't changed.

Right? But what makes you smart is that you kind of follow a natural curriculum each time. The invention you have now always needs some previous invention beforehand. Right?

So in your RL context, would you rather start from scratch each time, or would you use the best thing you have, successively iterate on it, and stand on something's shoulders? Right? So I think that's definitely something to watch in the RL space: instead of trying to almost AlphaZero things from scratch, how do you maintain the existing implementations and iterate upon those? And that's also a tension if you're developing language models, to your point about Claude Code. Right?

You can imagine having an agentic model that is very good at starting from scratch, but you could also have a model that's very good at dealing with an existing code base. And the question is, which is more valuable? The answer is both. But depending on how you actually use those models, you might end up preferring one in a different way. Right?

So I'm just trying to put it into a much bigger context than just the AlphaEvolve algorithm, but it plays into those different arguments.

Nathan Lambert: Yeah. That's fun. There are going to be a lot more things like AlphaEvolve. If a language model can do that in one domain, it mostly just takes people with expertise in their niche to do the muddling and fix things, and more will fall out. It is very remarkable that you can run a zeroth-order optimizer, like a genetic algorithm, just on prompts to language models and actually get anything out. That is such a major win for language models being some fundamental unit of compute.

It's really hard

Ross Taylor: Yeah. Absolutely. And it speaks to creativity. Right? Because, yeah.

Because the meme is, oh, LLMs can't be creative. I'm like, at a fundamental level, the softmax is quite an expressive operation. You'll get creativity eventually. It's just a question of whether you can pick it out from the stuff you sample. Right?

So yeah, I think it's also proof of creativity. You found these new implementations in AlphaEvolve, and probably lots of other papers to come, which humans haven't made. Right? So, yeah.

Nathan Lambert: I would also guess there are people doing stuff like that who don't publish it. They've taken different models and figured out how to hill climb in their domain by setting up these weird loops. These strange loops. I think these are good things to end on.

I mean, I'm kind of fading, so I think it's good. Thanks for coming back. I'm doing a trip to London at some point. I don't think we've ever met in person, but that'll happen at some point.

Ross Taylor: Awesome. Good

Nathan Lambert: to see you.

Ross Taylor: Yeah. Good to see you, Nathan. Yeah. I'll see you in a bit.
