<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Interconnects AI]]></title><description><![CDATA[The cutting edge of AI, from inside the frontier AI labs, minus the hype. The border between high-level and technical thinking. Read by leading engineers, researchers, and investors.]]></description><link>https://www.interconnects.ai</link><image><url>https://substackcdn.com/image/fetch/$s_!djof!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc52e8097-8f3d-4f7e-808b-2f4ad37f3b52_720x720.png</url><title>Interconnects AI</title><link>https://www.interconnects.ai</link></image><generator>Substack</generator><lastBuildDate>Wed, 29 Apr 2026 16:46:36 GMT</lastBuildDate><atom:link href="https://www.interconnects.ai/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Interconnects AI, LLC]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[mail@interconnects.ai]]></webMaster><itunes:owner><itunes:email><![CDATA[mail@interconnects.ai]]></itunes:email><itunes:name><![CDATA[Nathan Lambert]]></itunes:name></itunes:owner><itunes:author><![CDATA[Nathan Lambert]]></itunes:author><googleplay:owner><![CDATA[mail@interconnects.ai]]></googleplay:owner><googleplay:email><![CDATA[mail@interconnects.ai]]></googleplay:email><googleplay:author><![CDATA[Nathan Lambert]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Reading today's open-closed performance gap]]></title><description><![CDATA[The complex factors that determine the single evaluation number so many focus on. 
Plus, how this changes in the future.]]></description><link>https://www.interconnects.ai/p/reading-todays-open-closed-performance</link><guid isPermaLink="false">https://www.interconnects.ai/p/reading-todays-open-closed-performance</guid><dc:creator><![CDATA[Nathan Lambert]]></dc:creator><pubDate>Mon, 20 Apr 2026 18:25:02 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/707c69ea-fa59-4eba-8034-25b0af9b5443_3182x1790.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>It&#8217;s a clear, current equilibrium that open models will be in <a href="https://www.interconnects.ai/p/open-models-in-perpetual-catch-up">perpetual catch-up of closed models</a>, but viewing this gap as a single number, a &#8220;distance&#8221;, covers up a nuanced and crucial dynamic: which capabilities the models are actually covering. The most popular benchmark for commenting on this gap is the <a href="https://artificialanalysis.ai/evaluations/artificial-analysis-intelligence-index">Artificial Analysis Intelligence Index</a> &#8212; a composite benchmark of ~10 sub-evals that they maintain over time to capture the &#8220;frontier&#8221; of current language model capabilities. </p><p>In particular, I spend a lot of time understanding how the dynamics that <em>feed into</em> that index are obscured by the natural tendency to reduce performance and trends to one number. Examples include:</p><ul><li><p>How benchmarks evolve over time, becoming more or less correlated with how people actually use models,</p></li><li><p>How different models&#8217; real-world performance relates to their benchmark rankings, and</p></li><li><p>How training regimes evolve over time to move said benchmarks.</p></li></ul><p>Agentic benchmarks are in a decent place, but benchmarks are no longer as trusted as a correlate of real-world performance. 
A key example of this gray area is Gemini 3&#8217;s incredible benchmarks and remarkable irrelevance to where AI tools are currently being tested and deployed (agents). These trends point to obvious and lasting flaws in our measurements.</p><p>At the root of this dynamic &#8212; the dance of correlating model real-world performance and benchmark scores &#8212; is the constant shift of the industry. As all the models, both open and closed, evolve over time, the topics of focus for benchmarking shift about every 12 to 18 months. All of the domains of interest have very different training data associated with them, especially in post-training. The longer a single paradigm goes on, the better the industry gets at measuring performance. In a new era of rapid post-training improvements, I&#8217;m at a relative minimum in my personal confidence in benchmarks.</p><h3>Task evolution and LLM paradigms</h3><p>Right after ChatGPT, the focus was a mix of chat, math, and simple code. Instruction tuning and RLHF dominated. Chat capabilities saturated and faded quickly, then mathematics became less focal. Through 2025 and to today, especially once reasoning models became the default, the focus shifted to more complex coding and other simpler agentic tasks. We&#8217;re at the tail end of this first era. 
Recent training recipes are all dominated by reinforcement learning with verifiable rewards (RLVR), but the domains it is applied in have shifted dramatically from basic question-answer checking to complex environments.</p><p>What we&#8217;re seeing is that the closed, frontier labs are investing astounding sums of money in mastering these current foci &#8212; i.e. code, terminal tasks, etc. &#8212; while starting to push into more diverse knowledge work tasks. These newer tasks encompass specialized domains, such as accounting, law, healthcare, etc. They are still agentic, but require more expertise and often integrations with existing software or domain-specific tools.</p><p>We have very limited evidence on the true balance of capabilities in these newer domains, but these are the areas I&#8217;m focusing on when I say open models will struggle to keep up. The problem is that evaluating <em>complex</em> language model workflows is itself a challenging research problem. </p><p>The tasks are getting harder and the data needed to hillclimb on them is getting more private (relative to code, swaths of which are public on GitHub). Leading open model labs are helped by dynamics happening in the data industry that are economically similar to building chip fabs. The few leading labs in the U.S. pay astronomical sums to buy new environments and datasets, then the fast-following labs (often in China) buy these later at a steep discount. </p><p>This is a key missed point &#8212; the levers non-frontier labs pull to keep up constantly shift over time. A focus on distillation as the key lever of Chinese models&#8217; progress reflects a blind spot about the importance of RL environments to current training regimes. If an environment can be built either for a single evaluation in the Artificial Analysis Index or to mirror one, the Chinese labs will, for now, be able to keep up. 
</p><h3>Economic pressure to reinvent &#8220;the frontier&#8221;</h3><p>The question worth dwelling on is: How crucial is the current set of tasks (again, coding and terminal tasks), where the likes of OpenAI and Anthropic have a massive business-adoption advantage over leading open-weight models (and even Google), to maintaining revenue numbers? In order to maintain these record growth numbers and trajectories, there needs to remain a meaningful edge in performance. Many companies would love to reduce their token expenditure if they can swap in a far cheaper open-model equivalent. </p><p>If agentic coding abilities saturate and the &#8220;frontier&#8221; of AI performance moves elsewhere, a large amount of the enterprise revenue could be reliant on well-formed customer relationships, inertia, and better product development, rather than the models being leaps and bounds better.</p><p>This precarious position is what I describe as the frontier labs needing to constantly reinvent themselves, and the field&#8217;s prospects, in order to monetize the vast buildout of AI infrastructure. 
I still tend to fall on the side that the buildout will be worth it, and that Anthropic and OpenAI will be astronomically profitable businesses, so I take it on faith both that they will keep unlocking compelling, new, valuable use-cases for the models, and that the benchmarks the open models are closing in on are <em>not a complete signal</em>. </p><p>I operate under a sort of presumption that the leading open models from China are focused <em>slightly</em> more on benchmarks than the leading closed labs in the U.S. They&#8217;re incentivized to do so &#8212; they want to present the image of constantly being on the heels of the best closed models. Saying the Chinese labs are only in this narrative because they&#8217;re overfitting to benchmarks would be incredibly naive and incorrect. They&#8217;re genuinely strong models, and these dynamics of overselling and real innovation are a fine balance.</p><p>There are a few out-of-distribution benchmarks where open-weight models are very far behind, such as <a href="https://htihle.github.io/weirdml.html">WeirdML</a> or <a href="https://epoch.ai/benchmarks/arc-agi-2/">ARC AGI 2</a>, but there are countless random benchmarks that show these open models as being unexpectedly strong. When you use the models, you can pick up on this lack of robustness (e.g. in long-context capabilities, and needing to reset your agent context more often than with Claude/Codex), but this is not a category error; open and closed models are not fundamentally different classes. They&#8217;re far closer than many would&#8217;ve expected.</p><h3>How long can open models keep up?</h3>
      <p>
          <a href="https://www.interconnects.ai/p/reading-todays-open-closed-performance">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[My bets on open models, mid-2026]]></title><description><![CDATA[What I expect to come next and why, focused on the open-closed gap.]]></description><link>https://www.interconnects.ai/p/my-bets-on-open-models-mid-2026</link><guid isPermaLink="false">https://www.interconnects.ai/p/my-bets-on-open-models-mid-2026</guid><dc:creator><![CDATA[Nathan Lambert]]></dc:creator><pubDate>Wed, 15 Apr 2026 18:20:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/8be08b29-d70a-43f3-8422-6b952816ddab_3182x1790.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>We&#8217;re living through the period of time when we&#8217;ll learn if open models can keep up with closed labs. The obvious answer is that no, they won&#8217;t. This answer is a form of saying they won&#8217;t keep up in <em>every area</em>. This framing closes off a popular prediction where the open models completely <em>catch up</em>, as in all models saturate and open and closed models only become increasingly similar. In living through this, it&#8217;s evidently very unclear when the longer-term stable balance of capabilities will solidify. </p><p>This is a very complex dynamic, where the core point we monitor is a <a href="https://www.interconnects.ai/p/open-models-in-perpetual-catch-up">capability gap between models</a>. At the same time, this gap is intertwined with evolving dynamics in the funding of open models, who builds open models, how techniques like distillation that enable fast-following translate through new application domains, potential regulation hampering the open-source AI ecosystem, and of course who actually uses open models. </p><p>The capabilities gap is one signal in a complex sea of forces, pushing supply and demand into different shapes. 
In many cases the demand &#8212; obviously, tons of individuals, organizations, and sovereigns want, or need, open models &#8212; is largely separated from supply. Supply is fully dictated by economics. The question of &#8220;which business strategies support releasing open models&#8221; is still unsettled.</p><p>With this complexity, I wanted to distill my key beliefs down into a clear list. These are downstream of 10+ pieces I&#8217;ve written or recorded on open models this spring (which are linked throughout).</p><ol><li><p><strong>It&#8217;s surprising that the top closed models did </strong><em><strong>not</strong></em><strong> show a growing capability margin over open models</strong>, based on compute differences for training and research, especially in the second half of 2025 and through today.</p></li><li><p><strong>Open model labs are technically very strong</strong> at keeping pace on well-established benchmarks. This will continue and reflects a balance of abundant talent and sufficient computing power. 
</p></li><li><p><strong>Chinese open-weight labs focus </strong><em><strong>slightly</strong></em><strong> more on benchmark scores</strong> than comparable closed labs in the U.S. <a href="https://www.interconnects.ai/p/how-much-does-distillation-really">Distillation</a> helps the Chinese LLM companies do so, but it&#8217;s not a panacea. Changes in the distillation dynamic (e.g. regulation) will not be a determining factor in the balance of capabilities. This increased focus is a natural evolution of their incentives: keeping alive the narrative that they are on the heels of the frontier, which is crucial to fundraising and adoption.</p></li><li><p><strong>To date, closed models tend to be more robust and generally useful than similarly scoring open models</strong>. Closed models have certain hard-to-measure qualities that are not well captured in current or past benchmarks. This will be key to enabling closed models to dominate in markets where an individual user constantly presents new challenges, i.e. supporting knowledge workers as a direct assistant.</p></li><li><p><strong>The open vs. closed model race, as monitored through benchmarks, will largely be a game of economic staying power</strong> and fast-following, until the market structure constricts. I expect Chinese open-weight labs to face funding difficulties first, as soon as later this year. Funding difficulties will show up in different capability trajectories 3-9 months later.</p></li><li><p><strong>The RL-dominated training era has increased the relevance of distribution to real-world use-cases as a key factor in continued capabilities improvements</strong>. These are tasks where users directly use tools like Claude Code or Codex to solve problems in their jobs with agents. 
This is the first clear technical area in which closed labs can dominate open-weight models on capabilities, potentially <a href="https://cursor.com/blog/real-time-rl-for-composer">leveraging online RL directly</a> based on user feedback.</p></li><li><p><strong>Open models will be increasingly adopted in repetitive automation tasks</strong>, as measured by relative share of the API market across the ecosystem. This takes the form of many new AI-native applications, business backend automation, etc. The success of this will <a href="https://www.interconnects.ai/p/the-next-phase-of-open-models">drive more investment in domain-specific, efficient open models</a>.</p></li></ol><p>This is a complex picture, where the long-term trajectory is more of an economics question than an ability one. Many other outlets can paint a far more simplistic narrative that &#8220;<a href="https://www.nytimes.com/2026/04/13/opinion/china-ai-america-chipmakers.html">China will assuredly catch us in AI</a>&#8221; and get more distribution because it is a simple story. The reality is complex. Only real AI revenue begets more investment; eventually, that&#8217;ll be linked to the ability to keep improving models at a rapid rate. 
Economic realities have not yet impacted scaling open models, as a general category.</p><p>This economics-focused angle relates to my positions on the open model ecosystem more broadly.</p><ol start="8"><li><p><strong>Recurring calls to ban certain types of open models will continue to come but are in practice impossible to implement.</strong> Training strong AI models (i.e. near, but not at, the frontier) is a relatively small cost compared to large-scale deployments. E.g. if the U.S. bans open models over a certain compute threshold, another sovereign entity will eventually train and release them publicly, with the models entering the U.S. market with less oversight.</p></li><li><p><strong>The second derivative of influence on open models has shifted, and the U.S. will slowly regain ground in <a href="https://www.interconnects.ai/p/8-plots-that-explain-the-state-of">adoption metrics</a></strong> of open models starting in early 2027 (it takes a long time for China&#8217;s velocity to slow, then flip). 
Examples include Google&#8217;s <a href="https://www.interconnects.ai/p/gemma-4-and-what-makes-an-open-model">Gemma 4</a> (a wild success), <a href="https://www.interconnects.ai/p/why-nvidia-builds-open-models-with">Nvidia&#8217;s Nemotron</a>, and <a href="https://www.interconnects.ai/p/arcee-ai-goes-all-in-on-open-models">Arcee AI</a>.</p></li><li><p>As ever-stronger closed models are built, previewed, and released, there will be more <strong>safety shocks claiming that open-weight versions of the strongest AI models can never be allowed to exist</strong>, similar to reactions to <a href="https://www.interconnects.ai/p/claude-mythos-and-misguided-open">Claude Mythos</a>. These can spur burdensome regulation on open models.</p></li><li><p>With the above, there will also be <strong>increased long-term interest in open models</strong>, as sovereign entities and existing power structures realize the coming, super-powerful AI tools <a href="https://www.interconnects.ai/p/how-anthropic-vs-dow-impacts-open">cannot land in the hands of only one or a few companies</a>. These entities will see open models as a different governance paradigm.</p></li><li><p><strong>New funding structures for open models will emerge</strong>, as many stakeholders realize <a href="https://www.interconnects.ai/p/the-inevitable-need-for-an-open-model">dependencies on single, for-profit companies for access to intelligence are unreliable</a>.</p></li><li><p><strong>Local agents, OpenClaw, and other personal agents represent a large, to-date mostly ignored market for open model usage</strong>. 
It is a sort of dark matter, with pervasive, massive potential for influence on the balance of open-to-closed models.</p></li></ol><p>A single word governs this post and is intentionally repeated &#8212; complex.</p><p>This complex reality has been driving me to think more deeply about how to clearly describe the open model gap, and why I can hold it in my head that I expect American closed labs to clearly draw ahead, despite the fairly unequivocal evidence in support of the capabilities of recent open-weight models. More on the nuance in the open-closed gap in another piece coming soon, so <a href="https://www.interconnects.ai/subscribe">please subscribe</a>!</p><p>Let me know any positions that I missed.</p>]]></content:encoded></item><item><title><![CDATA[What I’ve been building: ATOM Report, post-training course, finishing my book, and ongoing research]]></title><description><![CDATA[What I've been up to!]]></description><link>https://www.interconnects.ai/p/what-ive-been-building-atom-report</link><guid isPermaLink="false">https://www.interconnects.ai/p/what-ive-been-building-atom-report</guid><dc:creator><![CDATA[Nathan Lambert]]></dc:creator><pubDate>Tue, 14 Apr 2026 20:41:12 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Bv0Q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2d8ba64-922d-4000-9d57-12cb5524a238_1200x675.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This post is a roundup of my recent efforts that did not warrant a standalone Interconnects post, why I&#8217;m spending time on them, and what they accomplished.</p><ol><li><p><a href="https://www.interconnects.ai/i/194224428/1-the-atom-report-measuring-the-open-language-model-ecosystem">The ATOM Report: Measuring the Open Language Model Ecosystem</a></p></li><li><p><a href="https://www.interconnects.ai/i/194224428/2-rlhf-book-is-done-and-ready-for-pre-order">RLHF Book is done &amp; ready for 
pre-order!</a></p></li><li><p><a href="https://www.interconnects.ai/i/194224428/3-a-post-training-course-im-making">A post-training course I&#8217;m making</a></p></li><li><p><a href="https://www.interconnects.ai/i/194224428/4-recent-technical-research">Recent technical research</a></p></li></ol><h2>1. The ATOM Report: Measuring the Open Language Model Ecosystem</h2><p><a href="https://arxiv.org/abs/2604.07190">https://arxiv.org/abs/2604.07190</a></p><p>To accompany The ATOM Project <a href="https://atomproject.ai/">memo</a> &#8211; arguably a manifesto &#8211; making the case for investment in open models in the U.S., originally launched in August 2025, we&#8217;ve released an updated technical report with our latest data, analysis, and storytelling within the open language model ecosystem. The ATOM Report is dense with the methods Florian and I use to keep track of the open ecosystem. 
It covers GPT-OSS&#8217;s rise, inference market share, the influence of China&#8217;s mid-tier players like Moonshot, Z.ai, &amp; MiniMax, signs of the U.S.&#8217;s progress on open models, and much more.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JZNn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89b0ff17-1243-46dd-a81e-96c975f20a7b_2582x1992.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JZNn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89b0ff17-1243-46dd-a81e-96c975f20a7b_2582x1992.png 424w, https://substackcdn.com/image/fetch/$s_!JZNn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89b0ff17-1243-46dd-a81e-96c975f20a7b_2582x1992.png 848w, https://substackcdn.com/image/fetch/$s_!JZNn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89b0ff17-1243-46dd-a81e-96c975f20a7b_2582x1992.png 1272w, https://substackcdn.com/image/fetch/$s_!JZNn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89b0ff17-1243-46dd-a81e-96c975f20a7b_2582x1992.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JZNn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89b0ff17-1243-46dd-a81e-96c975f20a7b_2582x1992.png" width="1456" height="1123" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/89b0ff17-1243-46dd-a81e-96c975f20a7b_2582x1992.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1123,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:330823,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.interconnects.ai/i/194224428?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89b0ff17-1243-46dd-a81e-96c975f20a7b_2582x1992.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JZNn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89b0ff17-1243-46dd-a81e-96c975f20a7b_2582x1992.png 424w, https://substackcdn.com/image/fetch/$s_!JZNn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89b0ff17-1243-46dd-a81e-96c975f20a7b_2582x1992.png 848w, https://substackcdn.com/image/fetch/$s_!JZNn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89b0ff17-1243-46dd-a81e-96c975f20a7b_2582x1992.png 1272w, https://substackcdn.com/image/fetch/$s_!JZNn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89b0ff17-1243-46dd-a81e-96c975f20a7b_2582x1992.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>In particular, the paper details our updates to the <a href="https://atomproject.ai/relative-adoption-metric">Relative Adoption Metric (RAM)</a>, which we use to evaluate the adoption of recent models in a time-varying and size-normalized manner. Here&#8217;s a sampling of recent, primarily Chinese, models on the RAM score. The RAM score is designed so that a score &gt;1 indicates a model is, at that point in time, on track to be a top 10 most downloaded model of its size category, ever. 
It reduces a messy landscape to one, easily interpretable number!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TeBR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ef64b7a-04f2-4ed8-9cc4-966b775e9f59_1918x1336.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TeBR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ef64b7a-04f2-4ed8-9cc4-966b775e9f59_1918x1336.png 424w, https://substackcdn.com/image/fetch/$s_!TeBR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ef64b7a-04f2-4ed8-9cc4-966b775e9f59_1918x1336.png 848w, https://substackcdn.com/image/fetch/$s_!TeBR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ef64b7a-04f2-4ed8-9cc4-966b775e9f59_1918x1336.png 1272w, https://substackcdn.com/image/fetch/$s_!TeBR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ef64b7a-04f2-4ed8-9cc4-966b775e9f59_1918x1336.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TeBR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ef64b7a-04f2-4ed8-9cc4-966b775e9f59_1918x1336.png" width="1456" height="1014" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6ef64b7a-04f2-4ed8-9cc4-966b775e9f59_1918x1336.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1014,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:323626,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.interconnects.ai/i/194224428?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ef64b7a-04f2-4ed8-9cc4-966b775e9f59_1918x1336.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TeBR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ef64b7a-04f2-4ed8-9cc4-966b775e9f59_1918x1336.png 424w, https://substackcdn.com/image/fetch/$s_!TeBR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ef64b7a-04f2-4ed8-9cc4-966b775e9f59_1918x1336.png 848w, https://substackcdn.com/image/fetch/$s_!TeBR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ef64b7a-04f2-4ed8-9cc4-966b775e9f59_1918x1336.png 1272w, https://substackcdn.com/image/fetch/$s_!TeBR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ef64b7a-04f2-4ed8-9cc4-966b775e9f59_1918x1336.png 1456w" sizes="100vw"></picture></div></a></figure></div><p>We used the data to also analyze the recent <a href="https://www.interconnects.ai/p/gemma-4-and-what-makes-an-open-model">Gemma 4</a> release, which is showing incredible early adoption numbers. 
We&#8217;ll stay tuned on it!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!u86h!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9e2abfe-443e-48e8-bfda-bb5855dee388_1936x1056.jpeg"><img src="https://substackcdn.com/image/fetch/$s_!u86h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb9e2abfe-443e-48e8-bfda-bb5855dee388_1936x1056.jpeg" width="1456" height="794" class="sizing-normal" alt="Image" title="Image" loading="lazy"></a></figure></div><p>Subscribe to the (infrequent) <a href="https://atomproject.substack.com/">ATOM Project Substack</a> for more updates like this!</p><h2>2. RLHF Book is done &amp; ready for pre-order!</h2><p><a href="http://rlhfbook.com/">http://rlhfbook.com/</a></p><p>The goal of this book was to write the book I wished I had when I was getting started in post-training language models. This project has been on my mind for a long time. I bought the domain rlhfbook.com and started to take it more seriously on May 20th, 2024. Here we are!</p><p>Last week, it was sent to production with the Manning team. This means content edits are done, and it&#8217;ll be sent to print in ~2 months.
In the meantime, I&#8217;m spending my time developing the accompanying code and course (more on that below).</p><p>You can preorder on <a href="https://amzn.to/4cwCDJQ">Amazon</a> or <a href="https://www.manning.com/books/the-rlhf-book">Manning</a> (currently cheaper).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Bv0Q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2d8ba64-922d-4000-9d57-12cb5524a238_1200x675.jpeg"><img src="https://substackcdn.com/image/fetch/$s_!Bv0Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2d8ba64-922d-4000-9d57-12cb5524a238_1200x675.jpeg" width="1200" height="675" class="sizing-normal" alt="Image" title="Image" loading="lazy"></a></figure></div><h2>3. A post-training course I&#8217;m making</h2><p><a href="https://rlhfbook.com/course">https://rlhfbook.com/course</a></p><p>The goal of my book is for it to be the central resource for people looking to transition from beginner to expert in post-training. It&#8217;s not necessarily an entry-level book, but as AI models become stronger, it needs to be a <em>community</em>-building effort as well. The first step I&#8217;ve made to expand the scope from just a book to a complete learning experience is building a lecture series. The lectures will be freely available on YouTube and incorporate community questions &amp; answers (as standalone videos in between lectures).</p><p>You can watch the first batch of videos below, and subscribe on YouTube for future ones.
I&#8217;m going to build on the book platform more this summer, as I develop the book <a href="https://rlhfbook.com/code">codebases</a> and host in-person events.</p><ul><li><p><a href="https://www.youtube.com/watch?v=jQPiH-KB4B0&amp;list=PLL1tdVxB1CpVpEtMHxwuR4uI4Lxjw00_y&amp;index=3">Welcome video &amp; YouTube playlist</a></p></li><li><p><a href="https://youtu.be/o6l6tJQgUg4">RLHF and Post-training Overview | RLHF Book Course, Lecture 1</a></p></li><li><p><a href="https://youtu.be/4gIwiSPmQkU">RLHF Foundations, IFT, Reward Modeling, Rejection Sampling | RLHF Course Lecture 2</a></p></li><li><p><a href="https://youtu.be/K_Sj_-1BUMM">Understanding Policy Gradient Algorithms for RL on LLMs | RLHF Course Lecture 3</a></p></li><li><p><a href="https://youtu.be/i-AIMpZHgeg">Implementing RL Algorithms for LLMs | RLHF Course Lecture 4</a></p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VS0r!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb238c68c-d7f4-4b2b-97fa-9a3cf773e72b_1280x720.png"><img src="https://substackcdn.com/image/fetch/$s_!VS0r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb238c68c-d7f4-4b2b-97fa-9a3cf773e72b_1280x720.png" width="1280" height="720" class="sizing-normal" alt="" loading="lazy"></a></figure></div><h2>4.
Recent technical research</h2><p>Long-time followers of Interconnects know that this blog has its roots in explaining fundamental research in the field. This has immense value in two ways. First, as AI moves incredibly fast, far more people need to be able to parse research to make the right bets on the technology. Research is the only early warning we get of the big changes coming. Second, it helps uplift the careers of my collaborators &#8211; the people I spend my life with! On that note, check out two papers I had the privilege of being part of below.</p><p><a href="https://arxiv.org/abs/2603.16759">https://arxiv.org/abs/2603.16759</a> - <em>TurnWise: The Gap between Single- and Multi-turn Language Model Capabilities</em>, Graf et al. 2026</p><p>This work explores the strengths of various models in multi-turn dialogue settings, how to create training data to improve multi-turn performance, and other quirks in post-training. My interests here have fully shifted to agents, where I see multi-turn interactions as a very important user interface problem &#8212; what information do I show to the user to solve the task as soon as possible without cutting corners?</p><p><a href="https://arxiv.org/abs/2603.11327">https://arxiv.org/abs/2603.11327</a> - <em>Meta-Reinforcement Learning with Self-Reflection for Agentic Search</em>, Xiao et al. 2026</p><p>This paper frames solving hard problems with RLVR as a meta-learning problem, where context from previous attempts should be used to inform future rollouts. It&#8217;s a fairly obvious idea in some ways: most RL for LLMs is still very on-policy, but naive. The models learn from recent trials in their parameters, but not in context. This research feeds into a ton of other recent work on ways that RL can be formulated to solve different forms of continual learning.
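</p><p>As a toy illustration of that learning-in-context idea (a sketch written for this post, not the paper&#8217;s actual algorithm), a rollout loop can append each failed attempt and its verifier feedback to the prompt, so later rollouts condition on earlier attempts without any parameter update. The <code>generate</code> and <code>verify</code> functions here are hypothetical stand-ins for an LLM call and an RLVR-style verifiable reward check.</p>

```python
# Sketch: self-reflection via context, not weights. `generate` and `verify`
# are hypothetical stand-ins for an LLM call and a verifiable reward check.

def generate(prompt: str) -> str:
    # Toy "model": succeeds only after seeing feedback from two failed tries.
    return "correct" if prompt.count("Feedback:") >= 2 else "wrong"

def verify(answer: str) -> bool:
    # Verifiable reward: exact check of the answer.
    return answer == "correct"

def solve_with_reflection(task: str, budget: int = 5):
    """Each failed rollout's answer and feedback are appended to the
    context for the next attempt, so the model 'learns' in context."""
    context = task
    for attempt in range(1, budget + 1):
        answer = generate(context)
        if verify(answer):
            return answer, attempt
        context += f"\nPrevious attempt: {answer}\nFeedback: incorrect, try again."
    return None, budget
```

<p>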
Another great related paper is <em><a href="https://arxiv.org/abs/2601.16175">Learning to Discover at Test Time</a>.</em></p><div><hr></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.interconnects.ai/p/what-ive-been-building-atom-report/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.interconnects.ai/p/what-ive-been-building-atom-report/comments"><span>Leave a comment</span></a></p><p>I&#8217;m off to China (and then hopefully DC) in the next couple of months to learn even more about how the world sees progress in AI. I&#8217;m excited to talk to a broader range of people than I tend to in my focused technical job. Thanks for reading, as always!</p>]]></content:encoded></item><item><title><![CDATA[The inevitable need for an open model consortium]]></title><description><![CDATA[And yes, I hate consortia too.]]></description><link>https://www.interconnects.ai/p/the-inevitable-need-for-an-open-model</link><guid isPermaLink="false">https://www.interconnects.ai/p/the-inevitable-need-for-an-open-model</guid><dc:creator><![CDATA[Nathan Lambert]]></dc:creator><pubDate>Sat, 11 Apr 2026 13:02:06 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/18174b65-ddde-40ad-a82b-55467fecbc10_3182x1790.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Recently, I was talking with <a href="https://cs.stanford.edu/~pliang/">Percy Liang</a>, Stanford professor and lead of the <a href="https://marin.community/">Marin</a> project (another fully-open model lab), and it set in on me that there will eventually be a consortium of companies funding a foundational set of open models used across industry. 
It&#8217;s not clear when this&#8217;ll emerge, and Nemotron (<a href="https://nvidianews.nvidia.com/news/nvidia-launches-nemotron-coalition-of-leading-global-ai-labs-to-advance-open-frontier-models">Coalition</a>) is Nvidia&#8217;s attempt to bankroll and bootstrap this approach within a single wealthy company, but a consortium is the only long-term stable path to well-funded, near-frontier open models.</p><p>In recent months, we&#8217;ve seen a lot of turnover in <a href="https://www.reuters.com/world/asia-pacific/head-alibabas-qwen-ai-division-resigns-2026-03-04/">open</a> <a href="https://www.geekwire.com/2026/allen-institute-for-ai-ceo-ali-farhadi-steps-down-as-nonprofit-navigates-shifting-ai-landscape/">model</a> labs, with high-profile departures at Qwen and Ai2 (<a href="https://x.com/natolambert/status/2037911242820796883">my comment</a>). This shouldn&#8217;t be super surprising to followers of the ecosystem &#8212; it&#8217;s happened before with Meta <a href="https://www.meta.com/superintelligence/?srsltid=AfmBOopu-zIovrbgd9Q-G1StOW3gC8s0mf_iNDqD_2oa3l6qldcNHLXl">shifting its focus away from Llama</a>, and it&#8217;ll only happen more as the cost of trying to keep pace at the frontier of AI only increases. The other leading labs with models available today include Chinese startups such as Moonshot AI, MiniMax, and Z.ai &#8212; all of which look precarious on their ability to fund continued growth in the cost of training or R&amp;D. 
Releasing one&#8217;s strongest models openly today is in active tension with the option of spending focus and resources on AI products that can currently generate meaningful revenue (and profits).</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.interconnects.ai/p/the-inevitable-need-for-an-open-model?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.interconnects.ai/p/the-inevitable-need-for-an-open-model?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><p>We&#8217;re going to see business models emerge around releasing <em>some</em>, or even many, models openly, but these will largely be smaller models that enable a long tail of functionality, rather than models at the absolute frontier. This class of companies releasing many strong, fine-tunable models will include the likes of <a href="https://www.interconnects.ai/p/arcee-ai-goes-all-in-on-open-models">Arcee AI</a>, Thinking Machines, OpenAI, Google with Gemma, and more. The cost and relative advantage of keeping the best models closed in a business environment with many opportunities for revenue are too high. To summarize &#8212; there will be an ever-increasing number of companies releasing models that are good for creating a lively niche of smaller, custom models, but an ever-decreasing number of companies willing to release fully open, near-frontier models. </p><p>This is the core of why I&#8217;m pushing hard for more people to do more research on how these smaller models can complement the best closed agents, the science of finetunability, etc.
See my post below &#8212; it&#8217;s about creating a sustainable open model ecosystem, whether or not the frontier of open keeps pace with closed:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;4003f1f5-81ca-48ab-aa76-2e99a2cd241c&quot;,&quot;caption&quot;:&quot;2025 was the year where a lot of companies started to take open models seriously as a path to influence in the extremely valuable AI ecosystem &#8212; the adoption of a strategy that was massively accelerated downstream of DeepSeek R1&#8217;s breakout success. Most of this is being done as a mission of hope, principle, or generosity.&quot;,&quot;cta&quot;:&quot;Read full story&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;What comes next with open models&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:10472909,&quot;name&quot;:&quot;Nathan Lambert&quot;,&quot;bio&quot;:&quot;ML researcher making sense of AI research, products, and the uncertain technological future. PhD from Berkeley AI. Experience at Meta, DeepMind, HuggingFace.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!RihO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fedcdfb-e137-4f6a-9089-a46add6c6242_500x500.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2026-03-16T13:00:51.417Z&quot;,&quot;cover_image&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/07ccf41a-ab0e-4cb6-b24b-234ec18c39a7_3182x1790.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.interconnects.ai/p/the-next-phase-of-open-models&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:null,&quot;id&quot;:190338833,&quot;type&quot;:&quot;newsletter&quot;,&quot;reaction_count&quot;:81,&quot;comment_count&quot;:14,&quot;publication_id&quot;:48206,&quot;publication_name&quot;:&quot;Interconnects AI&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!djof!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc52e8097-8f3d-4f7e-808b-2f4ad37f3b52_720x720.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div><p>It&#8217;ll take years for this equilibrium to become more obvious, seen through the lens of more open model families coming and going. This year, it seems likely we&#8217;ll see Nvidia&#8217;s Nemotron reach new heights, Reflection AI challenge some of the Chinese models with a strong, large MoE, maybe Meta release a new open-weight model, and so on. True pressure to change strategy will only come when the capital environment punishes the less efficient spend on resources (e.g. giving away the competitive advantage of having an in-house model). This pressure will likely hit Chinese startups training these models first.
</p><p>All of Moonshot AI, MiniMax, and Zhipu AI will show signs of financial challenge in the coming years if they retain their strategy, on top of their models falling further behind the best open models in terms of generality. This is inevitable pressure to evolve open models toward areas that are profitable and complementary to the frontier of AI.</p><p>Nvidia, which is best positioned to support the open ecosystem in the near term because doing so supports its core GPU business, could face many pressures to pull back its open model efforts. It could:</p><ul><li><p>Realize it&#8217;s too competitive with its biggest customers as it succeeds too much with Nemotron, </p></li><li><p>Fall to competition on its core business and lose the free cash flow buffer needed to fund this (e.g. it&#8217;s 2031 and OpenAI, Anthropic, Google, and the other frontier labs are worth so much they build their own chips).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> </p></li><li><p>Start succeeding beyond its initial goals and keep the chips to build ASI itself, as a closed-weight model. </p></li></ul><p>The pressures for new funding mechanisms for open models are based on the assumption of continued, substantive progress on the capabilities of frontier models. Mechanisms such as <a href="https://www.interconnects.ai/p/lossy-self-improvement">self-improvement</a> and scaling all stages of the training pipeline are underway. This progress of capabilities will only increase the potential profit in selling models as and in products, not giving them away.
The scale of investment required has already begun to push non-profits away from the game of making truly frontier-scale models.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> Capitalism is designed to make companies ruthless and chase down leads on profitability, not donate technology as charity.</p><p>As the economic environment shifts companies away from releasing the strongest models openly, more companies that rely on these models will look for a way of securing model access into the future. This is going to be compounded by a growing group of companies that come to rely on open-weight models for their workflows. </p><p>These points loop back into how model training is getting more expensive, so while the desire to have the models will go up, the ability to procure them will go down for many players. There are x-factors that could multiply the demand for institutions to ensure the existence of open models, such as the best frontier models not even being available via API (such as if <a href="https://www.interconnects.ai/p/claude-mythos-and-misguided-open">Claude Mythos</a> never goes general access).</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.interconnects.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.interconnects.ai/subscribe?"><span>Subscribe now</span></a></p><p>As training relevant models shifts to costing billions of dollars, rather than millions, few companies will be able to afford it. Many companies will bite at paying 1/10th of the cost to train a frontier model, or, if the consortium works, 1/50th. The upside for companies will be some mechanism to steer development (e.g. model sizes) or getting early access to develop internal and open-source tooling for the model.
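</p><p>To make the cost-sharing arithmetic concrete, here is a back-of-envelope sketch; the $2B training cost is a hypothetical figure chosen for illustration, not a quoted number.</p>

```python
# Hypothetical back-of-envelope math for consortium cost sharing.
# The $2B frontier training cost is an illustrative assumption.
FRONTIER_COST = 2_000_000_000  # dollars, hypothetical

def member_share(n_members: int) -> int:
    """Even split of one frontier-scale training run across members."""
    return FRONTIER_COST // n_members

ten_member_share = member_share(10)    # paying 1/10th of the cost each
fifty_member_share = member_share(50)  # paying 1/50th of the cost each
```

<p>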
</p><p>It is in my nature to, by default, say this idea will fail, as training models is inherently a complex and high-focus endeavor, one that requires integration of every part of the stack and focusing specifically on your own vision and needs, rather than trying to serve every possible user. Still, eventually the need for open intelligence &#8212; and the economic pressure to build it &#8212; will make a model consortium inevitable.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>There&#8217;s a meaningful chance, in my estimates, that Anthropic, OpenAI, and Google are the most valuable companies in the world in the 2030s by owning frontier intelligence.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Truly open models matter for safety research and long-term innovation, which suits both the narratives of AI risk and AI optimism. We need them for both. Mech interp is one of the heaviest users of Olmo models. <s>If we don&#8217;t find what&#8217;s after the transformer, there may not be enough benefit to AI models. 
</s> (edit: I had published that as a half-baked thought; it&#8217;s about how fully-open models operate differently in the ecosystem) All of these are largely orthogonal to the point of the post.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[Claude Mythos and misguided open-weight fearmongering]]></title><description><![CDATA[Another dance around fears of open-source.]]></description><link>https://www.interconnects.ai/p/claude-mythos-and-misguided-open</link><guid isPermaLink="false">https://www.interconnects.ai/p/claude-mythos-and-misguided-open</guid><dc:creator><![CDATA[Nathan Lambert]]></dc:creator><pubDate>Thu, 09 Apr 2026 21:28:39 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/18fbf26c-4d1b-42a4-94e3-369522619514_3182x1790.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>With the announcement of the Claude Mythos model this week and its admittedly very strong stated abilities, especially in cybersecurity, a <a href="https://x.com/tenobrus/status/2041636593040236745">new</a> <a href="https://x.com/stalkermustang/status/2041689727540023524">wave</a> of <a href="https://x.com/mckaywrigley/status/2041651309758275609">anti</a>-open-weight AI model narratives surged. The TL;DR of the argument is that our digital infrastructure will not be ready in time for an open-weight version of this model, which will allow attacks to be conducted by numerous parties.</p><p>The backlash against open models in the wake of the Mythos news conflates too many general unknowns into a simple, broad policy recommendation that could actually further weaken cybersecurity readiness.</p><p>We&#8217;ve been here before &#8211; open-weight models were discussed as being extremely dangerous when OpenAI withheld GPT-2 weights in 2019, and when OpenAI released GPT-4 in 2023. Both of these waves came and went. 
The core mistake being made is the conflation of two issues: 1) treating the open-closed model gap as static in time, and 2) linking the viability of open-weight models in general to specific issues.</p><p>I&#8217;ve written at length recently on how I think that the best, frontier-level open weight models <a href="https://www.interconnects.ai/p/open-models-in-perpetual-catch-up">are going to fall behind the best closed models in </a><em><a href="https://www.interconnects.ai/p/open-models-in-perpetual-catch-up">overall</a></em><a href="https://www.interconnects.ai/p/open-models-in-perpetual-catch-up"> capabilities</a> in the near future. I&#8217;ve also written about <a href="https://www.interconnects.ai/p/the-next-phase-of-open-models">how the open-weight ecosystem needs to adapt</a> to accept this reality. This is one of the times for the AI industry where I will repeat that it&#8217;s a total blessing to have the 6-18 month delay from when a certain capability is available within a closed lab to when it is reproduced in the open. It&#8217;s a good balance of safety and monitoring the frontier of AI systems while allowing a useful open-source ecosystem to exist and thrive.</p><p>The core argument I&#8217;ve focused on in the open-closed model time gap has been about <em>general</em> capabilities &#8211; i.e. for general-purpose, frontier models such as Claude Opus 4.X or GPT Thinking 5.X. The abilities of these closed models to robustly solve and work in diverse situations as agents remain out of scope of the best open-weight models. What the open-weight models have tended to be better at is quickly keeping pace on key benchmarks (which is admittedly helped to some extent, but <a href="https://www.interconnects.ai/p/how-much-does-distillation-really">not necessarily substantially by distillation</a>). 
This discussion is entirely different: it has to do with whether open-weight models can keep pace on the specific skills related to cybersecurity, and when we could expect an open version of this model to be available to the world.</p><p>The case of a Claude Mythos-level open-weight model is admittedly more nuanced to me than the previous few anti-open-weight narratives the community has experienced. Where GPT-4 was about a more hypothetical risk, especially in areas like bio-risk, the clear and present reality of cyber infrastructure being prone to attack is far more tangible. Still, much of this nuance in the moment comes down to not knowing the full details of what the system can actually do (i.e. Mythos), and the state of the environment it would act in (i.e. our digital infrastructure).</p><p>To properly assess this risk, we need to know what it takes to build and deploy a Claude Mythos-scale model. This entails three pieces: 1) training and releasing the weights, 2) the harness that gives the model effective tools it knows how to use, and 3) the inference compute and software.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.interconnects.ai/p/claude-mythos-and-misguided-open?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.interconnects.ai/p/claude-mythos-and-misguided-open?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><p>(<em>Below I make some model size &amp; price estimates to show my thinking; these should not be taken as ground truth.)</em></p><p>Current estimates put the size ranges of leading models like Claude Opus 4.6 or GPT 5.4 at around 3-5T parameters. 
Currently, the <a href="https://huggingface.co/inclusionAI/Ling-2.5-1T">largest open-source models</a>, which have been coming from Chinese labs, are around 1T parameters. Claude Mythos&#8217;s preview pricing is 5X Opus, which could come from a simple multiplicative increase in active parameters (with the same serving system design), far higher inference-time scaling, more complex harnesses that make inference less efficient, lower utilization expectations, and so on. The simplest guess is that it&#8217;s a mix of all of the above, something like 2X bigger in parameters and much less efficient to serve. That&#8217;s a huge model, likely something similar to <a href="https://www.interconnects.ai/p/gpt-45-not-a-frontier-model">GPT 4.5,</a> but actually post-trained well (GPT 4.5 was ahead of its time, infra-wise).</p><p>With size comes the challenge of actually training the model, as bigger models always come with new technical problems that must be solved to unlock the capabilities. For the case of cybersecurity, my guess is that most of the capabilities can be learned by training a model to be superhuman at coding. Unlike some capabilities such as knowledge work, medicine, law, etc., coding can be studied and improved substantially with public data like GitHub. I&#8217;m far more optimistic about open-weight models staying fairly close to the frontier in narrow domains of code execution and processing, but I don&#8217;t understand the full scope of skills needed to be superhuman in cybersecurity understanding. How much expert knowledge and special sauce went into training Claude Mythos? That&#8217;s a substantial source of my error bars on the impact.</p><p>Second, we know nothing about how the model works under the hood. Today, models are complex systems that entail far more than just weights. They require complex tools and infrastructure to run them, of which Claude Code is the one we are most used to. 
Mythos very likely has its own innovations here.</p><p>My estimate for how many GPUs you&#8217;d need to serve a modern, 8T-parameter MoE is something like O(100) H100 GPUs, which cost something like $10K a day (and this may be very slow in terms of tok/s). Heck, the <a href="https://www.nvidia.com/en-us/data-center/gb200-nvl72/">official marketing copy</a> of the Nvidia GB200 NVL72 system is &#8220;Unlocking Real-Time Trillion-Parameter Models&#8221; on the rack. Does Mythos fit on one rack? The point isn&#8217;t to rely on my specific estimate as a policy reference, but to repeat that running leading AI systems is very expensive and not something you can just do on a laptop or through self-service cloud portals.</p><p>There are far fewer actors who can get their hands on these resources, relative to those who can download the model. Of course, there are still many, but it&#8217;s important to flesh out all the details of what it would take to proliferate the capabilities of a Mythos-like model. In summary, tools like Mythos will give the best attackers more powerful tools of the trade, but it won&#8217;t be handing a nuke to every teenager connected to the internet.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.interconnects.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Interconnects AI is a reader-supported publication. 
Consider becoming a subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>Personally, I do acknowledge there&#8217;s a chance that cybersecurity abuse is a red line that makes releasing open-weight text models above a certain capability threshold morally grey. Many people thought this red line would come far earlier, somewhere in between GPT-2 and GPT-4, through the harm axis of mis/disinformation, but that had different bottlenecks. For image generation models, we&#8217;re well past the first red line, which is enabling non-consensual AI deepfakes with readily available open-weight models. We&#8217;re balancing the reality of these fears having come and gone before with a technology that&#8217;s becoming increasingly capable.</p><p>So, my second large source of error bars is &#8220;how bad is it actually&#8221; with respect to the state of cybersecurity. How much can humans clean up in the most important software with months of private access to a model like Claude Mythos? What will never get fixed?</p><p>For example, if we get open-weight models that are close to the capabilities of Claude Mythos, could those be fine-tuned by organizations to <em>harden</em> the security of their tools?</p><p>Currently, it&#8217;s too soon to call this a general reason to stop progress in open models. While Claude Mythos is restricted to so few partners, in some ways having strong open models <em>close</em> to the threshold makes assessing the danger easier. 
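Returning to the serving estimate above, a back-of-envelope sketch shows where the O(100) GPUs and ~$10K/day figures can come from; the parameter count, fp8 weights, and $4/GPU-hour rate are rough assumptions, not measured figures:

```python
import math

# Back-of-envelope GPU count and daily cost just to hold the weights in HBM.
# All inputs are rough assumptions; ignores KV cache, activations, and redundancy.
params = 8e12               # 8T parameters (assumed)
bytes_per_param = 1         # fp8 weights (assumed)
hbm_per_gpu_bytes = 80e9    # H100: 80 GB of HBM
dollars_per_gpu_hour = 4.0  # rough cloud rate (assumed)

gpus = math.ceil(params * bytes_per_param / hbm_per_gpu_bytes)
daily_cost = gpus * dollars_per_gpu_hour * 24

print(f"{gpus} GPUs, ~${daily_cost:,.0f}/day")  # 100 GPUs, ~$9,600/day
```

The exact numbers move with quantization and hourly rates, but the conclusion is stable: serving a model at this scale takes a GPU fleet, not a laptop.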
Having to rely fully on a single private company to determine the security of essential, international infrastructure is not a tenable equilibrium.</p><p>So, in conclusion, I urge people to further study three things:</p><ol><li><p>How do we measure cybersecurity-related capabilities across open and closed models? With this, are open models truly keeping up at a 6-9 month lag, or are they only maintaining performance relevance in other areas of coding?</p></li><li><p>How do we independently measure the true impact of Claude Mythos and Project Glasswing on existing cybersecurity concerns?</p></li><li><p>If it is the case that the models are keeping up and the defensive capabilities of Claude Mythos are weak, how do we better monitor (and if needed, try to regulate) the targeted capabilities of open-weight models in narrow domains?</p></li></ol><p>The goal is to keep fears about open models very specific. Any general ban on open models in a nation will immediately and likely irrevocably remove that nation&#8217;s ability to influence a crucial and amorphous technology. If we stop building the best open models in the U.S., then another country will do this and become the center of the technology. 
There&#8217;s no way to fully kill open models; there are only ways of influencing, understanding, and steering them.</p>]]></content:encoded></item><item><title><![CDATA[Gemma 4 and what makes an open model succeed]]></title><description><![CDATA[Hint: it's not benchmark scores.]]></description><link>https://www.interconnects.ai/p/gemma-4-and-what-makes-an-open-model</link><guid isPermaLink="false">https://www.interconnects.ai/p/gemma-4-and-what-makes-an-open-model</guid><dc:creator><![CDATA[Nathan Lambert]]></dc:creator><pubDate>Fri, 03 Apr 2026 16:57:36 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/c03578d7-2c0a-47cd-988e-c0e29008cc06_3182x1790.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Having written a lot of model release blog posts, I find there&#8217;s something much harder about reviewing open models when they drop relative to closed models, especially in 2026. In previous years, there were so few open models that when <a href="https://www.interconnects.ai/p/llama-3-and-scaling-open-llms">Llama 3</a> was released most people were still doing research on Llama 2 and super happy to get an update. When <a href="https://www.interconnects.ai/p/qwen-3-the-new-open-standard">Qwen 3</a> was released, the <a href="https://www.interconnects.ai/p/llama-4">Llama 4 fiasco</a> had just gone down, and a whole research community was <a href="https://www.interconnects.ai/p/rl-backlog-openais-many-rls-clarifying">emerging to study RL on Qwen 2.5</a> &#8212; it was a no-brainer to upgrade. </p><p>Today, when an open model releases, it&#8217;s competing with Qwen 3.5, Kimi K2.5, GLM 5, MiniMax M2.5, GPT-OSS, Arcee Large, Nemotron 3, Olmo 3, and others. The space is populated, but still feels full of hidden opportunity. The potential of open models feels like dark matter: we know it is huge, but few clear recipes and examples for how to unlock it are out there. 
Agentic AI, OpenClaw, and everything brewing in that space is going to spur mass experimentation in open models to <a href="https://www.interconnects.ai/p/the-next-phase-of-open-models">complement the likes of Claude and Codex</a>, not replace them.</p><p>Especially with open models, the benchmarks at release are an extremely incomplete story. In some ways this is exciting, as new open models have a much higher variance and ability to surprise, but it also points at some structural reasons that make building businesses and great AI experiences around open models harder than around the closed alternatives. When a new Claude Opus or GPT drops, spending a few hours with them in my agentic workflows is genuinely a good vibe test. For open models, putting them through this test is a category error.</p><p>Something else to be said about open models in the era of agents is that they step outside the debate over integration, harnesses, and tools, and let us see close to the ground exactly what the ability of just the model is. 
Of course, we can&#8217;t test some things like search abilities without some tool, but being able to measure exactly the pace of progress of the model alone is a welcome simplification in a systematically opaque AI space.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.interconnects.ai/p/gemma-4-and-what-makes-an-open-model?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.interconnects.ai/p/gemma-4-and-what-makes-an-open-model?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><p>The list of factors I&#8217;d use to assess a new open-weight model I&#8217;m considering investing in includes:</p><ol><li><p><strong>Model performance</strong> (and size) &#8212; how this model performs on benchmarks I care about and how it compares to other models of a similar size.</p></li><li><p><strong>Country of origin</strong> &#8212; some businesses care deeply about provenance, and whether a model was built in China or not.</p></li><li><p><strong>Model license</strong> &#8212; if a model needs legal approval for use, uptake will be slower at mid-sized and large companies.</p></li><li><p><strong>Tooling at release</strong> &#8212; many models release with half-broken, or at least substantially slower, implementations in popular software like vLLM, Transformers, SGLang, etc. due to pushing the envelope of architectures or tools.</p></li><li><p><strong>Model fine-tunability</strong> &#8212; how easy or hard it is to modify the given model to your use-case when you actually try to use it.</p></li></ol><p>The core problem is that some of these are immediately available at release, e.g. general performance, license, origin, etc. 
but others such as tooling take day(s) to week(s) to stabilize, and others are open research questions &#8212; with no group systematically monitoring fine-tunability. </p><p>In the early era of open models, the days of Llama 2 or 3 and Qwen pre-v3.5, the architectures were fairly simple and the models tended to work out of the box. Some of that ease was due to the extremely hard work of the Llama, Qwen, Mistral, etc. developer teams. Some of today&#8217;s friction is due to the new models being genuinely harder to work with. When it comes to something like Qwen 3.5 or Nemotron 3, with hybrid models (either gated delta net or mamba layers), the tooling is very rough at release. Things you would expect to &#8220;just work&#8221; often don&#8217;t.</p><p>I&#8217;ve been following this area closely since we released <a href="https://www.interconnects.ai/p/olmo-hybrid-and-future-llm-architectures">Olmo Hybrid</a> with a similar architecture, and Qwen 3.5 is just starting to work well in the various open-source tools that need to all play nice together for RL research. That&#8217;s 1.5 months after the release date! And that is just the starting point for really investing in understanding the behavior of the models. Of course, others started working on these models sooner by investing more engineering resources or relying on partially closed software. The fully open and distributed ecosystem takes a long time to get going on some new models.</p><p>All of this is lead-in for the most important question for open models &#8212; how easy is it to adapt them to specific use-cases? This is a different problem for different model sizes. Large MoE open-weight models may be used by entities like Cursor who need complex capabilities in their domain, e.g. <a href="https://cursor.com/blog/composer-2">Composer 2</a> trained on Kimi K2.5. Other applications can be built on much smaller models, such as Chroma&#8217;s <a href="https://huggingface.co/chromadb/context-1">Context-1</a> model for agentic search, built on GPT-OSS 20B. 
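The five assessment factors listed earlier can be thought of as a simple weighted scorecard. A toy sketch, where the weights and the example model&#8217;s scores are entirely hypothetical:

```python
# Toy weighted scorecard over the five open-model adoption factors.
# Weights and scores are hypothetical illustrations, not measured data.
WEIGHTS = {
    "performance": 0.30,
    "origin": 0.10,
    "license": 0.20,
    "tooling": 0.25,
    "fine_tunability": 0.15,
}

def adoption_score(scores: dict) -> float:
    """Weighted sum of per-factor scores, each in [0, 1]."""
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

# A hypothetical release: great license and benchmarks, rough day-one tooling,
# fine-tunability unknown (scored at a middling 0.5).
example = {"performance": 0.8, "origin": 1.0, "license": 1.0,
           "tooling": 0.4, "fine_tunability": 0.5}
print(round(adoption_score(example), 3))  # -> 0.715
```

The point of the sketch is that tooling and fine-tunability carry real weight even when benchmark performance is strong, which matches how rough day-one tooling has dragged down otherwise strong releases.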
</p><p>The question of &#8220;which models are fine-tunable&#8221; is largely background knowledge known by engineers across the industry. There should be a thriving research area here to support the open model ecosystem. The first step is to characterize different base and post-trained models to understand how they behave. The second step is to tune pretraining recipes for open models so they&#8217;re more flexible. </p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.interconnects.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Interconnects AI is a reader-supported publication. Consider becoming a subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>For <a href="https://atomproject.ai/">The ATOM Project</a> and other Interconnects endeavors, we&#8217;ve put substantial effort into measuring adoption trends in the open ecosystem. Everything takes a long time to unfold after a model is first publicly available &#8212; and adaptability is why. What we know for sure now, as Qwen has been going from strength to strength with its releases, is that technical staff across the industry have gotten comfortable working with Qwen models. Countless research methods and datasets were made to work with Qwen. 
It&#8217;ll take patience for any other model family to get to this point &#8212; a patience I&#8217;m not sure many open model builders have.</p><p>This takes us to <strong><a href="https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/">Gemma 4</a>, Google&#8217;s latest open models</strong>. Gemma 3 was released more than a year ago, in March of 2025, and is a bit underrated. Gemma 4 comes in 4 sizes for now, with a bigger MoE model of over 100B total parameters rumored but not released yet. The <a href="https://huggingface.co/collections/google/gemma-4">models</a> we have today come in sizes of ~5B dense, 8B dense, 26B-total/4B-active MoE, and 31B dense. </p><p>I&#8217;m most excited that they&#8217;re finally adopting a standard Apache 2.0 open source license. This&#8217;ll massively boost adoption. The standard of better licenses for strong open-weight LLMs was set mostly by Chinese open model labs in the last 1-2 years, and now U.S. companies are following suit. I will personally be so happy if the horrible <a href="https://www.llama.com/llama3/license/">Llama licenses</a> and <a href="https://ai.google.dev/gemma/terms">Gemma terms of service</a> turn out to have been an ~18-month transient dynamic of the industry being nervous about releasing strong open models.</p><p>The Gemma 4 scores look very solid: the small models have incredible benchmark scores (especially in general domains like <a href="https://x.com/demishassabis/status/2040067244349063326">LMArena</a>) and the 31B model rivals the recent Qwen 3.5 27B, which is the leading member of that class. The ~30B size range is an important one, as it&#8217;s accessible both to researchers and to enterprises looking to deploy the model in real use-cases. 
Where the 7B model scale is the default for tinkering and research, a 30B model is the default for seeing if an open model can unlock substantial value in your specific workflow &#8212; a good mix of intelligence, low price, tractability for downstream training, etc.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TDMh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F208e9bab-a2f8-4e5b-bed0-2db600993c41_4200x2400.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TDMh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F208e9bab-a2f8-4e5b-bed0-2db600993c41_4200x2400.png 424w, https://substackcdn.com/image/fetch/$s_!TDMh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F208e9bab-a2f8-4e5b-bed0-2db600993c41_4200x2400.png 848w, https://substackcdn.com/image/fetch/$s_!TDMh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F208e9bab-a2f8-4e5b-bed0-2db600993c41_4200x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!TDMh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F208e9bab-a2f8-4e5b-bed0-2db600993c41_4200x2400.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TDMh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F208e9bab-a2f8-4e5b-bed0-2db600993c41_4200x2400.png" width="1456" height="832" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/208e9bab-a2f8-4e5b-bed0-2db600993c41_4200x2400.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:832,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TDMh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F208e9bab-a2f8-4e5b-bed0-2db600993c41_4200x2400.png 424w, https://substackcdn.com/image/fetch/$s_!TDMh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F208e9bab-a2f8-4e5b-bed0-2db600993c41_4200x2400.png 848w, https://substackcdn.com/image/fetch/$s_!TDMh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F208e9bab-a2f8-4e5b-bed0-2db600993c41_4200x2400.png 1272w, https://substackcdn.com/image/fetch/$s_!TDMh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F208e9bab-a2f8-4e5b-bed0-2db600993c41_4200x2400.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Source: Sebastian Raschka, <a href="https://magazine.sebastianraschka.com/i/168650848/23-gemma-4">Ahead of AI</a></figcaption></figure></div><p>This takes us back to the above adoption criteria I mentioned for open models and the bigger question &#8212; do I think Gemma 4 will be an overwhelming success? Previous Gemma models have been <a href="https://chatgpt.com/share/69cfe648-bc88-83e8-a0b3-d23091d66ae8">plagued</a> by tooling issues and poorer performance when being finetuned. </p><p>Gemma 4&#8217;s success is going to be entirely determined by ease of use, to a point where a 5-10% swing on benchmarks wouldn&#8217;t matter at all. It&#8217;s strong enough, small enough, with the right license, and from the U.S., so many companies are going to slot it in.</p><p>I&#8217;m cautiously optimistic that Gemma 4 is going to work better here. Winds are shifting for open models built in America. 
We saw GPT-OSS go through a bumpy launch to become an overwhelming success. There&#8217;s a collective energy around the likes of Reflection, Arcee, Nemotron, Gemma, Olmo, and peers that shows substantial demand for building new stacks around open models. There&#8217;s capital to be spent on AI stacks across the economy by those who want more ownership of everything, including the model. </p><p>Since launching The ATOM Project 240 days ago, I&#8217;ve seen the conversation shift into the next stage. Summer of 2025 was a crisis moment where the U.S. AI scene realized it can&#8217;t wait to figure out open models until after building AGI. The two markets will capture different areas and proceed in parallel. Now that more companies in the U.S. are releasing strong models, we need to improve the ecosystem so that these models are easy to use, understand, and build value around. Building another inflection point into these adoption plots I&#8217;ve been updating consistently is hard work, but that&#8217;s the work to be done. Join me in it.  </p><p><em>More data coming soon! 
Here&#8217;s a sneak peek:</em></p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!-scL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2c147024-01bb-46d2-b54f-d5a88edd64fe_1716x766.png" width="1716" height="766" alt=""></figure></div>]]></content:encoded></item><item><title><![CDATA[Latest open artifacts (#20): New orgs! New types of models! With Nemotron Super, Sarvam, Cohere Transcribe, & others]]></title><description><![CDATA[New orgs! New types of models!
With Nemotron Super, Sarvam, Cohere Transcribe, & others]]></description><link>https://www.interconnects.ai/p/latest-open-artifacts-20-new-orgs</link><guid isPermaLink="false">https://www.interconnects.ai/p/latest-open-artifacts-20-new-orgs</guid><dc:creator><![CDATA[Florian Brand]]></dc:creator><pubDate>Mon, 30 Mar 2026 13:02:45 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!uD-D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80b1a559-a0f5-4757-962c-1d9a21c17835_1024x576.png" length="0" type="image/png"/><content:encoded><![CDATA[<p>This Artifacts Log post is unusual in how many diverse, quirky models it covers across use-cases and modalities. Normally these model roundups are dominated by big models from the likes of Qwen, DeepSeek, and Kimi. This post instead features models for all sorts of use-cases, spanning optical character recognition (OCR), RAG search, audio transcription, computer use, code editing, math theorem proving, and more. The artifacts covered this month also come from a much broader list of open model builders.</p><p>This gives us a lot of hope for the future of open models, where we see <a href="https://www.interconnects.ai/i/190338833/the-balance-of-power-in-open-vs-closed-models">the need for domain-specific, cheap models</a> as crucial tools to complement the strongest closed agents. When the top few models get the headlines, this vast, industry-scale tinkering is easily forgotten. This post offers technically grounded, broad coverage of the many directions in which the industry is pushing specialized models.
Expect more like this!</p><p>To encourage people to take a look at the diversity of models in this issue, the core part of the update is not paywalled. An otherwise quiet month at the top end of open models really delivered.</p><h1>Artifacts Log</h1><h3><strong>Our Picks</strong></h3><ul><li><p><strong><a href="https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4">NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4</a></strong> by <a href="https://huggingface.co/nvidia">nvidia</a>: The long-awaited mid-sized model from NVIDIA is finally here: 120B total params with 12B active, a 1M context window, and support for multiple popular languages. Furthermore, the model is based on LatentMoE and uses NVFP4 during pre-training, which is a first for open models.
As with other NVIDIA releases, it comes with an in-depth <a href="https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Super-Technical-Report.pdf">tech report</a> plus <a href="https://huggingface.co/collections/nvidia/nemotron-pre-training-datasets">pre-training</a> and <a href="https://huggingface.co/collections/nvidia/nemotron-post-training-v3">post-training</a> datasets, with the vast majority of the data openly released.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!4nWL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b56bd55-79e6-483a-ab8d-c3b193e89b84_1152x432.png" width="1152" height="432" alt=""></figure></div></li><li><p><strong><a href="https://huggingface.co/CohereLabs/cohere-transcribe-03-2026">cohere-transcribe-03-2026</a></strong> by <a href="https://huggingface.co/CohereLabs">CohereLabs</a>: A speech-to-text model by Cohere based on the <a href="https://arxiv.org/abs/2005.08100">conformer architecture</a>, similar to NVIDIA&#8217;s Parakeet. It supports 14 languages, including some APAC languages and Arabic. Performance-wise, Cohere claims it beats similarly sized open and closed models. To top it all off, the model is released under Apache 2.0!
Previous open models by Cohere were released under a non-commercial license.</p></li><li><p><strong><a href="https://huggingface.co/sarvamai/sarvam-105b">sarvam-105b</a></strong> by <a href="https://huggingface.co/sarvamai">sarvamai</a>: The Indian startup Sarvam, which has trained open models in the past, has scaled up everything for its new flagship models, in both dataset size (12-16T tokens) and model size (<a href="https://huggingface.co/sarvamai/sarvam-30b">30B-A2B</a>, 105B-A10B). As a result, they approach or even surpass many similarly sized open models. The release also shows why sovereign AI is so important, something few other countries have internalized yet: compared with SOTA open models, the Sarvam models are vastly preferred in Indic languages.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!YsFz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fccf5cd5d-dbf1-4483-be6e-9bdc70554af3_2010x814.png" width="681" height="276" alt=""></figure></div></li><li><p><strong><a href="https://huggingface.co/mistralai/Mistral-Small-4-119B-2603">Mistral-Small-4-119B-2603</a></strong> by
<a href="https://huggingface.co/mistralai">mistralai</a>: A 119B-A7B model by Mistral, combining its previous model generations into a single hybrid reasoning model with coding abilities.</p></li><li><p><strong><a href="https://huggingface.co/zed-industries/zeta-2">zeta-2</a></strong> by <a href="https://huggingface.co/zed-industries">zed-industries</a>: The open source code editor Zed has released its edit prediction model openly in the past, which we featured <a href="https://www.interconnects.ai/p/artifacts-7">a year ago</a>. While the previous version was based on open data, the new version, based on Seed-Coder-8B, is trained on open source code from users who explicitly opted into data collection.</p></li></ul><h3><strong>Models</strong></h3><h4>General Purpose</h4><ul><li><p><strong><a href="https://huggingface.co/nvidia/gpt-oss-puzzle-88B">gpt-oss-puzzle-88B</a></strong> by <a href="https://huggingface.co/nvidia">nvidia</a>: An expert-pruned version of GPT-OSS 120B. It also replaces some global attention layers with window attention. Puzzle is &#8220;a post-training neural architecture search (NAS) framework, with the goal of significantly improving inference efficiency for reasoning-heavy workloads while maintaining or improving accuracy across reasoning budgets.&#8221;</p></li><li><p><strong><a href="https://huggingface.co/allenai/Olmo-Hybrid-7B">Olmo-Hybrid-7B</a></strong> by <a href="https://huggingface.co/allenai">allenai</a>: A hybrid attention + GDN (gated DeltaNet) model.
See <a href="https://www.interconnects.ai/p/olmo-hybrid-and-future-llm-architectures">our blog post</a> for more insights about the architecture and its challenges.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!IgMs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F072051d6-3788-4ab4-9587-c051f282b3b8_2906x2370.png" width="1456" height="1187" alt=""></figure></div></li><li><p><strong><a href="https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16">NVIDIA-Nemotron-3-Nano-4B-BF16</a></strong> by <a href="https://huggingface.co/nvidia">nvidia</a>: A compressed version of NVIDIA-Nemotron-Nano-9B-v2, which itself is a compressed version of NVIDIA-Nemotron-Nano-12B-v2. NVIDIA has been pushing this direction more than anyone else with open models.</p></li></ul><h4>Multimodal</h4><ul><li><p><strong><a href="https://huggingface.co/YuanLabAI/Yuan3.0-Ultra">Yuan3.0-Ultra</a></strong> by <a href="https://huggingface.co/YuanLabAI">YuanLabAI</a>: A 1T multimodal model by the relatively unknown Yuan Lab.
They pre-trained a 1.5T model on 2.2T tokens and subsequently pruned experts with a new technique, outlined in the <a href="https://github.com/Yuan-lab-LLM/Yuan3.0-Ultra/blob/main/Docs/Yuan3.0_Ultra%20Paper.pdf">tech report</a>.</p></li><li><p><strong><a href="https://huggingface.co/meituan-longcat/LongCat-Next">LongCat-Next</a></strong> by <a href="https://huggingface.co/meituan-longcat">meituan-longcat</a>: A multimodal model which can process text, vision, and audio as both inputs and outputs.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!cxaI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdda76903-ad4c-4999-9cb6-c6a9cf2babe8_3437x1929.jpeg" width="677" height="380" alt="evaluation"></figure></div></li><li><p><strong><a href="https://huggingface.co/ibm-granite/granite-4.0-1b-speech">granite-4.0-1b-speech</a></strong> by <a href="https://huggingface.co/ibm-granite">ibm-granite</a>: A small speech-to-text model supporting six languages.
It can also generate English audio for translation.</p></li><li><p><strong><a href="https://huggingface.co/microsoft/Phi-4-reasoning-vision-15B">Phi-4-reasoning-vision-15B</a></strong> by <a href="https://huggingface.co/microsoft">microsoft</a>: A Phi model which uses the SigLIP-2 vision encoder.</p></li></ul><h4>Special Purpose</h4><ul><li><p><strong><a href="https://huggingface.co/miromind-ai/MiroThinker-1.7">MiroThinker-1.7</a></strong> by <a href="https://huggingface.co/miromind-ai">miromind-ai</a>: A fine-tuned version of Qwen 235B for agentic workflows, especially research.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!6raY!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a1f79c7-bbeb-4a80-b457-5a9c497c363e_2852x1352.png" width="684" height="324" alt=""></figure></div></li><li><p><strong><a href="https://huggingface.co/Prior-Labs/tabpfn_2_6">tabpfn_2_6</a></strong> by <a href="https://huggingface.co/Prior-Labs">Prior-Labs</a>: An update to the popular tabular prediction model, which is slightly larger than its predecessor.
Its license allows research and internal evaluation only.</p></li><li><p><strong><a href="https://huggingface.co/facebook/sam3.1">sam3.1</a></strong> by <a href="https://huggingface.co/facebook">facebook</a>: An update to SAM 3, carrying the same restrictive license.</p></li><li><p><strong><a href="https://huggingface.co/Hcompany/Holotron-12B">Holotron-12B</a></strong> by <a href="https://huggingface.co/Hcompany">Hcompany</a>: A policy model for CUA agents.</p></li><li><p><strong><a href="https://huggingface.co/meituan-longcat/LongCat-Flash-Prover">LongCat-Flash-Prover</a></strong> by <a href="https://huggingface.co/meituan-longcat">meituan-longcat</a>: A Lean4 fine-tune of the large LongCat model.</p></li><li><p><strong><a href="https://huggingface.co/mistralai/Leanstral-2603">Leanstral-2603</a></strong> by <a href="https://huggingface.co/mistralai">mistralai</a>: A Lean4 fine-tune of the new Mistral Small 4.</p></li><li><p><strong><a href="https://huggingface.co/RekaAI/reka-edge-2603">reka-edge-2603</a></strong> by <a href="https://huggingface.co/RekaAI">RekaAI</a>: A model for robotics, beating models such as Cosmos-Reason2. Its noncommercial license converts into Apache 2.0 after two years.</p></li></ul><h4>RAG</h4><ul><li><p><strong><a href="https://huggingface.co/baidu/Qianfan-OCR">Qianfan-OCR</a></strong> by <a href="https://huggingface.co/baidu">baidu</a>: There have been a lot of great OCR models lately. This one is from Baidu and is licensed under Apache 2.0.</p></li><li><p><strong><a href="https://huggingface.co/datalab-to/chandra-ocr-2">chandra-ocr-2</a></strong> by <a href="https://huggingface.co/datalab-to">datalab-to</a>: An update to the Chandra OCR model, released under a restrictive license.</p></li><li><p><strong><a href="https://huggingface.co/lightonai/Reason-ModernColBERT">Reason-ModernColBERT</a></strong> by <a href="https://huggingface.co/lightonai">lightonai</a>: A SOTA retrieval model released under a non-commercial license. 
However, there is also code to re-generate the data, allowing the training of a commercially viable version.</p></li><li><p><strong><a href="https://huggingface.co/chromadb/context-1">context-1</a></strong> by <a href="https://huggingface.co/chromadb">chromadb</a>: A fine-tuned version of GPT-OSS for agentic search with an in-depth <a href="https://www.trychroma.com/research/context-1">tech report</a>. It also marks Chroma&#8217;s debut in the open model space. Trained with Thinking Machines&#8217; <a href="https://thinkingmachines.ai/tinker/">Tinker</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_sEq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F646905bb-983f-4b3c-88e4-3eaa805613a4_3250x1640.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_sEq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F646905bb-983f-4b3c-88e4-3eaa805613a4_3250x1640.png 424w, https://substackcdn.com/image/fetch/$s_!_sEq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F646905bb-983f-4b3c-88e4-3eaa805613a4_3250x1640.png 848w, https://substackcdn.com/image/fetch/$s_!_sEq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F646905bb-983f-4b3c-88e4-3eaa805613a4_3250x1640.png 1272w, https://substackcdn.com/image/fetch/$s_!_sEq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F646905bb-983f-4b3c-88e4-3eaa805613a4_3250x1640.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!_sEq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F646905bb-983f-4b3c-88e4-3eaa805613a4_3250x1640.png" width="1456" height="735" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/646905bb-983f-4b3c-88e4-3eaa805613a4_3250x1640.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:735,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Chroma Context-1: Training a Self-Editing Search Agent&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Chroma Context-1: Training a Self-Editing Search Agent" title="Chroma Context-1: Training a Self-Editing Search Agent" srcset="https://substackcdn.com/image/fetch/$s_!_sEq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F646905bb-983f-4b3c-88e4-3eaa805613a4_3250x1640.png 424w, https://substackcdn.com/image/fetch/$s_!_sEq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F646905bb-983f-4b3c-88e4-3eaa805613a4_3250x1640.png 848w, https://substackcdn.com/image/fetch/$s_!_sEq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F646905bb-983f-4b3c-88e4-3eaa805613a4_3250x1640.png 1272w, https://substackcdn.com/image/fetch/$s_!_sEq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F646905bb-983f-4b3c-88e4-3eaa805613a4_3250x1640.png 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li><li><p><strong><a href="https://huggingface.co/rednote-hilab/dots.mocr">dots.mocr</a></strong> by <a href="https://huggingface.co/rednote-hilab">rednote-hilab</a>: The beloved dots.ocr model has been updated and supports SVG outputs. However, on top of the general MIT license, the model comes with additional usage restrictions, just like its predecessor.</p></li></ul>
      <p>
          <a href="https://www.interconnects.ai/p/latest-open-artifacts-20-new-orgs">
              Read more
          </a>
      </p>
]]></content:encoded></item><item><title><![CDATA[Lossy self-improvement]]></title><description><![CDATA[The case for why self-improvement is real but it doesn't lead to fast takeoff.]]></description><link>https://www.interconnects.ai/p/lossy-self-improvement</link><guid isPermaLink="false">https://www.interconnects.ai/p/lossy-self-improvement</guid><dc:creator><![CDATA[Nathan Lambert]]></dc:creator><pubDate>Sun, 22 Mar 2026 19:39:40 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/720ffbb3-46c3-4ebe-9b0d-62985c025698_3182x1790.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Fast takeoff, the singularity, and recursive self-improvement (RSI) are all top of mind in AI circles these days. There are elements of truth to them in what&#8217;s happening in the AI industry. Two, maybe three, labs are consolidating as an oligopoly with access to the best AI models (and the resources to build the next ones). The AI tools of today are abruptly transforming engineering and research jobs.</p><p>AI research is becoming much easier in many ways. The technical problems that need to be solved to scale training large language models even further are formidable. Super-human coding assistants making these problems approachable breaks many prior assumptions about what building such models entails. Together this is setting us up for a year (or more) of rapid progress at the cutting edge of AI.</p><p>We&#8217;re also at a time when language models are already extremely good. They&#8217;re in fact good enough for plenty of extremely valuable knowledge-work tasks. Language models taking another big step is hard to imagine &#8212; it&#8217;s unclear which tasks they&#8217;re going to master this year outside of code and CLI-based computer-use. There will be some new ones! 
These capabilities unlock new styles of working that&#8217;ll send more ripples through the economy.</p><p>These dramatic changes almost make it seem like a foregone conclusion that language models can then just keep accelerating progress on their own. The popular language for this is a recursive self-improvement loop. Early writing on the topic dates back to the 2000s, such as Eliezer Yudkowsky&#8217;s 2008 <a href="https://www.lesswrong.com/posts/JBadX7rwdcRFzGuju/recursive-self-improvement">blog post</a> devoted entirely to the topic: </p><blockquote><p>Recursion is the sort of thing that happens when you hand the AI the object-level problem of &#8220;redesign your own cognitive algorithms&#8221;.</p></blockquote><p>Slightly earlier, in 2007, Yudkowsky defined the related idea of a Seed AI in <em><a href="https://intelligence.org/files/LOGI.pdf">Levels of Organization in General Intelligence</a></em>:</p><blockquote><p>A seed AI is an AI designed for self-understanding, self-modification, and recursive self-improvement. This has implications both for the functional architectures needed to achieve primitive intelligence, and for the later development of the AI if and when its holonic self-understanding begins to improve. Seed AI is not a workaround that avoids the challenge of general intelligence by bootstrapping from an unintelligent core; seed AI only begins to yield benefits once there is some degree of available intelligence to be utilized. 
The later consequences of seed AI (such as true recursive self-improvement) only show up after the AI has achieved significant holonic understanding and general intelligence.</p></blockquote><p>It&#8217;s reasonable to think we&#8217;re at the start here, with how general and useful today&#8217;s models are.</p><p>Generally, RSI describes a scenario where an AI can improve itself, the improved version can improve itself even more efficiently, and the result is a closed amplification loop that leads to an intelligence explosion, often referred to as the singularity. There are a few assumptions in this. For RSI to occur, it needs to be that:</p><ol><li><p>The loop is closed. Models can keep improving on themselves and beget more models.</p></li><li><p>The loop is self-amplifying. The next models will yield even bigger improvements than the current ones.</p></li><li><p>The loop continues to run without losing efficiency. There are no added sources of friction that knee-cap the exponential into an early sigmoid.</p></li></ol><p>While I agree that momentous, socially destabilizing changes are coming in the next few years from sustained AI improvements, I expect the trend line of progress to look more linear than exponential when we reflect back. Instead of recursive self-improvement, it will be <strong>lossy self-improvement</strong> (LSI) &#8211; the models become core to the development loop, but friction breaks down all the core assumptions of RSI. The more compute and agents you throw at a problem, the more loss and repetition show up.</p><p>I&#8217;m still a believer that the complexity brake on advanced systems will be a strong counterbalance to the reality that AI models are getting substantially better at every narrow task we need to compose together in making a leading AI model. I quoted this previously in <a href="https://www.interconnects.ai/p/brakes-on-an-intelligence-explosion?open=false#%C2%A72-current-ai-is-broad-not-narrow-intelligence">April of 2025 in response to AI 2027</a>.</p><blockquote><p>Microsoft co-founder Paul Allen argued the opposite of accelerating returns, the <strong>complexity brake:</strong> the more progress science makes towards understanding intelligence, the more difficult it becomes to make additional progress. A study of the number of patents shows that human creativity does not show accelerating returns, but in fact, as suggested by Joseph Tainter in his The Collapse of Complex Societies, a law of diminishing returns. The number of patents per thousand peaked in the period from 1850 to 1900, and has been declining since. The growth of complexity eventually becomes self-limiting, and leads to a widespread &#8220;general systems collapse&#8221;.</p></blockquote><p>How models are already trained, the deep intuitions we need to get them right, and the organizations that build them all show where the losses will come from. Building leading language models is incredibly complex, and only becoming more so. There are a few core frictions in my mind.</p><p><strong>1. 
Automatable research is too narrow</strong></p><p>First, it is clear that language models this year will already be useful tools for optimizing localized tasks like lowering the test loss of a model. Andrej Karpathy recently launched his <a href="https://github.com/karpathy/autoresearch">autoresearch</a>, which popularized doing just this. It lets AI agents work directly on GPUs to target tasks like lowering the loss on the test set. This approach works in narrow domains, i.e., one general test loss or one overall reward. The problem is that there&#8217;s a long-standing gap between an on-paper more accurate model and models that users find more productive. The most provocative case is pretraining, which has been discussed at length in the context of scaling laws. Scaling laws show us that the loss will continue going down, but <a href="https://www.interconnects.ai/p/scaling-realities">we don&#8217;t know if that&#8217;ll be economically more</a> valuable.</p><p>In post-training, reinforcement learning algorithms are at least more directly tied to <em>specific</em> performance gains, as most RL training environments can be used directly as an evaluation. Still, I worry about generalization and tying back to models that are better at the specific task of improving themselves. It&#8217;s a big leap from models getting better at some things to that necessarily translating into models that are better at building themselves and designing experiments. We&#8217;ve seen many AI capabilities sort of saturate at certain levels of human taste, such as writing quality. AI research is a bit different here, as there is a very high ceiling to climb up to. 
Where models mostly saturate on writing because there&#8217;s inherent tension in preferences, models will saturate on research because the search space and optimization targets are too wide.</p><p>The <a href="https://arxiv.org/abs/2603.08640">early benchmarks</a> for measuring this sort of ability all fall prey to the same problem &#8211; narrow scope. Agents will do well at optimizing single metrics, but the leap required to navigate many metrics at once is a very different skill set. That is actually what the best researchers do &#8212; they make many scalable ideas work <em>together</em>.</p><p>The closest benchmark we have for measuring this is PostTrainBench, which is quite fun, but progress on it will very rapidly become distorted. Over 90% of the challenge in doing post-training well is getting the last 1-3% of performance, especially without cooking the model on out-of-domain tasks. Post-training a general, leading model is extremely complex, and only getting more complex. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Hrz3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab92b9f-776f-4bbe-98fc-36ca74c25dfd_2004x844.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Hrz3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab92b9f-776f-4bbe-98fc-36ca74c25dfd_2004x844.png 424w, https://substackcdn.com/image/fetch/$s_!Hrz3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab92b9f-776f-4bbe-98fc-36ca74c25dfd_2004x844.png 848w, 
https://substackcdn.com/image/fetch/$s_!Hrz3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab92b9f-776f-4bbe-98fc-36ca74c25dfd_2004x844.png 1272w, https://substackcdn.com/image/fetch/$s_!Hrz3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab92b9f-776f-4bbe-98fc-36ca74c25dfd_2004x844.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Hrz3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab92b9f-776f-4bbe-98fc-36ca74c25dfd_2004x844.png" width="1456" height="613" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eab92b9f-776f-4bbe-98fc-36ca74c25dfd_2004x844.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:613,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:251287,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.interconnects.ai/i/191707266?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab92b9f-776f-4bbe-98fc-36ca74c25dfd_2004x844.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Hrz3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab92b9f-776f-4bbe-98fc-36ca74c25dfd_2004x844.png 424w, 
https://substackcdn.com/image/fetch/$s_!Hrz3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab92b9f-776f-4bbe-98fc-36ca74c25dfd_2004x844.png 848w, https://substackcdn.com/image/fetch/$s_!Hrz3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab92b9f-776f-4bbe-98fc-36ca74c25dfd_2004x844.png 1272w, https://substackcdn.com/image/fetch/$s_!Hrz3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feab92b9f-776f-4bbe-98fc-36ca74c25dfd_2004x844.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>I could go on and on about this. Another example comes from my Ph.D. (2017-2022), when there was immense hype around a field called &#8220;<a href="https://www.automl.org/automl/">AutoML</a>&#8221; which aimed to use techniques like Bayesian optimization to find new architectures and parameters for models. The hype never translated into changing my job. Language models will do more than this, but not enough to take jobs away from top AI researchers any time soon. The core currency of researchers is still intuition and managing complexity, rather than specific optimization and implementation. </p><p><strong>2. Diminishing returns of more AI agents in parallel</strong></p><p>The biggest problem for rapid improvement in AI is that even though we&#8217;ll have 10,000 remote workers in a datacenter, it&#8217;ll be nearly impossible to channel all of them at one problem. Inherently, especially when the models are still so similar, they&#8217;re sampling from the same distribution of solutions and capabilities while being bottlenecked by human supervision. Adding more agents will face strict saturation in the amount of marginal performance that can be added &#8211; the intuition of the best few researchers (and time to run experiments) will be the final bottleneck.</p><p>A common idea to illustrate this is <a href="https://en.wikipedia.org/wiki/Amdahl%27s_law">Amdahl&#8217;s law</a>, taken from computer architecture, which shows that the speedup from parallelizing a task is capped by its serial fraction, no matter how many parallel workers are added.
An illustration is below:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lX1X!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7157f5b9-2b12-45ab-8bff-0771fa034c6d_3840x3000.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lX1X!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7157f5b9-2b12-45ab-8bff-0771fa034c6d_3840x3000.png 424w, https://substackcdn.com/image/fetch/$s_!lX1X!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7157f5b9-2b12-45ab-8bff-0771fa034c6d_3840x3000.png 848w, https://substackcdn.com/image/fetch/$s_!lX1X!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7157f5b9-2b12-45ab-8bff-0771fa034c6d_3840x3000.png 1272w, https://substackcdn.com/image/fetch/$s_!lX1X!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7157f5b9-2b12-45ab-8bff-0771fa034c6d_3840x3000.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lX1X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7157f5b9-2b12-45ab-8bff-0771fa034c6d_3840x3000.png" width="1456" height="1138" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7157f5b9-2b12-45ab-8bff-0771fa034c6d_3840x3000.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1138,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:304903,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.interconnects.ai/i/191707266?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7157f5b9-2b12-45ab-8bff-0771fa034c6d_3840x3000.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lX1X!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7157f5b9-2b12-45ab-8bff-0771fa034c6d_3840x3000.png 424w, https://substackcdn.com/image/fetch/$s_!lX1X!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7157f5b9-2b12-45ab-8bff-0771fa034c6d_3840x3000.png 848w, https://substackcdn.com/image/fetch/$s_!lX1X!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7157f5b9-2b12-45ab-8bff-0771fa034c6d_3840x3000.png 1272w, https://substackcdn.com/image/fetch/$s_!lX1X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7157f5b9-2b12-45ab-8bff-0771fa034c6d_3840x3000.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In AI this should be relatively easier to convey, as the low-level operating details of computers are fairly mysterious. Consider an AI researcher on the transition from writing code by hand to using AI autocomplete assistance to now using autonomous coding agents. These are all massive gains. Let us continue. Now this researcher uses 3-4 agents working on different sub-tasks or approaches to the problem at hand. This is still a large gain. Now consider this single researcher trying to organize 30-40 agents with tasks to do every day. Some people can get more value out of this scale, but not many.</p><p>How many people do you think could come up with 300-400 tasks for AI agents every day? Not many. This problem will hit the AI models soon enough as well.</p><p><strong>3. 
Resource bottlenecks and politics</strong></p><p>Fundamentally, all the AI companies are walking a fine line of acquiring substantial capital, converting new compute resources to revenue via sufficient demand, and repeating the process, all the while spending an extreme amount on research. With the scale of resources here, there will always be political bottlenecks on who gets resources and what gets bet on. In this layer, research leadership sits above the AIs and the researchers. Even as models continue to improve, this source of friction will never be removed. It isn&#8217;t a substantial friction, but the AI models are fundamentally operating in organizations where humans are the bottleneck on resources. </p><p>The early improvements from language models are local optimizations, where the resources used cost &lt;$1M per day. Given my other views on the frictions of AI, this is on its own a very minor impact on the rate of improvement, but for those with worries of fast takeoff, RSI, and loss of control to AIs, it should be obvious that billions of dollars of compute resources for research are unlikely to be totally isolated for end-to-end experimentation of AI models. </p><div><hr></div><p>The conclusion here is that because we&#8217;re at the early stages of using AI assistance, autonomously and at scale, for AI development, we&#8217;re collectively discovering the ways that AI can help us massively. 
We&#8217;re all applying these tools to capture the low-hanging fruit we see, and our jobs are literally changing to be higher-paced and more productive. The problem is that all of these axes have clear human, political, or technical complexity bottlenecks.</p><p>The bottom of every sigmoid feels like an exponential. We&#8217;ve ridden multiple exponentials in the era of language models: in 2023 we scaled to huge models and GPT-4 felt like magic; by 2025 we added inference-time scaling with o1 and reasoning models that let us &#8220;solve&#8221; math and coding; now we&#8217;re going to take a big step by polishing the entire AI workflow (all the while scaling training compute massively). 2026 will feel like a huge step, but it doesn&#8217;t bring a fundamental change that convinces me progress will begin to take off.</p><p>This could still cross the colloquial threshold for AGI, a drop-in replacement for most remote workers, which would be an incredible milestone. Much of the challenge in the debate over whether we hit AGI in the coming years is that AI models are jagged and smart in different ways than humans, so they won&#8217;t look like drop-in replacements for remote workers, but in many cases just using AI will be far more effective than trying to work with a human. It&#8217;s reshaping what jobs are.</p><p>Let us consider the scenarios we&#8217;re working through.</p><ol><li><p>Engineering is becoming automated today. Humans are way more productive; models can scale through complex infrastructure deployments much faster, run with higher GPU utilization, etc. Infrastructure gains become fixed improvements in the rate and scale of experimentation, the fundamental units of progress in AI.</p></li><li><p>Basic AI model research and optimization will be automated. The AI models are expanding in scope &#8211; they transition from writing kernels to deciding on architectures. 
This is moving from improving the experimentation toolkit to running minor experiments themselves. Configs, hyperparameters, etc. become the domain of the AI assistants.</p></li></ol><p>These are both real. The problem is that a third era doesn&#8217;t have a simple scale to jump to. Where the AI models can create knowledge by synthesis and execution, the next jump requires harnessing thousands of agents or having models make more novel discoveries &#8211; like unlocking the next paradigm after inference-time scaling. The improvements downstream of AI are going to make the industry supercharged at hill climbing, but I worry that this won&#8217;t bring the paradigm shifts that are needed for new categories of AI &#8211; continual learning, world models, whatever your drug of choice is.</p><p>Altogether, the models are becoming core to the development loop and that&#8217;s worth being excited (and worried) about. The models <em>are </em>performing self-improvement. They&#8217;re not transforming the approach. We <em>are</em> scaling up the compute we spend on our own research practices and tools. There are diminishing returns. Agents <em>are</em> going to start being autonomous entities we work with. They feel like a cross between a genius and a 5-year-old. We will be in this era of lossy self-improvement (LSI) for a few years, but it is not enough for a fast takeoff. 
</p>]]></content:encoded></item><item><title><![CDATA[GPT 5.4 is a big step for Codex]]></title><description><![CDATA[On evaluating and understanding the frontier of agents, and why I still turn to Claude.]]></description><link>https://www.interconnects.ai/p/gpt-54-is-a-big-step-for-codex</link><guid isPermaLink="false">https://www.interconnects.ai/p/gpt-54-is-a-big-step-for-codex</guid><dc:creator><![CDATA[Nathan Lambert]]></dc:creator><pubDate>Wed, 18 Mar 2026 13:02:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!49G2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce867601-79e5-4519-9e6d-8ae221c08f0b_2400x1800.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I&#8217;m a little late to this model review, but that has given me more time to think about the axes that matter for agents. Traditional benchmarks reduce model performance to a single score of correctness &#8211; they always have, because that is simple and easy to use to quickly gauge performance. This is also advice that I give to people trying to build great benchmarks &#8211; a benchmark needs to reduce to one interpretable number. This is likely still going to be true in a year or two, and benchmarks for agents will be better, but for the time being it doesn&#8217;t really map to what we feel, because agentic tasks are all about a mix of correctness, ease of use, speed, and cost. Eventually benchmarks will individually address these.</p><p>Where GPT 5.4 feels like another incremental model on some on-paper benchmarks, in practice it feels like a meaningful step in all four of those traits. 
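</p><p>As a toy illustration of why one number hides so much, the four traits can be folded into a single composite with arbitrary weights; every model name, weight, and score below is hypothetical:</p>

```python
# Toy composite score for agent evaluations: fold correctness, ease of use,
# speed, and cost into one number. All names, weights, and trait values
# here are made up for illustration.

WEIGHTS = {"correctness": 0.5, "ease": 0.2, "speed": 0.2, "cost": 0.1}

def composite(traits):
    """Weighted sum of trait scores, each pre-normalized to [0, 1]."""
    return sum(WEIGHTS[k] * traits[k] for k in WEIGHTS)

models = {
    "agent-a": {"correctness": 0.80, "ease": 0.90, "speed": 0.60, "cost": 0.70},
    "agent-b": {"correctness": 0.85, "ease": 0.60, "speed": 0.90, "cost": 0.90},
}

# Rank models by composite score, best first.
ranked = sorted(models, key=lambda m: composite(models[m]), reverse=True)
```

<p>Shift the weights and the ranking can flip, which is exactly the information a single published number throws away.</p><p>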
GPT 5.4 in Codex, always on fast mode and high or extra-high effort, is the first OpenAI agent that feels like it can do a lot of the random things you throw at it.</p><p>I haven&#8217;t been particularly deep in software engineering over the last few months, so most of my work with agents has been smaller projects (not totally one-off, but small enough that I&#8217;ve built the entire thing and manage the design over weeks), data analysis, and research tasks. When you embrace being agent-native, this style of work entails a lot of regular APIs, background packages (like installing and managing LaTeX binaries, ffmpeg, multimedia conversion tools, etc.), git operations, file management, search, etc. Prior to GPT 5.4, I always churned off of OpenAI&#8217;s agents due to a death by a thousand cuts. It often ended in rage quits. I&#8217;d feel like I was getting into GPT 5.2 Codex, but it would fail on a git operation and have me (or Claude) need to reset it. 
Those hard edges are no longer there.</p><p>The other subtle change in GPT 5.4&#8217;s approachability &#8211; the biggest reason I think OpenAI is firmly back in the agent wars &#8211; is that it just feels a bit more &#8220;right.&#8221; I classify this differently from the routine tasks I discussed above, and it has to do with how the product (i.e. the model harness) presents the model outputs, requests, and all that to you, the user. It has to do with how easy it is to dive in. This has always been Claude&#8217;s biggest strength, and a driver of its astronomical growth. Not only has Claude been immensely useful, but it has a charm and entertainment value that&#8217;ll make new people stick around. GPT 5.4 has a bit of that, but the underlying model strengths of Claude still leave it feeling warmer.</p><p>Where Claude is a super smart model, with character, a turn of phrase in a debate, and sometimes forgetting something, OpenAI&#8217;s models in Codex feel meticulous, slightly cold, and deeply mechanical. I&#8217;d use Claude for things I need more of an opinion on and GPT 5.4 to churn through an overwhelmingly specific TODO list. The instruction following of GPT 5.4 is so precise that I need to learn to interact with the models differently after spending so much time with Claude. In some domains, you come to see that Claude has an excellent model of your intent. GPT 5.4 just does what you tell it to do. 
These are very different philosophies of &#8220;what will make the best model for an agent&#8221;: Claude will likely appeal to newcomers, while GPT 5.4 will likely appeal to the master agent coordinator that wants to unleash their AI army on distributed tasks.</p><p>Outside of charm, and dare I say taste, a lot of the usability factors are actually better on OpenAI&#8217;s side of the world. The Codex app is compelling &#8211; I don&#8217;t always use it, but sometimes I totally love it. I suspect substantial innovation is coming in what these apps look like. Personally, I expect them to eventually look like Slack (when multiple agents need to talk to each other, under my watch).</p><p>OpenAI also natively offers fast mode for their models with a subscription and very large rate limits. I&#8217;ve been on the $100/month Claude plan and $200/month ChatGPT plan for quite some time. I&#8217;ve never been remotely close to my Codex limits with fast mode and xhigh reasoning effort, whereas I hit my Claude limits from time to time. There&#8217;s definitely a modeling reason for this &#8211; most of OpenAI&#8217;s release blogs showcase each iterative model being substantially more concise in the number of tokens it takes to get peak benchmark performance. This is a measure of reasoning efficiency. 
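</p><p>Treating score and tokens-to-get-there as a two-dimensional benchmark reduces to a Pareto frontier; here is a minimal sketch in which every model name and number is hypothetical:</p>

```python
# Reasoning efficiency as a 2D benchmark: benchmark score vs. tokens spent.
# A model sits on the Pareto frontier if no other model scores at least as
# high while using no more tokens. All entries below are hypothetical.

def pareto_frontier(models):
    """models maps name -> (score, avg_tokens). Returns frontier names, sorted."""
    frontier = []
    for name, (score, tokens) in models.items():
        dominated = any(
            s >= score and t <= tokens and (s > score or t < tokens)
            for other, (s, t) in models.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier)

models = {
    "model-a": (62.0, 40_000),  # strongest, but verbose
    "model-b": (60.0, 18_000),  # slightly weaker, far more concise
    "model-c": (55.0, 30_000),  # worse than model-b on both axes
}
```

<p>In this sketch, model-c never makes the frontier because it is both weaker and more verbose than model-b, which is precisely the comparison a score-only leaderboard cannot express.</p><p>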
This 2D (or more) benchmark picture is exactly where the world is going.</p><p>Here&#8217;s a <a href="https://cursor.com/blog/cursorbench">plot from Cursor</a>, which sadly doesn&#8217;t have all the GPT 5.4 reasoning efforts, but it confirms this point in a third party evaluation. What is missing across model families is the <em>speed</em> and price (a proxy for total compute used) to get there.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!49G2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce867601-79e5-4519-9e6d-8ae221c08f0b_2400x1800.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!49G2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce867601-79e5-4519-9e6d-8ae221c08f0b_2400x1800.webp 424w, https://substackcdn.com/image/fetch/$s_!49G2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce867601-79e5-4519-9e6d-8ae221c08f0b_2400x1800.webp 848w, https://substackcdn.com/image/fetch/$s_!49G2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce867601-79e5-4519-9e6d-8ae221c08f0b_2400x1800.webp 1272w, https://substackcdn.com/image/fetch/$s_!49G2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce867601-79e5-4519-9e6d-8ae221c08f0b_2400x1800.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!49G2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce867601-79e5-4519-9e6d-8ae221c08f0b_2400x1800.webp" 
width="1456" height="1092" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ce867601-79e5-4519-9e6d-8ae221c08f0b_2400x1800.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1092,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:69400,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.interconnects.ai/i/191317183?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce867601-79e5-4519-9e6d-8ae221c08f0b_2400x1800.webp&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!49G2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce867601-79e5-4519-9e6d-8ae221c08f0b_2400x1800.webp 424w, https://substackcdn.com/image/fetch/$s_!49G2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce867601-79e5-4519-9e6d-8ae221c08f0b_2400x1800.webp 848w, https://substackcdn.com/image/fetch/$s_!49G2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce867601-79e5-4519-9e6d-8ae221c08f0b_2400x1800.webp 1272w, https://substackcdn.com/image/fetch/$s_!49G2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce867601-79e5-4519-9e6d-8ae221c08f0b_2400x1800.webp 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The final benefit of GPT 5.4, and OpenAI&#8217;s agentic models in general for that matter, is much better context management. In using them regularly now, I feel like I&#8217;ve never hit the context wall or context anxiety point. The reasoning efficiency I suspected above just lets the model do way more with its initially empty context window. Then, when GPT 5.4 does compact, it&#8217;s been less noticeable.</p><p>The one problem I&#8217;ve been having with both Claude Opus 4.6 and GPT 5.4 is a mild forgetfulness. If you give the models multiple TODOs in a single message outside of planning mode, I find them often dropping some. Sometimes it feels like the models glitch and try to solve a previous problem rather than the recent ones. 
I&#8217;m not sure whether the model or the harness is the exact cause. I like to queue up a few messages while the model is working on something, to refine the task, but currently this tends to be pretty risky except in the simplest use-cases.</p><p>These days I&#8217;ve been using both GPT and Claude extensively, mostly based on my mood, and have been getting more done than ever. Having a GPT 5.4 Pro integration directly with Codex, e.g. like \ultrathink, would be a big differentiator for OpenAI. Those models have been incredible.</p><p>All in, I see GPT 5.4 as an agentic model that brings a ton more simple usability and &#8220;agentness&#8221; to the very strong software foundation of GPT 5.3 Codex. It&#8217;s a big step, and I&#8217;m unbelievably excited to see which of these two companies releases an update next. On paper, listing the strengths of GPT 5.4 (better top-end coding performance, better speed, better context management, better rate limits), it&#8217;s a testament to how nuanced choosing a model is. I genuinely still <em>enjoy</em> Claude a bit more for ways that&#8217;ll never show up on benchmarks. 
This makes me type <code>claude</code> into my terminal at the start of my day, rather than <code>codex</code>.</p>]]></content:encoded></item><item><title><![CDATA[What comes next with open models]]></title><description><![CDATA[Markets, capabilities, cope, and bewilderment in the industrialization of language models.]]></description><link>https://www.interconnects.ai/p/the-next-phase-of-open-models</link><guid isPermaLink="false">https://www.interconnects.ai/p/the-next-phase-of-open-models</guid><dc:creator><![CDATA[Nathan Lambert]]></dc:creator><pubDate>Mon, 16 Mar 2026 13:00:51 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/07ccf41a-ab0e-4cb6-b24b-234ec18c39a7_3182x1790.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>2025 was the year when a lot of companies started to take open models seriously as a path to influence in the extremely valuable AI ecosystem &#8212; the adoption of a strategy that was massively accelerated downstream of <a href="https://www.interconnects.ai/p/deepseek-r1-recipe-for-o1">DeepSeek R1&#8217;s</a> breakout success. Most of this is being done as a mission of hope, principle, or generosity. </p><p>Very few businesses have a real monetary reason to build open models. Well-cited reasons, such as <a href="https://gwern.net/complement">commoditizing one&#8217;s complements</a> for Meta&#8217;s Llama, are hard to follow up on when the cost of participating well is billions of dollars. Still, AI is in such an early phase of technological development, mostly defined by large-scale industrialization and massive scale-out of infrastructure, that having any sort of influence at the cutting edge of AI is seen as a path to immense potential value. </p><p>Open models are a very fast way to achieve this: you can obtain substantial usage and mindshare with no enterprise agreements or marketing campaigns &#8212; just releasing one good model. 
Many companies in AI have raised a ton of money built on less. </p><p>The hype of open models is simultaneously amplified by the mix of cope, disruptive anticipation, and science fiction that hopes for the world where open models do truly surpass the closed labs. This goal could be an economically catastrophic success for the AI ecosystem, where profits and revenue plummet but the broader balance of <a href="https://www.interconnects.ai/p/how-anthropic-vs-dow-impacts-open">power and control of AI models</a> is long-term more stable.</p><p>There&#8217;s a small chance open models win in absolute performance, but it would only be on the back of either a true scientific breakthrough that is somehow kept hidden from the leading labs or the models truly hitting a wall in performance. Both are possible, but very unlikely. </p><p>It is important to remind yourself that there have been no walls in progress to date, and all the top AI researchers we discuss this with constantly explain the low-hanging fruit they still see. It may not be recursive self-improvement to the singularity (more on that in a separate post), but large technology companies are on a direct path to building definitionally transformative tools. They are coming.</p><h2>The balance of power in open vs. 
closed models</h2><p>The fair assessment of the open-closed gap is that <a href="https://www.interconnects.ai/p/open-models-in-perpetual-catch-up">open models have always been 6-18 months behind the best closed models</a>. It is a remarkable testament to the open labs, operating on far smaller budgets, that this has stayed so stable. Many top analysts, myself included, are bewildered that the gap isn&#8217;t bigger. Distillation helps a bit in quality, and benchmaxing more than closed labs helps perceptions, but the progress of the leading open models is flat-out remarkable. </p><p>The reality is that the open-closed model gap is more likely to grow than shrink. The top few labs are improving as fast as ever, <a href="https://www.interconnects.ai/p/opus-46-vs-codex-53">releasing many great new models</a>, with more on the docket. Many of the most impressive frontier model improvements relative to their open counterparts feel totally unmeasured on public benchmarks. </p><p>In a new era of coding agents, the popular method to &#8220;copy&#8221; performance from closed models, <a href="https://www.interconnects.ai/p/how-much-does-distillation-really">distillation</a>, requires more creativity to extract performance &#8212; previously, you could use the entire completion from the model to train your student, but now the most important part is the complex RL environments and the prompts to place your agents in them. These are much easier to hide, all while the Chinese labs leading in open models complain about computational restrictions. </p><p>As the leading AI models move into longer-horizon and more specialized tasks, mediated by complex and expensive gate-keepers in the U.S. economy (e.g. legal or healthcare systems), I expect large gaps in performance to appear. Coding can mostly be &#8220;solved&#8221; with careful data processes, scraping GitHub, and clever environments. 
The economies of scale and foci of training are moving into domains that are not on the public web, so they are far harder to replicate than early language models. </p><p>Developing frontier AI models today is more defined by stacking medium and small wins, unlocked by infrastructure, across time. This rewards organizations that can expand scope while maintaining quality, which is extremely expensive.</p><p>All of these dynamics together create a business landscape for open models that is hard to parse. Through 2026, closed models are going to take leaps and bounds in performance in directions that open models are unlikely to follow. This sets us up for a world where we need to consider, fund, use, and discuss open models differently. This piece lays out how open models are changing. It is a future that&#8217;ll be clearly defined by three classes of models.</p><ol><li><p><strong>True (closed) frontier models.</strong> These will drive the strongest knowledge work and coding agents. They will be truly remarkable tools that force us to reconsider our relationship to work.</p></li><li><p><strong>Open frontier models.</strong> These will be the best open-weight, large models that are attempting to compete in the same directions as above. There will be plenty of use-cases where they fall short relative to the best models, but countless use-cases, even ones as valuable as some subsets of coding, where they work remarkably well. <br><br>The AI ecosystem will still take years to understand what it means to have intelligence of this magnitude served in private, at the marginal cost of electricity for individuals, as assistants, coaches, companions, and more. OpenClaw provided a glimpse behind the mirror that will expand and grow. 
The class of models around GPT-OSS 120B, <a href="https://developer.nvidia.com/blog/introducing-nemotron-3-super-an-open-hybrid-mamba-transformer-moe-for-agentic-reasoning/">Nvidia Nemotron 3 Super</a>, or <a href="https://huggingface.co/MiniMaxAI/MiniMax-M2.5">MiniMax M2.5</a> strike the balance of performance to price that can work for local models.</p></li><li><p><strong>Open, small models as distributed intelligence</strong>. The <em>most successful</em> open models will be complementary tools to closed agents. This is a path for open models to complement and accelerate the frontier of progress.<br><br>AI is slotting in to automate many repetitive, niche tasks across the technology economy. There&#8217;s huge pressure to shift these tasks off of the best closed models &#8212; which frankly are still better at most things, across my conversations with businesses trying to build with open models &#8212; to small, open models that can be 10X faster and 100X cheaper. There aren&#8217;t really people building data and fine-tuning engines for economically viable tasks on the smallest models possible. <br><br>These models need to be almost brain-numbingly boring and specific. In a world dominated by coding agents, I want to build open models that Claude Code is <em>desperate</em> to use as a tool, letting its sub-agents unlock entirely new areas of work. This is possible, but remarkably under-explored. Small models from the likes of Qwen and co. are still marketed on general-task benchmarks. The hype of &#8220;open models catching the frontier&#8221; distracts the world from this very large area of demand.<br><br>This is the sort of model that moves open models from just a few, crucial static weights to more of an ecosystem. It requires creativity and a new approach. 
The goal of this piece is to illustrate why and how to build these, with added context on where open models stand today.</p></li></ol><p>All three of these model classes hint at different ways to use agents. It is absolutely definitional to how AI is going to be built going forward that they&#8217;re not just model weights, but rather systems that <a href="https://www.interconnects.ai/p/thinking-searching-and-acting">think, search, and act</a>. The weights only define one portion of those abilities.</p><h2>Open weights as part of an AI system</h2><p>To start, consider the most impactful and impressive things that language models can do <em>without</em> a suite of tools at their side. When was the last time that you were blown away by something that was <em>just</em> autoregressive token outputs? Unless you&#8217;re doing a substantial amount of work on mathematical proofs or competition code, it seems like that situation has changed little since GPT-4&#8217;s release in 2023. The AI systems we use today are about far, far more than weights.</p><p>In this world, closed models have a clear advantage. 
Closed models get to vertically integrate everything from the chips they run on to the inference software, the weights, the tools, and the user interface. Open models, on the other hand, need to work on every inference setup, with many tools, and in many use-cases. This vertical integration is best expressed today in the joy of using Claude Code with Opus 4.6 or OpenAI&#8217;s Codex with GPT 5.4. Open models haven&#8217;t passed this point. Some are starting to focus on specific interfaces, e.g. OpenCode, but there&#8217;s an inherent tension in making an open model work only in your blessed product roadmap.</p><p>At the same time, this change could point toward more of the latest AI systems being open! If you can do less with the weights alone, maybe more labs will release them.</p><p>The way to think about AI systems today is as a mix of weights, tools, and harnesses. The weights portion is familiar. The tools are the deeply integrated environments the models act in <em>at deployment time</em> &#8212; best typified by search and code sandboxes &#8212; and the harness is how these two fit together with a product that the user sees.</p><p>In this world, there are two things to consider: 1) Is there an equivalent, open system to the closed products that people are using today &#8212; I mean truly equivalent, where every level of the stack can be modified and controlled (more on this later)? And 2) How does this systems-level view impact different future decisions in the open ecosystem?</p><h2>Still looking for open model business strategies</h2><p>To understand how the business and practicality of open models will evolve, let me take a tour back in time to foundational writing on the role of open-source in modern technology companies. 
The first is a Google blog post, <a href="https://googleblog.blogspot.com/2009/12/meaning-of-open.html">The Meaning of Open</a>, which began as an internal memo by Jonathan Rosenberg, sparked an intense internal debate, and later became public. To start, here&#8217;s a basic assessment of how open systems can work:</p><blockquote><p>Open systems have the potential to spawn industries. They harness the intellect of the general population and spur businesses to compete, innovate, and win based on the merits of their products and not just the brilliance of their business tactics.</p></blockquote><p>I&#8217;ve long believed that the company that will benefit most from the ecosystem of open models is the one that understands it best. This entails being deeply involved with open research and experimentation in how to use the models. So far, most of the open model company business models are not this. Rosenberg expands on this in his 2009 post, comparing the dynamics of open systems to closed products:</p><blockquote><p>[Open systems] are competitive and far more dynamic. In an open system, a competitive advantage doesn&#8217;t derive from locking in customers, but rather from understanding the fast-moving system better than anyone else and using that knowledge to generate better, more innovative products. The successful company in an open system is both a fast innovator and a thought leader; the brand value of thought leadership attracts customers and then fast innovation keeps them. 
This isn&#8217;t easy &#8212; far from it &#8212; but fast companies have nothing to fear, and when they are successful they can generate great shareholder value.</p></blockquote><p>We&#8217;ve known for some time that open weight models are not actually enough to constitute a product &#8212; models are a product in the sense that they have tools and harnesses &#8212; so we don&#8217;t actually have fully open systems; we have systems that are partially open and partially closed, which makes moats messy. vLLM and a model like GLM 5 are pieces of a system, but it still takes more to deploy them: expensive private GPUs and some tools with local business data.</p><p>It may turn out that AI is too complex and expensive to have an open system analogous to previous generations of technology. If there were a fully open system, it would win by default, as many historical generations of technology have shown us. This fully open analog does not yet exist, so we have constant debates on the role of open-source AI.</p><p>Bill Gurley recounts how Google&#8217;s free products have exemplified the open or free strategies across technology. Gurley <a href="https://abovethecrowd.com/2011/03/24/freight-train-that-is-android/">wrote</a> on the open-source operating system, Android, and the free browser, Chrome, in 2011:</p><blockquote><p>So here is the kicker. Android, as well as Chrome and Chrome OS for that matter, are not &#8220;products&#8221; in the classic business sense. They have no plan to become their own &#8220;economic castles.&#8221; Rather they are very expensive and very aggressive &#8220;moats,&#8221; funded by the height and magnitude of Google&#8217;s castle. Google&#8217;s aim is defensive not offensive. They are not trying to make a profit on Android or Chrome. 
They want to take any layer that lives between themselves and the consumer and make it free (or even <a href="https://abovethecrowd.com/2009/10/29/google-redefines-disruption-the-%E2%80%9Cless-than-free%E2%80%9D-business-model/">less than free</a>).</p><p>Because these layers are basically software products with no variable costs, this is a very viable defensive strategy. In essence, they are not just building a moat; Google is also scorching the earth for 250 miles around the outside of the castle to ensure no one can approach it.</p></blockquote><p>In the same post, Gurley reflects on the limits of Google&#8217;s openness:</p><blockquote><p>In this open manifesto, Jonathan opines over and over again that open systems unquestionably result in the very best solutions for end customers. That is with one exception. &#8220;In many cases, most notably our search and ads products, opening up the code would not contribute to these goals and would actually hurt users.&#8221; As Rodney Dangerfield said in Caddyshack, &#8220;It looks good on you, though.&#8221;</p></blockquote><p>Essentially, Google open-sourced so much, and in fact <em>paid</em> people to use its products (e.g. paying phone makers to use Android), to keep the funnel leading to the search profit center. This is the virtuous loop that the search business still funds to this day.</p><p>AI is still nothing like this, but signs of change are emerging. The default belief on the value of models to these companies is that<em> the model is the product</em>. This is obvious with products like hosted APIs, where releasing the model weights would be business suicide, but it is softening as interfaces like Claude Code, Codex, Cursor, etc. become vastly popular. It could be a path to more openness, at least in parts of the stack. We can see this with the coding plans offered by Moonshot and Z.ai &#8212; where demand is very high for the businesses, even though the model is open. 
Most people will just use the cheap interface with inference, instead of figuring out how to use the model themselves (as long as the business is mostly consumer or per-head services).</p><p>None of this leaves me optimistic about companies becoming more open in the coming years. I&#8217;d still expect the opposite. <a href="https://www.interconnects.ai/p/why-nvidia-builds-open-models-with">Nvidia has the one great reason to be open</a> &#8212; to sell more GPUs to people building on open models and to understand what they need to build next, but there&#8217;s no one else obvious on this list. Until there are more specific economic reasons to build open models, the companies building them at the frontier will have fewer resources to spend on the models and will face a consolidation to the best few.</p><p>In the face of consolidation at the open frontier, the investment in the models <em>should</em> shift to areas where the models can have more differentiated upside relative to the best closed frontier models.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.interconnects.ai/p/the-next-phase-of-open-models/comments&quot;,&quot;text&quot;:&quot;Leave a comment&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.interconnects.ai/p/the-next-phase-of-open-models/comments"><span>Leave a comment</span></a></p><h2>Open models that are specific, cheap, fast, and ubiquitous</h2><p>There&#8217;s too much obsession with having the best companies building open models try to compete at the frontier. There&#8217;s a vastly underserved market of enterprises that want cheap, reliable models for repetitive use cases in their systems. Picture this: one small model with a series of LoRA adapters that specialize the model to internal skills. 
This can be deployed very cheaply as tools and a complement to the frontier closed models that are orchestrating agents. </p><p>Every task that a frontier agentic model does tens to hundreds of times can potentially be outsourced to a small model. There are ancillary benefits to this, e.g. the privacy of a local model reading your files and summarizing them for Claude, but almost no one is pushing hard in this direction. The leading model family of capable, customizable small models to date is Qwen, but that&#8217;s now <a href="https://interconnect.substack.com/p/alibabas-ai-drama">shrouded in uncertainty with the departures of key personnel</a>. Gemma, Phi, Olmo, etc. are all major steps down in quality, and therefore in potential for modification.</p><p>There are a few obvious examples of why this can scale. There was a recent <a href="https://x.com/awnihannun/status/2030024849570288080">thread</a> and <a href="https://x.com/N8Programs/status/2030386417566613707">discussion</a> on how the new Qwen 3.5 4B model arguably bests the original ChatGPT model. On the research side, there are already <a href="https://arxiv.org/abs/2601.20789">recipes</a> for finetuning open models on specific codebases to match the performance of much bigger models. <a href="https://moondream.ai/">Moondream.ai</a> is a startup founded by a friend of mine, Vik, who builds some of the best small multimodal models on a tiny budget &#8212; they compete with Qwen and Llama on real-world tasks. This is the tip of the iceberg. </p><p>Intelligence compression hasn&#8217;t been explored with nearly as much depth (or resources) because it is less exciting than keeping track of the progress of the best few models. Investigating these areas is part of the standard technological diffusion process, which is slow, and it&#8217;s why we&#8217;re still early in understanding how people will build with AI. My contention is that too many people building open models are slightly deluded in their perception of their competitiveness. 
The best few models will win on general capabilities, but there are still plenty of underserved niches elsewhere.</p><p>Taking this to the next level involves releasing open models that are scoped to be truly excellent at 1-3 tasks, as I hinted at the beginning of this piece. Too many people try to compete with Qwen and show that their small model does great on frontier AI benchmarks. The right benchmark here is savings in compute and time.</p><p>It&#8217;ll take years for this transition to slowly become reality. Part of why I am so excited about it is that it pushes innovation on open models toward diversity, specialization, and curiosity, rather than the standard &#8220;one model to rule them all&#8221; framing that the frontier models presume.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.interconnects.ai/p/the-next-phase-of-open-models?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.interconnects.ai/p/the-next-phase-of-open-models?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><h2>Models vs. ecosystems.<br>Consolidation vs. creativity.</h2><p>So long as the open source ecosystem for AI is defined by a bunch of model providers chasing after the closed labs, it will largely lose. It will face pain in funding and in substantive adoption. The same consolidation that will come for closed AI companies will come for open model builders &#8212; likely even sooner. </p><p>Open systems at their best allow many people to participate and many approaches to flourish.</p><p>The world of open models needs to be more of an ecosystem. 
I&#8217;ve discussed in the past how <a href="https://www.interconnects.ai/p/on-chinas-open-source-ai-trajectory">China is </a><em><a href="https://www.interconnects.ai/p/on-chinas-open-source-ai-trajectory">closer</a></em><a href="https://www.interconnects.ai/p/on-chinas-open-source-ai-trajectory"> to this type of environment</a> by having a variety of companies, but the variety in approaches is still too low.</p><p>Ecosystems are self-reinforcing, whereas individual models are static artifacts in time. Ecosystems showcase clear, constant opportunities for what&#8217;s next, with value propositions that grow over time. </p><p>The path forward for open models is to solve different problems than the frontier labs, to find places where open models are effectively free alternatives, and to show ways of using specialized models that the closed labs cannot offer. The world of open models needs to embrace creativity before building powerful AI systems grows too expensive and prices out many of the prized open labs of today.</p>]]></content:encoded></item><item><title><![CDATA[Dean Ball on open models and government control ]]></title><description><![CDATA[Subtle precedents on the future of open models set by the unfolding Anthropic v. Department of War case.]]></description><link>https://www.interconnects.ai/p/how-anthropic-vs-dow-impacts-open</link><guid isPermaLink="false">https://www.interconnects.ai/p/how-anthropic-vs-dow-impacts-open</guid><dc:creator><![CDATA[Nathan Lambert]]></dc:creator><pubDate>Fri, 06 Mar 2026 14:03:27 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/190039178/7266d8486e4b1ea88e5e56dd28dfe2de.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Watching history unfold between Anthropic and the Department of War (DoW), it has been obvious to me that this could be a major turning point in perspectives on open models, but one whose significance will take years to become clear. 
As AI becomes more powerful, existing power structures will grapple with their roles relative to the companies building it. Some in the open model community frame this as &#8220;<a href="https://x.com/ClementDelangue/status/2027196053989052608">not your weights, not your brain</a>,&#8221; but it points to a much bigger problem when governments realize this. </p><p>If AI is the most powerful technology, why would any global entity let a single U.S. company (or government) control their relationship to it?</p><p>I got <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Dean W. Ball&quot;,&quot;id&quot;:5925551,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!mLaj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49371abf-2579-47be-8114-3e0ca580af8b_1024x1024.png&quot;,&quot;uuid&quot;:&quot;d7aada77-d1d0-4b6b-a494-84ada0eb1b13&quot;}" data-component-name="MentionToDOM"></span> of the great <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Hyperdimensional&quot;,&quot;id&quot;:2244049,&quot;type&quot;:&quot;pub&quot;,&quot;url&quot;:&quot;https://open.substack.com/pub/hyperdimensional&quot;,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8f70956b-24b6-432b-81c4-dcfa4095ead7_1024x1024.png&quot;,&quot;uuid&quot;:&quot;4290c33f-dceb-4f81-bd46-c8be97e8385e&quot;}" data-component-name="MentionToDOM"></span> newsletter onto the <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;SAIL Media&quot;,&quot;id&quot;:392441355,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/26b97e53-7767-4ade-b482-6b4265aef09b_586x586.png&quot;,&quot;uuid&quot;:&quot;9f7f0186-6c7f-48f8-bae4-809e7135a64e&quot;}" data-component-name="MentionToDOM"></span> weekly Substack live to discuss this. 
In the end, we agree that the recent actions by the DoW &#8212; especially the designation of Anthropic as a supply chain risk (which Dean and I both vehemently disagree with) &#8212; point to open models being the 5-10 year stable equilibrium for power centers. </p><p>This discussion covers:</p><ul><li><p>Why do open models avoid some of the power struggles we saw play out last week?</p></li><li><p>How do we bridge short-term headwinds for open models toward long-term strength?</p></li><li><p>The general balance of capabilities between open and closed models.</p></li></ul><p>Personally, I feel the need to build open models more than ever and am happy to see more constituencies wake up to it. What I don&#8217;t know is <em>how</em> to fund and organize that. <a href="https://gwern.net/complement">Commoditizing one&#8217;s complements</a> is a valid strategy, but it starts to break down when AI models cost closer to a trillion dollars than a hundred million. With open models being very hard to monetize, there&#8217;s a bumpy road ahead for figuring out <em>who</em> builds these models in the face of real business growth elsewhere in the AI stack.</p><p>Enjoy, and please share any feedback you have on this tricky topic! </p><p>Listen on <a href="https://podcasts.apple.com/us/podcast/interconnects-audio/id1719552353">Apple Podcasts</a>, <a href="https://open.spotify.com/show/6XNzfJULeVxR7SneeesDUs">Spotify</a>, and <a href="https://www.interconnects.ai/podcast">wherever you get your podcasts</a>. 
For other Interconnects interviews, <a href="https://www.interconnects.ai/t/interviews">go here</a>.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.interconnects.ai/p/how-anthropic-vs-dow-impacts-open?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.interconnects.ai/p/how-anthropic-vs-dow-impacts-open?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><h3>Chapters</h3><ul><li><p>00:00 Intro: is the Anthropic supply chain risk good or bad for open models?</p></li><li><p>04:03 Funding open models and the widening frontier gap</p></li><li><p>12:33 Sovereign AI and global demand for alternatives</p></li><li><p>20:55 Open model ecosystem: Qwen, usability, and short-term outlook</p></li><li><p>28:20 Government power, nationalization risk, and financializing compute</p></li></ul><h3>Transcript</h3><p><strong>00:00:00 Nathan Lambert:</strong> Okay. We are live and people will start joining. I&#8217;m very happy to catch up with Dean. I think as we were setting this up, the news has been breaking that the official supply chain risk designation was filed. This is not a live reaction to that. If we get any really, really interesting news, we&#8217;ll talk about it. I think one of the undercurrents I&#8217;ve felt this week, where everything happened, is gonna touch on open models, but there&#8217;s not an obvious angle. I think I will frame this to Dean to start, which is how does-- Like, there&#8217;s two sides of open models. 
One is that there&#8217;s the kind of cliche like, not my weights, not your weights, not your mind, where like somebody could take it away if not an open model, which people are boosting like, &#8220;Oh, like Anthropic&#8217;s gonna take away their intelligence.&#8221; But the other side is people worried about open models existing that the Department of War can just take and use for any purpose that it wants. And I feel like both of these are a little cliche. And the core question is like, is this type of event where more control is coming towards AI and more multi-party interest, like is that gonna be good or bad for the open weight model ecosystem?</p><p><strong>00:01:12 Dean Ball:</strong> My guess is that in the long run, this is probably profoundly good for open weight AI. And like the whole reason I got in, like, so I became interested in frontier AI governance. I did something totally different with my time before. I wrote about different kinds of policy and studied different kinds of policy. And the reason I got into this was because it immediately occurred to me that the government was gonna... I was like, okay, let&#8217;s assume we&#8217;re building super intelligence soon or whatever, like very advanced AI that seems like really important and powerful. That&#8217;s gonna be something that I depend on, like for my day-to-day life. I&#8217;m gonna need it for all kinds of things. It&#8217;s gonna profoundly implicate my freedom of expression as an American and my exercise of my liberty and all that. And yet it&#8217;s also gonna profoundly implicate national security. And so the government&#8217;s gonna have its hands all over it, and they also might not like me using it because I might use it, and others might use it to challenge the status quo in various ways, to challenge the existing power structures which the government is a part of. 
So we have a political problem on our hands here, in my view.</p><p><strong>00:02:36 Dean Ball:</strong> It immediately occurred to me that we&#8217;re gonna have this huge problem of like, this is gonna be a conflict because this is something that&#8217;s gonna enormously implicate American speech and liberty, and also it&#8217;s gonna have legitimate national security issues, and also the government&#8217;s gonna want it because of bad power-seeking reasons. And so that&#8217;s always a part of the picture. And my view was this is just a fight that&#8217;s gonna play out over the coming decades, and I wanna be a part of this fight. But number two, in that fight, you have to have an insurance policy, and open weight is the insurance policy. Open weight is the way we can always say yes, but we can build the open ecosystem. We can do that. And so I think in the fullness of time, this is gonna be beneficial, but the problem is there&#8217;s a lot of coordination and economic problems that have to be solved here. It&#8217;s not just a matter of hoping that Google and Meta or whomever else, or the Chinese companies, by virtue, out of the goodness of their hearts continue to open-source things. That&#8217;s not scalable. There has to be a reason to do it. So what are the institutional dynamics open weight gonna look like in the long term? I don&#8217;t really know, but it feels deeply under theorized.</p><p><strong>00:04:03 Nathan Lambert:</strong> I think it&#8217;s hard to fund is the thing. I mean, we saw Qwen had their turmoil this week, which is timely, and I&#8217;m not that surprised because the stakes for these companies is so high, and they all are trying to make sure their companies win in it. And people will say like, &#8220;Oh, Meta should commoditize their complements and release open models.&#8221; But no one&#8217;s ever commoditized their complements with something that costs a trillion dollars to make. Like, that&#8217;s a line item. 
Like, is Apple gonna commoditize... Apple commoditizing their complement would be them doing the... They could spend just as much as all the other tech companies are on CapEx and spend hundreds of billions of dollars, but they&#8217;re choosing not to. And I just like, I agree that long term it should be better, but if we never bridge that gap, does it actually materialize? Like, the crank is being turned of these models getting better and better. GPT 5.4 released today, excited to try it.</p><p><strong>00:05:02 Nathan Lambert:</strong> But like, where does it go? Like, what I&#8217;m working on is totally falling behind the frontier. We&#8217;re the foundation of research, but it&#8217;s like I see it already slipping.</p><p><strong>00:05:13 Dean Ball:</strong> So I kinda think, yeah, I mean, look, I think it&#8217;s gonna get bad in the short term, it&#8217;s gonna be bleak, right? There&#8217;s just no doubt about that in my view. Because we&#8217;re in this period, like I think the pace of frontier progress is gonna continue. My own view is that, like, just &#8216;cause I peer in and use the open weight Chinese models on a fairly regular basis, and I kinda just feel as though the gap has widened between the US frontier and the open frontier. Unfortunately, it&#8217;s so sad that US frontier and open frontier are increasingly distinct things. But I do feel as though that probably is true. And that&#8217;s probably gonna continue because in the next, like, in the early stages of a new technology, you would expect for the vertically integrated players to be the ones who do the best. And over time, the modular players can win, and part of that is &#8216;cause eventually you do get to good enough, right? Like, eventually, I think most people think the iPhone is good enough now. 
There was a time when every year the iPhone upgrade was like, &#8220;Oh my God, this is so much better.&#8221; Intelligence is maybe different, but maybe not for a lot of things.</p><p><strong>00:06:37 Nathan Lambert:</strong> Well, like, there&#8217;s no iPhone that you can buy from anyone. Nothing you can buy from anyone but Apple is nearly as good. That&#8217;s the concern. It&#8217;s like, is it gonna be Anthropic that like, yeah, it stopped getting better, but you can&#8217;t rebuild it. Like, you can&#8217;t make the open source version.</p><p><strong>00:06:51 Nathan Lambert:</strong> I also think I had a later question, which is like, the weights are so much less of a concern for me. So like, somebody dropping a two-trillion-parameter model that&#8217;s open weights and way better than anything else that somebody has built and released in the open, it almost doesn&#8217;t matter if you don&#8217;t understand the harness and the tools and the setup you need to make it into a Claude-like system. Like, you need what, eighty nodes of H100s that cost a hundred thousand dollars a day to run and expertise to make it a system. It&#8217;s like the shifting away from weights is also happening. I don&#8217;t think it&#8217;s happening in this open versus closed ecosystem at the surface level of the discussion. So that&#8217;s why I&#8217;m just like, I don&#8217;t know if it&#8217;s gonna exist. The thing that I could see happening is that open weights models are niche, and they help these Claude-like models, but there&#8217;s not an alternative in that universe. So it&#8217;s like, is the government capable of actually making this alternative exist? I don&#8217;t know. 
Like, I don&#8217;t know if you can Manhattan Project this, and I wouldn&#8217;t advocate for it.</p><p><strong>00:07:53 Dean Ball:</strong> I actually think about it from the opposite perspective, because I think that what happens if the government follows through on what they&#8217;ve threatened with Anthropic, which is to make it so that basically any military contractor cannot have any commercial relations with Anthropic, which means NVIDIA can&#8217;t sell GPUs to them for anything. Amazon can&#8217;t sell cloud services to them. Amazon and NVIDIA also can&#8217;t be invested in them, by the way, if you take any commercial relations at its face value. Now, that&#8217;s not a power the government actually has, but nonetheless, if this harassment campaign continues, I think what it probably does... You know, I spend a lot of time in international policy, dealing, talking to foreign governments and civil society in foreign countries, and they already have major trust issues with respect to the US closed source models because they think the US government is gonna come in and disable the models. Like, the American president will get mad at Brazil, say, and in addition to putting tariffs or sanctions, the US president will say, &#8220;Yeah, we&#8217;re also gonna turn off all your public services that are dependent upon American closed source models.&#8221; Right? So people view that as this profound threat, and people are legitimately scared of that in other countries.</p><p><strong>00:10:00 Dean Ball:</strong> I think this turns that fear up another meaningful degree, and probably not incorrectly, by the way, probably rightfully so. And so I kinda look at this and I think, well, now a lot of American companies might also have that concern, and so you certainly have a demand side of people who are gonna be like, &#8220;I get this. It is a risk to use anything where I have a commercial relationship. 
&#8216;Cause once I have a commercial relationship, the government can regulate that. Can I find some way of getting out of it?&#8221; I think there&#8217;s gonna be demand for that. Whether or not that demand produces supply, I think will depend on... It might just not be possible, that&#8217;s true. But I think you&#8217;ve never had a more favorable demand picture, and I suspect that on the margin, this probably will favor open in the longer run.</p><p><strong>00:10:44 Nathan Lambert:</strong> Yeah. So there&#8217;s a few ways that I think about this. I have this thing, like ATOM Project and all this other stuff I do, and it&#8217;s like, how do I meaningfully advocate for this? I think there&#8217;s something, like I work at AI2, and AI2 has budgets of order of a hundred million dollars and can train decent models. But if I wanted to redo an AI2, like my method for getting that type of money, it&#8217;s mostly gonna be like befriending a billionaire. And it seems like philanthropy dice roll in the near term is a way to get it. But then, like, maybe it really is some long slog of a multi-industrial consortium that takes a couple years to get off the ground and slowly, like, Google&#8217;s, or all these Netflix and all these five hundred billion dollar smaller companies are gonna give millions of dollars to have somebody else do it because they can&#8217;t get the billion dollars themselves, but they know they need it to exist.</p><p><strong>00:11:31 Dean Ball:</strong> And sovereign wealth funds. Right. Sovereign wealth funds everywhere can do that, right? There&#8217;s trillions of dollars in sovereign wealth. There&#8217;s pension funds, public employee pension funds. A lot of people can chip into this and it&#8217;s possible. This is like, <a href="https://en.wikipedia.org/wiki/Yann_LeCun">Yann LeCun</a> thinks this is the inevitable outcome. 
He thinks that the future is gonna be that some sort of global consortium gets together and builds this, because no one country is gonna be able to own it, because it&#8217;s gonna be too important. I&#8217;ve always kinda doubted that, and I&#8217;ve always thought that that outcome is probably a bad outcome for the world, honestly.</p><p><strong>00:12:06 Nathan Lambert:</strong> That&#8217;s a bad outcome for how good the AI is.</p><p><strong>00:12:09 Dean Ball:</strong> That&#8217;s correct. It&#8217;s a socialist outcome, you know? It&#8217;s not communism, but it is democratic socialism, and I&#8217;m not a democratic socialist, so I&#8217;m not a super big fan of that. But at the same time, I have to be honest that I kinda think that this probably does increase the odds of that precise outcome coming to bear.</p><p><strong>00:12:33 Nathan Lambert:</strong> I think something that comes sooner is that a lot of these super wealthy countries are gonna realize they can have real... Like, they can do some sort of sovereign AI and make some sort of noise, particularly starting with open models. I think there&#8217;s the Institute for Foundation Models, which is based in the UAE university system. Like, that&#8217;s--</p><p><strong>00:12:53 Dean Ball:</strong> That&#8217;s very UAE-coded, yeah.</p><p><strong>00:12:55 Nathan Lambert:</strong> They&#8217;ve been playing that for years, and they can keep doing this. Their models are gonna be pretty good, and I think there&#8217;s gonna be more people that do this. There&#8217;s the Swiss initiative in the EU, which is on one hand doing a good job, and on the other hand plagued by the most obvious European limitations of talent cycling and consortium life. I think these things are gonna become more of a thing in the next year, but I don&#8217;t know exactly how they impact the... They don&#8217;t impact the frontier of AI, but maybe they&#8217;re just like how the geopolitics and power of AI evolves. 
And I for some reason feel like open models need to be the thing that they&#8217;re gonna do because if they have a closed model that&#8217;s not as good, it doesn&#8217;t really give them any sort of power. But I don&#8217;t have a good enough world view for what that actually does, and if there&#8217;s more EU models, if India actually has their act together and trains a solid model. I don&#8217;t know what that does, but I feel like it&#8217;s probably gonna happen.</p><p><strong>00:13:54 Dean Ball:</strong> Yeah. I mean, it&#8217;s really super interesting &#8216;cause I think the other thing-- that will be inherently... I mean, it will be a Linux compared to a macOS, you know? It will not be as good of an experience for people. But then it becomes strange. Like, I don&#8217;t think macOS is as appealing of a thing if it&#8217;s viewed to be owned by the US government, right? And in fact, part of the reason I think that Apple is able to make its case quite credibly to consumers and businesses is they have resisted US government pressure to turn things over before. People might remember about a decade ago, there was this shooter in San Bernardino, California, and the FBI tried to force Apple to release iPhone data, and Apple said, &#8220;No, we&#8217;re not gonna expose this information.&#8221; Now, I think the FBI eventually just hacked it anyway, but that&#8217;s a separate issue. It&#8217;s a matter of principle here.</p><p><strong>00:15:01 Dean Ball:</strong> So yeah, I think it&#8217;s an interesting question: do we expect for the gap between the open frontier and the American closed frontier to widen in the near future, especially just because of how much compute they&#8217;re gonna have?</p><p><strong>00:15:30 Nathan Lambert:</strong> A hundred percent. And data and talent. Like, a hundred percent. It&#8217;s happening.</p><p><strong>00:15:34 Dean Ball:</strong> Data, talent. And it&#8217;s compounding, right? I mean, this has always been my view. 
And how much, I&#8217;m not sure, but I think it could be quite significant because these things are compounding benefits. And so if you expect them to just continue compounding, then all of a sudden it gets pretty bleak pretty quickly, would be my fear.</p><p><strong>00:16:00 Nathan Lambert:</strong> One of the... I mean, what&#8217;s your take on this? Why has it not compounded so much faster? Like, I feel like these three companies are spending, I don&#8217;t know, 10X what the Chinese labs are spending, and you only get like a little bit better model. Like, I believed so full-heartedly that Claude and ChatGPT and all these models are much better, and I expect them to become better by increasing margin, but it&#8217;s still confusing why they&#8217;re not already more ahead.</p><p><strong>00:16:29 Dean Ball:</strong> I go back and forth on this. Sometimes I think they are that ahead, and it&#8217;s just difficult to show up in benchmarks for the obvious reasons that benchmarks get chased. And like, I do feel that with the coding agents and with certain use cases, I do just feel like, wow, the American frontier is just way ahead, profoundly ahead of the Chinese frontier there. But there&#8217;s a lot of other things where you do kinda saturate how good you can be. I suspect that a very large fraction of AI usage is essentially glorified Google search. Even though I don&#8217;t think AI is glorified Google search, I suspect that a lot of what people use it for is that, at the consumer level. And it isn&#8217;t obvious to me how much better you can get at things like that. But my guess would be that over the next five years, I would guess the American labs really take off, in part because of compute, data, internal deployments for recursive self-improvement style stuff. And also, it&#8217;s amazing how we talk about that as just a normal thing now.</p><p><strong>00:18:05 Nathan Lambert:</strong> I think there will be a ceiling on it. 
Like, they&#8217;re gonna get a ton of improvement-- The gains are insane. It&#8217;s like, personally, at my job, I&#8217;ve been a lot of a research manager and just chasing shit down to get a model out the door. But now I can take on hard engineering tasks because I&#8217;m like, &#8220;Okay, might as well do this at the same time.&#8221; Like, going from zero to a hundred software engineers at anyone&#8217;s fingertips is worth a lot in terms of exploration. But the next, like, from a hundred to ten thousand is like, people can mess that up type thing. But that&#8217;s a huge gain.</p><p><strong>00:18:37 Dean Ball:</strong> I kind of agree. I think there&#8217;ll be a sigmoid there too. But then the other thing that will happen is, like, what I sort of wonder is will the AI companies, will the current model vendors, will they eventually become more like true infrastructure companies where what they actually do is they have models that design their own chips and models that design their own data centers and models that design their own successors. And so it&#8217;s this hugely vertically integrated thing, and what you&#8217;re really getting access to is not just the model itself, but you&#8217;re getting access to this highly optimized hardware, physical world infrastructure. And again, that&#8217;s kind of already the case, but does that become even more the case? And then that&#8217;s truly insurmountable for any open player. That&#8217;s definitionally insurmountable for an open player, and that becomes scary too. But again, this is why I&#8217;ve always felt so good about the position of the US closed source labs. This is why I&#8217;ve always been pretty bullish on them and have my concerns about open.</p><p><strong>00:20:07 Dean Ball:</strong> But to the extent the US government makes it impossible to trust closed source models, you do provide an advantage to open there. You&#8217;re giving a shot in the arm. 
If you like open source, you should hope that the supply chain risk designation against Anthropic is quite broad.</p><p><strong>00:20:09 Nathan Lambert:</strong> It&#8217;s a rough thing to hope for.</p><p><strong>00:20:09 Dean Ball:</strong> I mean, you shouldn&#8217;t actually hope for it, but I just mean, like, if that&#8217;s the only thing you care about in the world is open source, then--</p><p><strong>00:20:17 Nathan Lambert:</strong> I would say that anyone that only cares about open source probably is not thinking through any of these principles. It just gets really bad if you only have-- Like, AI is not gonna be a meaningful lift to the economy, nor sustainable, if everything is open. Like, if models are truly commoditized, things look kind of rough out there.</p><p><strong>00:20:36 Dean Ball:</strong> I think a world where models get commoditized is a really bleak world too, actually. And yeah, this is why I&#8217;m very worried about what the US government is doing. But I think that it helps on the margin, though. It probably helps on the margin in terms of waking people up. That still is my view.</p><p><strong>00:20:55 Nathan Lambert:</strong> I am a little surprised by the Qwen stuff, but I think there&#8217;s-- It&#8217;s like, at some point, I knew there was gonna be a year where a lot of the open model efforts just died because they&#8217;re just too expensive and too similar. But at the same time, having a lot of efforts that are somewhat similar but exploring a lot of the minor permutations in modeling space to figure out what works for people who use open models is actually quite good. I&#8217;m very bearish on the Reflection-style approach, which is build a lab, build an incredible model, drop it, make bank selling it on-prem. Because on-prem, as a business model, is not that distinct from having a closed model. You could sell a closed model on-prem with the right IP controls. 
But then the person who actually wins open does it by trying a whole bunch of tiny different things, understanding what is actually a meaningful differentiator in private data, in certain deployments and whatever, and then really iterating on that with a community. And that&#8217;s why I was like, Qwen is the closest to doing this by being so close to the community, and it&#8217;s so distinct from what a lot of the other labs are betting on.</p><p><strong>00:22:05 Nathan Lambert:</strong> But I see the pressure going away and kind of reducing diversity onto standards, because standards also make inference more efficient. Using open models is really rough. I think some of the best open models have really had rough launches. I think GPT-OSS had a horrible launch in terms of usability and is now one of the most popular models of all time. Qwen 3.5, it&#8217;s like researchers I work with are like, &#8220;Oh, let&#8217;s see if we can do some basic RL baselines on it,&#8221; and all the software stack is kinda broken. It takes a few weeks to get it going. And this is &#8216;cause all the models change differently, and closed labs just have such an advantage there &#8216;cause they should conceivably ship things on day one that work. I mean, don&#8217;t talk about Claude&#8217;s runtime, but that&#8217;s fine.</p><p><strong>00:22:42 Dean Ball:</strong> And don&#8217;t talk about the GPT-5 auto router either. But yeah, no, totally. I think that&#8217;s right.</p><p><strong>00:22:53 Dean Ball:</strong> I think, in the fullness of time, I&#8217;m bullish on open source in the long run, fairly bearish in the next five years. The next five years are gonna matter quite a bit. And there is a lot of cope in both open source world and also... I don&#8217;t really hear it so much in open source world. I think open source world is actually more honest about this. But where the cope is so bad is in global civil society discourse. 
Like, I was in India for the AI Impact Summit recently, and they are just smoking the copium, being like, &#8220;We are gonna do everything on subfrontier open source models, and we&#8217;re just gonna diffuse those, and that&#8217;s all we&#8217;re gonna need in our economy.&#8221; And I just think that&#8217;s, if you&#8217;re India, that&#8217;s really not the bet you wanna make. I understand these are resource-constrained countries. They have a lot of acute constraints that they face, but nonetheless, I think that&#8217;s probably not a good bet.</p><p><strong>00:24:05 Nathan Lambert:</strong> Well, even if those long tail models work, it&#8217;s like how manufacturing has worked, where Apple has put hundreds of billions of dollars into the manufacturing ecosystem in China to get absolutely fine margins and scale. Like, if you really-- these things are gonna be used so much that that fine margin is actually gonna matter a lot, and it is not cheap to get that fine margin. You can&#8217;t just YOLO a DeepSeek V3 and spend five million dollars in compute and be done. It&#8217;s still gonna be expensive for a long time.</p><p><strong>00:24:34 Dean Ball:</strong> Yeah, it requires-- I think the Chinese approach, in the long run, if China&#8217;s gonna continue its strategy and they want to be competitive with the American frontier, they&#8217;re gonna have to fully socialize that, I think. I don&#8217;t think DeepSeek alone is gonna be able to do this, and I don&#8217;t think even Alibaba alone is gonna be able to do this. I think they&#8217;re going to need some sort of collective effort. Especially because of the export controls, the American export controls. They&#8217;re gonna have to centralize compute. They&#8217;re gonna have to centralize all these things, and talent and data and all that.</p><p><strong>00:25:17 Nathan Lambert:</strong> I don&#8217;t see it happening. 
Like, maybe someone gets officially AGI pilled, and I don&#8217;t know that much about China. But the things I know about China, it seems like that would be a big lift, and it would take a lot of time to actually do it. Like, all the companies would have to give up their biggest... All the cloud companies are like tech companies making a lot of money. They would be like, &#8220;We have to give up what?&#8221;</p><p><strong>00:25:42 Dean Ball:</strong> No, it would be a tough sell. Obviously, if the Chinese government decides they want to do it, they absolutely will. But in total, it will be a tough sell. My experience having had diplomatic engagements of many sorts with Chinese government-- and a lot of Chinese tech policy is actually not directly set by the government. It&#8217;s actually more kind of civil society, academia and civil society adjacent to government. Had a lot of conversations with folks like that, and they&#8217;re definitely... It&#8217;s largely not a very AGI-pilled crew. I think AGI-pilled-ness probably has a rough correlation with GDP per capita, and I think China is about where you would expect based on their GDP per capita, maybe a little bit ahead, but not very so. But if they ever do get AGI pilled, that&#8217;s the kind of thing that they could consider, but then that&#8217;s still a pretty extraordinary outcome because the Chinese government would have to be willing to make these things and then give it away. And I kinda just don&#8217;t think they will.</p><p><strong>00:27:11 Nathan Lambert:</strong> Yeah. I mean, all the politics of control with how everybody thinks AI is so powerful are pointing to very value-destructive actions economically in order to achieve the end state that people determine to be right. It&#8217;s like supporting open source to the extent that you can to avoid situations like Anthropic being labeled a supply chain risk and having interactions like that totally decimating runway of AI productivity. 
Like, if the companies are really gonna commit to open source for other things, then they&#8217;re gonna lose money. And I see this in-- China&#8217;s economy would be taking a gigantic hit doing this. And that&#8217;s kind of a common theme of what we&#8217;re talking about is that the interface of AI in an economic fashion is gonna make the next few years really weird.</p><p><strong>00:28:06 Dean Ball:</strong> I hope so.</p><p><strong>00:28:09 Nathan Lambert:</strong> I think things are gonna be weird, but I haven&#8217;t spent a ton of time thinking about how that interacts with political institutions. I thought about socially weird a lot, but I haven&#8217;t thought about power weird a lot.</p><p><strong>00:28:20 Dean Ball:</strong> Oh, power weird is what I worry about all the time. What I worry about the most is I think it&#8217;s plausible that what we&#8217;re seeing... I&#8217;ve always had this concern. I have this dual problem of-- maybe I&#8217;m talking out of both sides of my mouth. Maybe that&#8217;s just the critique, and it&#8217;s a fair critique. But I routinely complain about how people in government aren&#8217;t really... They pretend to take AI seriously, but they don&#8217;t take it that seriously. And they don&#8217;t really own the implications of advanced, of near term advanced AI and all that. I think we basically have transformative AI right now, but they don&#8217;t own that, because it&#8217;s annoying, it&#8217;s difficult, it&#8217;s conceptually challenging.</p><p><strong>00:29:08 Dean Ball:</strong> But the flip side of that is that if people do start to take it very seriously, there&#8217;s the risk that they sort of lash out, that they get scared, and they lash out and do things that are rash, in a rush. And that actually creates very, very bad, much worse outcomes than you otherwise might have gotten. 
I think that&#8217;s a very fair risk, and I think it&#8217;s possible that you might see things like that happen within the U.S. I don&#8217;t think this particular incident with Anthropic is quite an example of that. But it&#8217;s possible that you do see that in the coming years, and that is in and of itself a pretty scary outcome because if the U.S. government decides that they want to nationalize the frontier labs, I think it could be one of the most tyrannical things we ever see happen in this country.</p><p><strong>00:30:16 Nathan Lambert:</strong> Yeah. It&#8217;s like, I don&#8217;t know how to reply to this. I think things are... It&#8217;s serious times and I see so many... It feels like such a Sisyphean task to make more open models exist, but all the broader trends seem to point to that being a more stable equilibrium in a lot of ways. Like, good enough open models and keeping up with what we all feel happening in the closed model land.</p><p><strong>00:30:50 Nathan Lambert:</strong> So I don&#8217;t know. I stay motivated, but I feel increasingly lost in terms of achieving it.</p><p><strong>00:30:56 Dean Ball:</strong> I don&#8217;t think you should be. I think, look, I suspect the US government will not actually do it, and the best thing about America is that our general sort of-- I don&#8217;t wanna say incompetence, but the general sort of chaos of American institutions and decentralized confusingness of it all, it can often be quite frustrating, and it can sometimes be a detriment, but it can also be really great because we tend to not execute and follow through on our very worst ideas. And so I don&#8217;t think we&#8217;re going to do that. It doesn&#8217;t feel very American to do it. I worry about it because I worry about these rash reactions, and that&#8217;s why I fight as heavily as I do on things like this, despite not insignificant cost to me to do it, politically speaking. But that&#8217;s totally worth it because I care about this. 
I think everything, I think that will probably be fine. But yeah, I do agree. It&#8217;s a major risk. It&#8217;s a major risk, and it&#8217;s a weird world to think about, I&#8217;ll tell you that much.</p><p><strong>00:32:16 Nathan Lambert:</strong> Yeah. I don&#8217;t have a lot more to add. I&#8217;m sure we&#8217;ll continue this discussion. I think it warrants the space of it &#8216;cause that&#8217;s the... It&#8217;s one of the longer term things, but it&#8217;s not in the news cycle whatsoever, at least the open model angle. There&#8217;s just so many layers. People have to talk. Like, send feedback, people listening. I&#8217;ll even send this out as a podcast as well and just like, what do people think? How do we get to the places we want to get to?</p><p><strong>00:32:46 Dean Ball:</strong> Well, one thing I&#8217;m particularly interested in is-- one of the items in the Trump administration action plan, which I worked on for those who don&#8217;t have that context, is this idea of financializing compute, creating a financial market, like basically a commodities market for compute so that you can buy, you know, like really robust. In the same way that you can buy electricity spot, electricity futures and electricity on the spot market and things like this, the wholesale. Could you do something like that for compute? That could really profoundly change the dynamics and the economics of AI production. It&#8217;s not gonna turn them over. It doesn&#8217;t flip them on their head, but it changes it quite meaningfully. And I&#8217;m very excited by that prospect.</p><p><strong>00:33:48 Dean Ball:</strong> And that&#8217;s the kind of thing that I would be increasingly doing if this sort of interference of government into the frontier continues. What I suspect I&#8217;ll do is start developing some of those ideas which I developed earlier. I&#8217;m only one person. If those things start to seem relevant again, I totally will. 
Because anything to make it easier to produce AI for people that don&#8217;t have trillions of dollars will be extremely important.</p><p><strong>00:34:38 Nathan Lambert:</strong> Yeah. I think that... I don&#8217;t know. I&#8217;m happy to leave it there.</p><p><strong>00:34:43 Dean Ball:</strong> Cool.</p><p><strong>00:34:45 Nathan Lambert:</strong> I can let you get on your trip. It&#8217;s good to catch up. I&#8217;m early in the process of potentially coming to DC in a few months, so I will let you know if I do.</p><p><strong>00:34:52 Dean Ball:</strong> Oh, please do. It&#8217;d be great to see you. We can record an episode of my podcast live.</p><p><strong>00:34:58 Nathan Lambert:</strong> Sounds good. Okay. Thanks everybody for listening.</p><p><strong>00:35:03 Dean Ball:</strong> Talk to y&#8217;all later. Bye.</p>]]></content:encoded></item><item><title><![CDATA[Olmo Hybrid and future LLM architectures]]></title><description><![CDATA[The latest Olmo model and discussions at the frontier of open-source post training tools.]]></description><link>https://www.interconnects.ai/p/olmo-hybrid-and-future-llm-architectures</link><guid isPermaLink="false">https://www.interconnects.ai/p/olmo-hybrid-and-future-llm-architectures</guid><dc:creator><![CDATA[Nathan Lambert]]></dc:creator><pubDate>Thu, 05 Mar 2026 16:16:44 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!7CIi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F634f5655-f362-4494-80ff-38c095b9caaf_2846x1128.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>So-called hybrid architectures are far from new in open-weight models these days. 
We now have the recent <a href="https://qwen.ai/blog?id=qwen3.5">Qwen 3.5</a> (previewed by <a href="https://qwen.ai/blog?id=e34c4305036ce60d55a0791b170337c2b70ae51d&amp;from=home.latest-research-list">Qwen3-Next</a>), <a href="https://arxiv.org/abs/2510.26692">Kimi Linear</a> last fall (a smaller release than their <a href="https://www.interconnects.ai/p/kimi-k2-thinking-what-it-means">flagship Kimi K2 models</a>), <a href="https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16">Nvidia&#8217;s Nemotron 3 Nano</a> (with the bigger models expected to drop soon), <a href="https://huggingface.co/ibm-granite/granite-4.0-tiny-preview">IBM Granite 4</a>, and other less notable models. This is one of those times when a research trend looks like it&#8217;s getting adopted everywhere at once (maybe the Muon optimizer too, soon?).</p><p>To tell this story, we need to go back a few years to December 2023, when <a href="https://www.interconnects.ai/p/llms-beyond-attention">Mamba and Striped Hyena</a> were taking the world by storm<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> &#8212; asking the question: Do we need full attention in our models? These early models fizzled out, partly for the same reasons they&#8217;re hard today &#8212; tricky implementations, open-source tool problems, more headaches in training &#8212; but also because the models fell over a bit when scaled up. The hybrid models of the day weren&#8217;t quite good enough yet.</p><p>These models are called hybrid because they mix these new recurrent neural network (RNN) modules with the traditional attention that made the transformer famous. They all work best with this mix of modules. 
The RNN layers keep part of the computation compressed in a hidden state that is used to predict the next token &#8212; a summary of all information that came before &#8212; an idea with an extremely long historical lineage in deep learning, e.g. back to the <a href="https://en.wikipedia.org/wiki/Long_short-term_memory">LSTM</a>. This setup avoids the quadratic compute cost of attention (i.e. it avoids the attention operator&#8217;s KV cache, which expands with every new token), and can even assist in solving new problems.</p><p>The models listed at the start of this article use a mix of RNN approaches: some (Qwen and Kimi) use a newer idea called Gated DeltaNet (GDN), and some still use Mamba layers (Granite and Nemotron). The Olmo Hybrid model we&#8217;re releasing today also falls on the GDN side, based on careful experimentation and on theory that GDN is capable of learning features that attention or Mamba layers cannot.</p><h2>Introducing Olmo Hybrid and its pretraining efficiency</h2><p>Olmo Hybrid is a 7B base model, released with 3 experimental post-trained checkpoints &#8212; starting with an Instruct model, with a reasoning model coming soon. It is the best open artifact for studying hybrid models, as it is almost identical to our <a href="https://www.interconnects.ai/p/olmo-3-americas-truly-open-reasoning">Olmo 3 7B model</a> from last fall, just with a change in architecture. 
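</p><p>As a toy illustration of the kind of recurrence GDN-style layers use (my own simplified sketch with made-up dimensions and gate values, not the Olmo Hybrid implementation), a gated delta-rule update keeps a fixed-size state matrix no matter how long the sequence gets:</p>

```python
import numpy as np

def gated_delta_step(S, k, v, q, alpha, beta):
    """One step of a heavily simplified gated delta rule (toy sketch).

    S is the fixed-size recurrent state, shape (d_v, d_k).
    k and q are unit-norm key/query vectors, v is a value vector.
    alpha is a scalar decay gate, beta is a scalar write strength.
    """
    # S @ (I - beta * k k^T): erase the part of memory addressed by k.
    S = S - beta * np.outer(S @ k, k)
    # Decay the old state, then write the new (k, v) association.
    S = alpha * S + beta * np.outer(v, k)
    # Read out with the query; cost and memory are independent of position.
    return S, S @ q

rng = np.random.default_rng(0)
d_k = d_v = 4
S = np.zeros((d_v, d_k))
for t in range(1000):  # the state shape never grows with sequence length
    k = rng.standard_normal(d_k)
    k /= np.linalg.norm(k)
    S, out = gated_delta_step(S, k, rng.standard_normal(d_v),
                              rng.standard_normal(d_k), alpha=0.9, beta=0.5)
print(S.shape, out.shape)
```

<p>The state plays the role that the KV cache plays in attention, but it stays the same size at every step, which is where the inference-time savings come from.</p><p>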
With the model, we are releasing a paper with substantial theory on <em>why</em> hybrid models can be better than standard transformers. This is a long paper that I&#8217;m still personally working through, but it&#8217;s excellent. </p><p>You can read the paper <a href="https://allenai.org/papers/olmo-hybrid">here</a> and poke around with the checkpoints <a href="https://huggingface.co/collections/allenai/olmo-hybrid">here</a>. This is an incredible, long-term research project led by <a href="https://lambdaviking.com/">Will Merrill</a>. He did a great job.</p><p>To understand the context of why hybrid models can be a strict upgrade on transformers, let me begin with a longer excerpt from the paper&#8217;s introduction, emphasis mine:</p><blockquote><p>Past theoretical work has shown that attention and recurrence have complementary strengths (Merrill et al., 2024; Grazzi et al., 2025), so mixing them is a natural way to construct an architecture with the benefits of both primitives. <strong>We further derive novel theoretical results showing that hybrid models are even more powerful than the sum of their parts</strong>: there are formal problems related to code evaluation that neither transformers nor GDN can express on their own, but which hybrid models can represent theoretically and learn empirically. <strong>But</strong> <strong>this greater expressivity does not immediately imply that hybrid models should be better LMs: thus, we run fully controlled scaling studies comparing hybrid models vs. transformers</strong>, showing rigorously that hybrid models&#8217; expressivity translates to better token efficiency, in agreement with our observations from the Olmo Hybrid pretraining run. 
Finally, we provide a theoretical explanation for why increasing an architecture&#8217;s expressive power should improve language model scaling rooted in the multi-task nature of the language modeling objective.</p><p>Taken together, our results suggest that hybrid models dominate transformers, both theoretically, in their balance of expressivity and parallelism, and empirically, in terms of benchmark performance and long-context abilities. We believe these findings position hybrid models for wider adoption and call on the research community to pursue further architecture research.</p></blockquote><p>Essentially, we show and argue a few things:</p><ol><li><p><strong>Hybrid models are more expressive.</strong> They can form their outputs to learn more types of functions. An intuition for why this is good: more expressive models fit well with deep learning because we want to make the model class as flexible as possible and let the optimizer do the work rather than putting constraints on the learner. Sounds a lot like the <a href="http://www.incompleteideas.net/IncIdeas/BitterLesson.html">Bitter Lesson</a>.</p></li><li><p><strong>Why does expressive power help with efficiency?</strong> This is where things are more nuanced. We argue that more expressive models will have better scaling laws, following the <em><a href="https://arxiv.org/abs/2303.13506">quantization model</a></em><a href="https://arxiv.org/abs/2303.13506"> of neural scaling</a>.</p></li></ol><p>All of this theory work is a great way to go deeper, and frankly I have a lot more to learn on it, but the crucial part is that we transition from theory to clear experiments that back it up. In particular, the scaling laws for designing this model were studied carefully to decide on the final hybrid architecture. 
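</p><p>To make the inference-time tradeoff concrete, here is a back-of-the-envelope comparison of decode-time memory for a full-attention model versus a hybrid with a 3:1 ratio of RNN to attention layers. All dimensions here are made up for illustration; they are not Olmo Hybrid&#8217;s actual configuration:</p>

```python
def decode_state_bytes(n_layers, n_attn_layers, seq_len,
                       n_kv_heads=8, head_dim=128,
                       rnn_state_bytes=8 * 128 * 128 * 2, dtype_bytes=2):
    """Rough per-sequence decode memory; all dimensions are illustrative."""
    # Each attention layer stores K and V for every past token.
    kv_cache = n_attn_layers * seq_len * 2 * n_kv_heads * head_dim * dtype_bytes
    # Each RNN layer keeps a fixed-size state, independent of seq_len.
    rnn_state = (n_layers - n_attn_layers) * rnn_state_bytes
    return kv_cache + rnn_state

layers, ctx = 32, 65536
full_attn = decode_state_bytes(layers, layers, ctx)
hybrid = decode_state_bytes(layers, layers // 4, ctx)  # 3:1 RNN:attention
print(f"full attention: {full_attn / 2**30:.2f} GiB")
print(f"3:1 hybrid:     {hybrid / 2**30:.2f} GiB")
# At this context length the hybrid needs roughly 4x less decode memory.
```

<p>The attention layers&#8217; cache scales with context length while each RNN layer&#8217;s state is constant, so at long contexts the hybrid&#8217;s memory approaches the fraction of layers that are still attention.</p><p>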
The final performance is very sensitive to exactly which RNN block is used and in what quantity.</p><p>In scaling experiments, the results showed that for Olmo, the hybrid GDN (3:1 ratio of layers) &gt; pure GDN (all RNN layers) &gt; standard transformer (all attention) &gt; hybrid Mamba2 &gt; pure Mamba2. The crucial point was that these gaps were maintained when scaling to more parameters and compute. A visual summary of the different types of architectures studied is below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7CIi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F634f5655-f362-4494-80ff-38c095b9caaf_2846x1128.png"><img src="https://substackcdn.com/image/fetch/$s_!7CIi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F634f5655-f362-4494-80ff-38c095b9caaf_2846x1128.png" width="1456" height="577" alt=""></a></figure></div><p>In terms of this specific model, the pretraining gains were giant! Relative to Olmo 3 dense, it represents roughly a 2X gain in training efficiency. 
When you look at evaluation performance for pretraining, there was also a substantial improvement, particularly after long-context extension (the final 2 rows of Table 2 in the paper, highlighted below).</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IgMs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F072051d6-3788-4ab4-9587-c051f282b3b8_2906x2370.png"><img src="https://substackcdn.com/image/fetch/$s_!IgMs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F072051d6-3788-4ab4-9587-c051f282b3b8_2906x2370.png" width="1456" height="1187" alt=""></a></figure></div><h2>The journey to post-training Olmo Hybrid</h2><p>Most of the experience in post-training Olmo models has been climbing a steep curve in base model capabilities with minor tweaks to architecture. Our recipes from <a href="https://arxiv.org/abs/2311.10702">Tulu 2</a>, <a href="https://arxiv.org/abs/2411.15124">Tulu 3</a>, and the Olmo 3 reasoning work (building substantially on <a href="https://arxiv.org/abs/2506.04178">OpenThoughts 3</a>) all worked in a fairly straightforward, off-the-shelf manner. Olmo Hybrid is our first experience in post-training a substantially different architecture, and the results were mixed. </p><h3>1. Benchmark performance</h3><p>Following the Olmo 3 recipe, we got some substantial wins (knowledge) and some substantial losses (extended reasoning) relative to the dense model. 
All together these still represent a very strong fully open model &#8212; just that the pretraining gains didn&#8217;t translate as obviously. The results are below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BSEJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b5485f6-9d57-45c9-a686-a51754acc4cb_3992x1562.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BSEJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b5485f6-9d57-45c9-a686-a51754acc4cb_3992x1562.png 424w, https://substackcdn.com/image/fetch/$s_!BSEJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b5485f6-9d57-45c9-a686-a51754acc4cb_3992x1562.png 848w, https://substackcdn.com/image/fetch/$s_!BSEJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b5485f6-9d57-45c9-a686-a51754acc4cb_3992x1562.png 1272w, https://substackcdn.com/image/fetch/$s_!BSEJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b5485f6-9d57-45c9-a686-a51754acc4cb_3992x1562.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BSEJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b5485f6-9d57-45c9-a686-a51754acc4cb_3992x1562.png" width="1456" height="570" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4b5485f6-9d57-45c9-a686-a51754acc4cb_3992x1562.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:570,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:829143,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.interconnects.ai/i/189829062?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b5485f6-9d57-45c9-a686-a51754acc4cb_3992x1562.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!BSEJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b5485f6-9d57-45c9-a686-a51754acc4cb_3992x1562.png 424w, https://substackcdn.com/image/fetch/$s_!BSEJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b5485f6-9d57-45c9-a686-a51754acc4cb_3992x1562.png 848w, https://substackcdn.com/image/fetch/$s_!BSEJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b5485f6-9d57-45c9-a686-a51754acc4cb_3992x1562.png 1272w, https://substackcdn.com/image/fetch/$s_!BSEJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4b5485f6-9d57-45c9-a686-a51754acc4cb_3992x1562.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The exact reason why this happens is a research question. Our best guess is that the Olmo Hybrid base model is just a sufficiently different student model, where most of our post-training data at early stages is learning from stronger &#8220;teacher&#8221; models (a recap of this method, called <a href="https://www.interconnects.ai/p/how-much-does-distillation-really">distillation</a>, appeared recently in Interconnects). </p><p>There is a lot of other research ongoing in the community around what makes a strong teacher model &#8212; generally, the best overall model is <em>not </em>the best teacher. In other words, training on data output by the model with the best evaluation scores today is unlikely to unlock the ceiling in performance for your new base model. 
A second factor, which is even less explored, is how different base models likely need different teachers to learn from. This is why Olmo Hybrid could perform very differently: its behavior is downstream of an architecture-driven change in learning dynamics, even though the pretraining data is almost identical.</p><p>There&#8217;s A LOT more work to dig into here, some <a href="https://www.openthoughts.ai/blog/agent">empirical work in generating better data</a> and other work in understanding how <a href="https://x.com/_emliu/status/2026359480363913531?s=46&amp;t=0Enn1cSa9nnKjGPrLHWfng">different training stages fit together</a>. I am confident this Olmo Hybrid base model is solid and more performance can be extracted, but it takes more careful work adapting existing datasets.</p><h3>2. Open-source tooling </h3><p>The frank reality of new architectures for open models is that the open-source software tooling support is horrific. There are the paper cuts that people are familiar with, e.g. random errors in popular libraries (as people experienced with GPT-OSS) that slow adoption, but there are also deeper problems.</p><p>A large part of the potential benefit of hybrid models is the reduction in memory usage for long-context generation, which is crucial for reinforcement learning and agentic tasks. It should be a huge win for post-training! This, unfortunately, is far from the case, and will likely take another 3-6 months to get right for this batch of GDN models.</p><p>The core problem is that the open-source inference tools, e.g. vLLM, are relying on far less developed kernels (and other internals) when compared to standard transformers. This comes with two challenges &#8212; throughput slowdowns and numerical issues. Numerical issues can be combated with a variety of inference flags. 
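</p><p>As a rough sketch of how these fit together in practice (the model path below is a placeholder, and flag availability depends on your vLLM version), the stability flags are passed straight to the vLLM server:</p><pre><code># Placeholder model path; flags follow the settings described in the report.
vllm serve MODEL_PATH \
  --enforce-eager \
  --disable-cascade-attn \
  --mamba_ssm_cache_dtype float32</code></pre><p>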
Quoting the paper again:</p><blockquote><p>The two key flags in vLLM we needed to get maximum performance with the post-training model were <code>--disable-cascade-attn</code>, which disables cascade attention (an optimization for shared prompt prefixes), and <code>--enforce-eager</code>, which turns off CUDA graphs. These two flags have been used in our RL setup dating back to Olmo 3, but are new additions to evaluations. Scores for the released models drop precipitously without them. We also evaluated our final models with the hybrid model cache in the richer FP32 datatype, to improve stability via <code>--mamba_ssm_cache_dtype</code> following NVIDIA.</p></blockquote><p>Essentially, we used these to make sure the model was numerically stable. The downside is that the inference throughput plummets, so the potential gains in compute efficiency are erased. A comparison of numbers is below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!00Cf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5554664-f970-4321-9863-c08c8239c17f_3902x2440.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!00Cf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5554664-f970-4321-9863-c08c8239c17f_3902x2440.png 424w, https://substackcdn.com/image/fetch/$s_!00Cf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5554664-f970-4321-9863-c08c8239c17f_3902x2440.png 848w, 
https://substackcdn.com/image/fetch/$s_!00Cf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5554664-f970-4321-9863-c08c8239c17f_3902x2440.png 1272w, https://substackcdn.com/image/fetch/$s_!00Cf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5554664-f970-4321-9863-c08c8239c17f_3902x2440.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!00Cf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5554664-f970-4321-9863-c08c8239c17f_3902x2440.png" width="1456" height="910" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f5554664-f970-4321-9863-c08c8239c17f_3902x2440.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:910,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:899739,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.interconnects.ai/i/189829062?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5554664-f970-4321-9863-c08c8239c17f_3902x2440.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!00Cf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5554664-f970-4321-9863-c08c8239c17f_3902x2440.png 424w, 
https://substackcdn.com/image/fetch/$s_!00Cf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5554664-f970-4321-9863-c08c8239c17f_3902x2440.png 848w, https://substackcdn.com/image/fetch/$s_!00Cf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5554664-f970-4321-9863-c08c8239c17f_3902x2440.png 1272w, https://substackcdn.com/image/fetch/$s_!00Cf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff5554664-f970-4321-9863-c08c8239c17f_3902x2440.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Data for this is available <a href="https://gist.github.com/natolambert/0a6ad2e9f513d7a72b76d9e3a7b0bbb1">here</a>.</figcaption></figure></div><p>Effectively, the 7B hybrid model today takes more compute to train with RL than our 7B dense model (that doesn&#8217;t even have a common memory saving technique, GQA). The total compute estimate from the table at different context lengths is below (more visuals in the <a href="https://docs.google.com/presentation/d/1K3bM3K7q_CBcXzUCX7a1YvUHAycpvTKZbJElKSOdiok/edit">slides from my recent CMU talk</a>).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GmWW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b42e8e5-d75c-433c-a41a-1ae5aa18b114_3571x2073.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GmWW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b42e8e5-d75c-433c-a41a-1ae5aa18b114_3571x2073.png 424w, https://substackcdn.com/image/fetch/$s_!GmWW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b42e8e5-d75c-433c-a41a-1ae5aa18b114_3571x2073.png 848w, https://substackcdn.com/image/fetch/$s_!GmWW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b42e8e5-d75c-433c-a41a-1ae5aa18b114_3571x2073.png 1272w, 
https://substackcdn.com/image/fetch/$s_!GmWW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b42e8e5-d75c-433c-a41a-1ae5aa18b114_3571x2073.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GmWW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b42e8e5-d75c-433c-a41a-1ae5aa18b114_3571x2073.png" width="1456" height="845" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7b42e8e5-d75c-433c-a41a-1ae5aa18b114_3571x2073.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:845,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:317359,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.interconnects.ai/i/189829062?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b42e8e5-d75c-433c-a41a-1ae5aa18b114_3571x2073.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GmWW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b42e8e5-d75c-433c-a41a-1ae5aa18b114_3571x2073.png 424w, https://substackcdn.com/image/fetch/$s_!GmWW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b42e8e5-d75c-433c-a41a-1ae5aa18b114_3571x2073.png 848w, 
https://substackcdn.com/image/fetch/$s_!GmWW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b42e8e5-d75c-433c-a41a-1ae5aa18b114_3571x2073.png 1272w, https://substackcdn.com/image/fetch/$s_!GmWW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b42e8e5-d75c-433c-a41a-1ae5aa18b114_3571x2073.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The good news is that these are solvable problems &#8212; and improving the tooling could even improve benchmark numbers &#8212; but 
it&#8217;s going to take a good bit of time and hard work in the OSS community. </p><p>This leads to my final question. I&#8217;m optimistic about the open ecosystem evolving to support these models with ease, motivated by the better fundamental scaling of the architectures and a large cluster of leading open model builders already using them. But are closed models like GPT and Claude built like this? </p><p>To be clear, this answer is a total guess (which I don&#8217;t normally do), but with the evidence I have I&#8217;d put the chance of one of the 3 frontier models being an RNN at around a coin flip. I&#8217;ll let you know if I learn for sure either way. If the scaling advantages hold at frontier scale, the economic case becomes hard to ignore, though they could already have architectures that are as efficient as RNNs, with even more benefits.</p><div><hr></div><p>I&#8217;m going to follow up this post with more architecture discussions, particularly on why Mixture of Experts (MoE) models are a major headache to post-train, so make sure to subscribe if that sounds interesting to you!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.interconnects.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.interconnects.ai/subscribe?"><span>Subscribe now</span></a></p><p><em>Thanks to Will Merrill and Finbarr Timbers for some discussions that helped inform this post.</em></p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>and still my <a href="https://www.youtube.com/watch?v=OFFHiJzPpCQ&amp;list=PLlp6Ex8YB3QOH0SibhH3oDZucFrqc8K9v&amp;index=17&amp;t=1s&amp;pp=iAQB">most-viewed interview</a> on YouTube, as the first one I 
did.</p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[Latest open artifacts (#19): Qwen 3.5, GLM 5, MiniMax 2.5 — Chinese labs' latest push of the frontier]]></title><description><![CDATA[Welcome to the year of the horse!]]></description><link>https://www.interconnects.ai/p/latest-open-artifacts-19-qwen-35</link><guid isPermaLink="false">https://www.interconnects.ai/p/latest-open-artifacts-19-qwen-35</guid><dc:creator><![CDATA[Florian Brand]]></dc:creator><pubDate>Tue, 03 Mar 2026 16:30:59 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!e_rH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe8c9fc29-d070-40a9-aa6e-548ec4a81714_1792x629.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>It&#8217;s been a busy month at the top end of open-weights AI &#8212; with new flagship models from all of Qwen, MiniMax, Z.ai, Ant Ling, and StepFun. Still, all eyes are on DeepSeek V4&#8217;s pending release, which rumors continue to accelerate towards. Outside of the large, frontier models, this issue is a bit lighter on the long-tail of niche modalities and model sizes.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.interconnects.ai/p/latest-open-artifacts-19-qwen-35?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.interconnects.ai/p/latest-open-artifacts-19-qwen-35?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><p>With all these new releases, we&#8217;re tracking them with our new <a href="https://atomproject.ai/relative-adoption-metric">Relative Adoption Metrics (RAM)</a>, a measurement tool that normalizes model downloads relative to peer models in their size class. 
This has already been an extremely useful tool for us, highlighting underrated models like GPT-OSS, whose download numbers are literally off the charts &#8212; the most popular American open-weights model since Llama 3.1. A RAM score &gt;1 means the model is on track to be a top-10 all-time downloaded model in its size class. We&#8217;re particularly interested to see how the early adoption of the smaller Qwen 3.5 dense models will go relative to Qwen 3 &#8212; balancing Qwen&#8217;s ever-growing brand with a trickier hybrid model architecture that can push the limits of some open-source tools.</p><p>A summary of the RAM scores for some of the popular models released late in 2025 is below, highlighting Kimi K2 Thinking and some OCR models as clear winners. DeepSeek V3.2, and their other recent large models, have wildly underperformed DeepSeek&#8217;s earlier releases in 2025.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eppK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F726fcde9-2645-4c77-9779-03882beb295b_2554x1414.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eppK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F726fcde9-2645-4c77-9779-03882beb295b_2554x1414.png 424w, https://substackcdn.com/image/fetch/$s_!eppK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F726fcde9-2645-4c77-9779-03882beb295b_2554x1414.png 848w, 
https://substackcdn.com/image/fetch/$s_!eppK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F726fcde9-2645-4c77-9779-03882beb295b_2554x1414.png 1272w, https://substackcdn.com/image/fetch/$s_!eppK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F726fcde9-2645-4c77-9779-03882beb295b_2554x1414.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eppK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F726fcde9-2645-4c77-9779-03882beb295b_2554x1414.png" width="1456" height="806" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/726fcde9-2645-4c77-9779-03882beb295b_2554x1414.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:806,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:285255,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.interconnects.ai/i/189756490?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F726fcde9-2645-4c77-9779-03882beb295b_2554x1414.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eppK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F726fcde9-2645-4c77-9779-03882beb295b_2554x1414.png 424w, 
https://substackcdn.com/image/fetch/$s_!eppK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F726fcde9-2645-4c77-9779-03882beb295b_2554x1414.png 848w, https://substackcdn.com/image/fetch/$s_!eppK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F726fcde9-2645-4c77-9779-03882beb295b_2554x1414.png 1272w, https://substackcdn.com/image/fetch/$s_!eppK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F726fcde9-2645-4c77-9779-03882beb295b_2554x1414.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The time here is days since release.</figcaption></figure></div><h1>Artifacts Log</h1><h3>Our Picks</h3><ul><li><p><strong><a href="https://huggingface.co/Qwen/Qwen3.5-397B-A17B">Qwen3.5-397B-A17B</a></strong> by <a href="https://huggingface.co/Qwen">Qwen</a>: The long-awaited update to Qwen is finally here. It comes in various sizes from 0.8B to 27B (dense) and 35B-A3B to 397B-A17B (MoE), some of them even with base models. All of them are multi-modal, use reasoning by default and are based on the Qwen-Next architecture with GDN layers.<br></p><p>We tested these models over the last few days, and they are a clear upgrade over the previous version: There are a lot of substantial improvements across the board, making them perfect workhorses for a wide range of tasks.<br>Their style and instruction-following have improved, and the models are even better at multilingual tasks, covering more languages.<br><br>However, at least the small models (still) tend to overthink. 
You can turn off reasoning by disabling it in the chat template.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!E9hc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F979b5a61-90d3-413e-a2ab-55215d8d3541_17277x8176.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!E9hc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F979b5a61-90d3-413e-a2ab-55215d8d3541_17277x8176.png 424w, https://substackcdn.com/image/fetch/$s_!E9hc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F979b5a61-90d3-413e-a2ab-55215d8d3541_17277x8176.png 848w, https://substackcdn.com/image/fetch/$s_!E9hc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F979b5a61-90d3-413e-a2ab-55215d8d3541_17277x8176.png 1272w, https://substackcdn.com/image/fetch/$s_!E9hc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F979b5a61-90d3-413e-a2ab-55215d8d3541_17277x8176.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!E9hc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F979b5a61-90d3-413e-a2ab-55215d8d3541_17277x8176.png" width="1456" height="689" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/979b5a61-90d3-413e-a2ab-55215d8d3541_17277x8176.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:689,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Benchmark Results&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Benchmark Results" title="Benchmark Results" srcset="https://substackcdn.com/image/fetch/$s_!E9hc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F979b5a61-90d3-413e-a2ab-55215d8d3541_17277x8176.png 424w, https://substackcdn.com/image/fetch/$s_!E9hc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F979b5a61-90d3-413e-a2ab-55215d8d3541_17277x8176.png 848w, https://substackcdn.com/image/fetch/$s_!E9hc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F979b5a61-90d3-413e-a2ab-55215d8d3541_17277x8176.png 1272w, https://substackcdn.com/image/fetch/$s_!E9hc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F979b5a61-90d3-413e-a2ab-55215d8d3541_17277x8176.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" 
stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li><li><p><strong><a href="https://huggingface.co/stepfun-ai/Step-3.5-Flash">Step-3.5-Flash</a></strong> by <a href="https://huggingface.co/stepfun-ai">stepfun-ai</a>: StepFun really stepped up its game (no pun intended), releasing a 196B-A11B MoE with strong metrics across the board. It is especially strong in math benchmarks, beating out models that are several times larger than it.</p></li><li><p><strong><a href="https://huggingface.co/zai-org/GLM-5">GLM-5</a></strong> by <a href="https://huggingface.co/zai-org">zai-org</a>: A 744B-A40B release from the Zhipu team, which has resulted in such a big increase in demand that they <a href="https://www.reuters.com/technology/chinese-ai-startup-zhipu-hikes-prices-coding-plan-demand-rises-2026-02-12/">raised prices</a> for their coding plan. 
It also comes with an <a href="https://arxiv.org/abs/2602.15763">accompanying tech report</a>.</p></li><li><p><strong><a href="https://huggingface.co/MiniMaxAI/MiniMax-M2.5">MiniMax-M2.5</a></strong> by <a href="https://huggingface.co/MiniMaxAI">MiniMaxAI</a>: Despite its relatively small size, MiniMax-M2.5 rivals models such as GLM-5 and Kimi K2.5 and has quickly become one of the community&#8217;s favorites.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oDwH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf043d8f-881e-4132-a606-99a25f9b5305_1280x617.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oDwH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf043d8f-881e-4132-a606-99a25f9b5305_1280x617.png 424w, https://substackcdn.com/image/fetch/$s_!oDwH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf043d8f-881e-4132-a606-99a25f9b5305_1280x617.png 848w, https://substackcdn.com/image/fetch/$s_!oDwH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf043d8f-881e-4132-a606-99a25f9b5305_1280x617.png 1272w, https://substackcdn.com/image/fetch/$s_!oDwH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf043d8f-881e-4132-a606-99a25f9b5305_1280x617.png 1456w" sizes="100vw"><img
src="https://substackcdn.com/image/fetch/$s_!oDwH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf043d8f-881e-4132-a606-99a25f9b5305_1280x617.png" width="1280" height="617" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bf043d8f-881e-4132-a606-99a25f9b5305_1280x617.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:617,&quot;width&quot;:1280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oDwH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf043d8f-881e-4132-a606-99a25f9b5305_1280x617.png 424w, https://substackcdn.com/image/fetch/$s_!oDwH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf043d8f-881e-4132-a606-99a25f9b5305_1280x617.png 848w, https://substackcdn.com/image/fetch/$s_!oDwH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf043d8f-881e-4132-a606-99a25f9b5305_1280x617.png 1272w, https://substackcdn.com/image/fetch/$s_!oDwH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbf043d8f-881e-4132-a606-99a25f9b5305_1280x617.png 1456w" sizes="100vw"></picture></div></a></figure></div></li><li><p><strong><a href="https://huggingface.co/open-thoughts/OpenThinker-Agent-v1">OpenThinker-Agent-v1</a></strong> by <a href="https://huggingface.co/open-thoughts">open-thoughts</a>: OpenThinkers, known for their open reasoning releases (such as <a href="https://huggingface.co/datasets/open-thoughts/OpenThoughts3-1.2M">OpenThoughts 3</a>), are now tackling agentic reasoning.
Their initial release includes <a href="https://huggingface.co/open-thoughts/OpenThinker-Agent-v1-SFT">SFT</a> and <a href="https://huggingface.co/datasets/open-thoughts/OpenThoughts-Agent-v1-RL">RL</a> data, as well as a &#8220;lite&#8221; <a href="https://huggingface.co/datasets/open-thoughts/OpenThoughts-TBLite">version</a> of terminal-based tasks to evaluate smaller models.</p></li></ul><p>The subtle differences in architecture of these models are covered in detail in the similar, more technically focused, round-up from <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Sebastian Raschka, PhD&quot;,&quot;id&quot;:27393275,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F61f4c017-506f-4e9b-a24f-76340dad0309_800x800.jpeg&quot;,&quot;uuid&quot;:&quot;8d475c1b-577d-4a48-929d-0f5195d8fd33&quot;}" data-component-name="MentionToDOM"></span> &#8212; it&#8217;s a good complement if you&#8217;re looking to go deeper: </p><div class="embedded-post-wrap" data-attrs="{&quot;id&quot;:189051354,&quot;url&quot;:&quot;https://magazine.sebastianraschka.com/p/a-dream-of-spring-for-open-weight&quot;,&quot;publication_id&quot;:1174659,&quot;publication_name&quot;:&quot;Ahead of AI&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!96vs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49f25d0a-212b-4853-8bcb-128d0a3edbbf_1196x1196.png&quot;,&quot;title&quot;:&quot;A Dream of Spring for Open-Weight LLMs: 10 Architectures from Jan-Feb 2026&quot;,&quot;truncated_body_text&quot;:&quot;If you have struggled a bit to keep up with open-weight model releases this month, this article should catch you up on the main 
themes.&quot;,&quot;date&quot;:&quot;2026-02-25T13:26:56.028Z&quot;,&quot;like_count&quot;:150,&quot;comment_count&quot;:7,&quot;bylines&quot;:[{&quot;id&quot;:27393275,&quot;name&quot;:&quot;Sebastian Raschka, PhD&quot;,&quot;handle&quot;:&quot;rasbt&quot;,&quot;previous_name&quot;:&quot;Sebastian Raschka&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F61f4c017-506f-4e9b-a24f-76340dad0309_800x800.jpeg&quot;,&quot;bio&quot;:&quot;I'm an LLM research engineer 10+ years of experience in artificial intelligence. My expertise lies in AI &amp; LLM research focusing on code-driven implementations. I am also the author of \&quot;Build a Large Language Model From Scratch\&quot; (amzn.to/4fqvn0D).&quot;,&quot;profile_set_up_at&quot;:&quot;2022-10-09T16:19:59.744Z&quot;,&quot;reader_installed_at&quot;:&quot;2022-11-07T19:56:32.129Z&quot;,&quot;publicationUsers&quot;:[{&quot;id&quot;:1127862,&quot;user_id&quot;:27393275,&quot;publication_id&quot;:1174659,&quot;role&quot;:&quot;admin&quot;,&quot;public&quot;:true,&quot;is_primary&quot;:true,&quot;publication&quot;:{&quot;id&quot;:1174659,&quot;name&quot;:&quot;Ahead of AI&quot;,&quot;subdomain&quot;:&quot;sebastianraschka&quot;,&quot;custom_domain&quot;:&quot;magazine.sebastianraschka.com&quot;,&quot;custom_domain_optional&quot;:false,&quot;hero_text&quot;:&quot;Ahead of AI focuses on machine learning and AI research and is read by more than 150,000 researchers and practitioners who want to stay ahead in a rapidly evolving 
field.&quot;,&quot;logo_url&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/49f25d0a-212b-4853-8bcb-128d0a3edbbf_1196x1196.png&quot;,&quot;author_id&quot;:27393275,&quot;primary_user_id&quot;:27393275,&quot;theme_var_background_pop&quot;:&quot;#2096FF&quot;,&quot;created_at&quot;:&quot;2022-11-04T18:30:05.218Z&quot;,&quot;email_from_name&quot;:null,&quot;copyright&quot;:&quot;Raschka AI Research (RAIR) Lab LLC&quot;,&quot;founding_plan_name&quot;:&quot;Founding plan&quot;,&quot;community_enabled&quot;:true,&quot;invite_only&quot;:false,&quot;payments_state&quot;:&quot;enabled&quot;,&quot;language&quot;:null,&quot;explicit&quot;:false,&quot;homepage_type&quot;:&quot;newspaper&quot;,&quot;is_personal_mode&quot;:false}}],&quot;twitter_screen_name&quot;:&quot;rasbt&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:1000,&quot;status&quot;:{&quot;bestsellerTier&quot;:1000,&quot;subscriberTier&quot;:1,&quot;leaderboard&quot;:null,&quot;vip&quot;:false,&quot;badge&quot;:{&quot;type&quot;:&quot;bestseller&quot;,&quot;tier&quot;:1000},&quot;paidPublicationIds&quot;:[1783977,9873],&quot;subscriber&quot;:null}}],&quot;utm_campaign&quot;:null,&quot;belowTheFold&quot;:true,&quot;type&quot;:&quot;newsletter&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="EmbeddedPostToDOM"><a class="embedded-post" native="true" href="https://magazine.sebastianraschka.com/p/a-dream-of-spring-for-open-weight?utm_source=substack&amp;utm_campaign=post_embed&amp;utm_medium=web"><div class="embedded-post-header"><img class="embedded-post-publication-logo" src="https://substackcdn.com/image/fetch/$s_!96vs!,w_56,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F49f25d0a-212b-4853-8bcb-128d0a3edbbf_1196x1196.png" loading="lazy"><span class="embedded-post-publication-name">Ahead of AI</span></div><div class="embedded-post-title-wrapper"><div class="embedded-post-title">A Dream of Spring 
for Open-Weight LLMs: 10 Architectures from Jan-Feb 2026</div></div><div class="embedded-post-body">If you have struggled a bit to keep up with open-weight model releases this month, this article should catch you up on the main themes&#8230;</div><div class="embedded-post-cta-wrapper"><span class="embedded-post-cta">Read more</span></div><div class="embedded-post-meta">2 months ago &#183; 150 likes &#183; 7 comments &#183; Sebastian Raschka, PhD</div></a></div><h3>Models</h3><h4>General Purpose</h4><ul><li><p><strong><a href="https://huggingface.co/trillionlabs/Tri-21B-Think">Tri-21B-Think</a></strong> by <a href="https://huggingface.co/trillionlabs">trillionlabs</a>: Korea&#8217;s Trillion Labs is a repeat guest in the Artifacts series. This time, they are releasing a 21B reasoning model with support for English, Korean, and Japanese.</p></li><li><p><strong><a href="https://huggingface.co/openbmb/MiniCPM-SALA">MiniCPM-SALA</a></strong> by <a href="https://huggingface.co/openbmb">openbmb</a>: An English and Chinese 8B model with sparse attention, supporting a 1M context window.</p></li></ul>
      <p>
          <a href="https://www.interconnects.ai/p/latest-open-artifacts-19-qwen-35">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[How much does distillation really matter for Chinese LLMs?]]></title><description><![CDATA[Reacting to Anthropic's post on "distillation attacks."]]></description><link>https://www.interconnects.ai/p/how-much-does-distillation-really</link><guid isPermaLink="false">https://www.interconnects.ai/p/how-much-does-distillation-really</guid><dc:creator><![CDATA[Nathan Lambert]]></dc:creator><pubDate>Tue, 24 Feb 2026 16:06:43 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/80416abb-1851-41da-97ba-26150f154e3b_3182x1790.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Distillation has been one of the most frequent topics of discussion in the broader US-China and technological diffusion story for AI. Distillation is a term with many definitions &#8212; the colloquial one today is using a stronger AI model&#8217;s outputs to teach a weaker model. The word itself is derived from a more technical and specific definition of <em><a href="https://arxiv.org/abs/1503.02531">knowledge distillation</a></em> (Hinton, Vinyals, &amp; Dean 2015), which involves a specific way of learning to match the probability distribution of a teacher model.</p><p>The distillation of today is better described generally as synthetic data. You take outputs from a stronger model, usually via an API, and you train your model to predict those. The technical form of knowledge distillation is not actually possible from API models because they don&#8217;t expose the right information to the user.</p><p>Synthetic data is arguably the single most useful method that an AI researcher today uses to improve the models on a day to day basis. 
Yes, architecture is crucial, some data still needs exclusively human inputs, and new ideas like reinforcement learning with verifiable rewards at scale can transform the industry, but so much of the day-to-day work of improving models today is figuring out how to properly capture and scale up synthetic data.</p><p>To flesh out the point from the start of this piece, the argument has repeatedly been that the leading Chinese labs are using distillation for their models to steal capabilities from the best American API-based counterparts. The most prominent case to date was surrounding the <a href="https://fortune.com/2025/01/29/deepseek-openais-what-is-distillation-david-sacks/">release</a> <a href="https://techcrunch.com/2025/01/29/microsoft-probing-whether-deepseek-improperly-used-openais-api/">of</a> <a href="https://www.scmp.com/tech/big-tech/article/3296827/deepseeks-ai-distillation-theft-openai-seeks-answers-over-chinas-breakthrough">DeepSeek</a> R1 &#8212; where <a href="https://www.bloomberg.com/news/articles/2026-02-12/openai-accuses-deepseek-of-distilling-us-models-to-gain-an-edge">OpenAI accused DeepSeek of stealing their reasoning traces</a> by jailbreaking the API (they&#8217;re not exposed by default &#8212; for context, a reasoning trace is a colloquial term of art referring to the internal reasoning process, such as what open weight reasoning models expose to the user). Fear of distillation is also likely why Gemini quickly flipped from exposing the reasoning traces to users to hiding them. There was even very prominent, early <a href="https://arxiv.org/abs/2501.19393">reasoning research that built on Gemini</a>!</p><p>This all leads us to today&#8217;s news, where <a href="https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks">Anthropic named and directly accused a series of Chinese labs</a> of elaborate distillation campaigns on their Claude models. This is a complex issue.
In this post, we unpack a series of questions, beginning with the impact and ending with the politics. The core question is &#8212; how much of a performance benefit do Chinese labs get from distilling from American models?</p><p>To start, let&#8217;s review what Anthropic shared. From the <a href="https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks">blog post</a>, emphasis mine:</p><blockquote><p>We have identified industrial-scale campaigns by three AI laboratories&#8212;DeepSeek, Moonshot, and MiniMax&#8212;to illicitly extract Claude&#8217;s capabilities to improve their own models. These labs generated over 16 million exchanges with Claude through approximately 24,000 fraudulent accounts, in violation of our terms of service and regional access restrictions.</p><p>These labs used a technique called &#8220;distillation,&#8221; which involves training a less capable model on the outputs of a stronger one. <strong>Distillation is a widely used and legitimate training method.</strong> For example, frontier AI labs routinely distill their own models to create smaller, cheaper versions for their customers.
But distillation can also be used for illicit purposes: competitors can use it to acquire powerful capabilities from other labs in a fraction of the time, and at a fraction of the cost, that it would take to develop them independently.</p></blockquote><p>Much like the models themselves, the benefits of distillation are very jagged. For some capabilities, particularly if you don&#8217;t have a full training pipeline set up for it, quickly distilling some data from the leading frontier model in that area can yield massive performance boosts. This can definitely help the lab distilling from the API catch up much more quickly than it otherwise would. Most distillation is rather benign, using many tokens of an LLM to help process and refine existing data &#8212; putting a lot of compute into getting a few high-quality training tokens out. This sort of raw data processing work can be done on many different APIs, but one tends to be best.</p><p>When we go into what Anthropic says the three Chinese LLM builders actually used the Claude API for &#8212; as an aside, Anthropic didn&#8217;t confirm that the attack was done through the API, the chat app, or Claude Code &#8212; the actual impact of the operations is very mixed. It&#8217;s hard to know how much untracked usage these labs deployed for other projects (or other American models).</p><p>To start, Anthropic puts DeepSeek first in their blog post because they&#8217;re the household name in the US for Chinese AI.
The extent of their use is actually quite small, showing how this post is more about the big picture than the details:</p><blockquote><p><strong>DeepSeek</strong></p><p><em>Scale: Over 150,000 exchanges</em></p><p>The operation targeted:</p><ul><li><p>Reasoning capabilities across diverse tasks</p></li><li><p>Rubric-based grading tasks that made Claude function as a reward model for reinforcement learning</p></li><li><p>Creating censorship-safe alternatives to policy sensitive queries</p></li></ul></blockquote><p>In the scale of training a language model, 150K samples is only scratching the surface as a substantive experiment. It looks like they were experimenting with some rubrics, which could&#8217;ve been for an online RL run, but that&#8217;s extremely unlikely with how distributed the access was, and then some minor stuff on completions for sensitive queries. This usage of Anthropic&#8217;s API will have a negligible impact on DeepSeek&#8217;s long-rumored V4 model (or whichever model the data here contributed to). This was also very likely a small team at DeepSeek and unknown to much of the broader training organization.</p><p>The other two labs, Moonshot AI (makers of the <a href="https://www.interconnects.ai/p/kimi-k2-thinking-what-it-means">Kimi</a> models) and MiniMax reflected much broader usage.</p><blockquote><p><strong>Moonshot AI</strong></p><p><em>Scale: Over 3.4 million exchanges</em></p><p>The operation targeted:</p><ul><li><p>Agentic reasoning and tool use</p></li><li><p>Coding and data analysis</p></li><li><p>Computer-use agent development</p></li><li><p>Computer vision</p></li></ul><p><strong>MiniMax</strong></p><p><em>Scale: Over 13 million exchanges</em></p><p>The operation targeted:</p><ul><li><p>Agentic coding</p></li><li><p>Tool use and orchestration</p></li></ul></blockquote><p>The role of distillation is constantly changing. 
Distilling from Claude today for its agentic behavior is much more valuable than versions of Claude have been as a teacher in the past. Claude Opus 4.6 has well-rounded agentic navigation that none of the other models quite match. Why not try training on some of the model outputs to see if your model absorbs it? Over the next few months, that&#8217;ll be less differentiated. It&#8217;s sort of like how all the models are way better at math today than most people need &#8212; there are plenty of places to distill from.</p><p>Estimates will vary, but if each exchange had 10-25K tokens, the ~16.4 million exchanges across these two labs, mostly from MiniMax, would total roughly 150-400 billion tokens. This is a substantial amount, which could meaningfully improve a model&#8217;s post-training. For example, in Olmo 3 we had an SFT dataset of 20 billion tokens that could be built like this, and increasing it by 10X would be very reasonable.</p><p>These numbers are just scratching the surface of total synthetic data generation across APIs hosted by US companies. At the same time, quantity is a pretty crude way to measure impact. Just taking the outputs from Claude and figuring out how to add them to your model pipeline isn&#8217;t easy. The research community has seen many cases where taking outputs from a certain teacher model unexpectedly makes the student worse &#8212; subtle interactions between the data make it variable and tricky to do this type of distillation. It&#8217;s fundamentally a research problem.</p><p>This is what I&#8217;m sure the Chinese labs are innovating at. There&#8217;s an argument that Chinese frontier labs are substantially more efficient than their Western counterparts &#8212; this is misleading.</p><p>The labs operate under different constraints. The Chinese labs are likely slightly more efficient out of necessity, being lower on resources, but overall the picture of talent access is very similar.
The Chinese labs also approach benchmarks differently, making it appear that they&#8217;re a bit closer than they really are (and <a href="https://www.interconnects.ai/p/open-models-in-perpetual-catch-up">appearing as if they&#8217;re potentially surpassing</a>). This is needed to get momentum and brand recognition in the AI market.</p><p>The Chinese labs likely innovate greatly on distilling from leading API models, due to their restricted access to GPUs. GPUs could be used to construct synthetic data, but for organizations with more funding than they can spend on research compute (being supply limited), using API-based models is one of the few other options for effectively getting more compute. It&#8217;s way easier to figure out getting access to &#8220;banned&#8221; API models than it is to smuggle tens of thousands of physical GPUs and get them set up.</p><p>It&#8217;s not only the Chinese labs that operate like this. Synthetic data from a model you don&#8217;t own is all arguably distillation. Distillation is a shortcut to more compute for anyone. It&#8217;s also a far less risky cost, as having a big cluster for research requires a very large financial commitment, whereas APIs are pay-as-you-go.
For example, in <a href="https://arxiv.org/abs/2512.13961">Olmo 3</a> we used millions of GPU hours on the <a href="https://en.wikipedia.org/wiki/Frontier_(supercomputer)">Frontier supercomputer</a> and Azure credits through <a href="https://nairrpilot.org/">NAIRR</a> for synthetic data. We didn&#8217;t have the equivalent in GPUs (or really the cash, thank you research credits!).</p><p>Altogether, it&#8217;s very fair for Anthropic to be concerned about this. I still wouldn&#8217;t say it is a <em>crucial</em> factor in these Chinese labs&#8217; post-training capabilities, especially not one that&#8217;ll be easy to measure as a time gap to matching the model they&#8217;re distilling from, a la the US-China performance lag.</p><p>If we take a step back, there was even a time when Claude Sonnet was the flagship model ahead of Opus (I think this was with Sonnet 3.5); much of this came from it being <em>well distilled</em> internally from Opus checkpoints. Fast iteration and high-quality data can go very far, letting student models surpass the teacher. Frontier labs use this to their advantage, by having internal-only models for generating synthetic data, but saying that Chinese models could never pass the US frontier due to data distillation is like saying that Claude Sonnet could never beat Opus. It&#8217;s unlikely, and it depends a lot on release times, but with AI models making dramatic progress, weirder things like this have literally already happened.</p><p>The biggest factor unaddressed here is how distillation from stronger teacher models is harder in an era when reinforcement learning at scale is needed to train the best models. You can spend compute carefully crafting and filtering prompts, but you still need to train the model yourself with substantial, on-policy inference &#8212; generation is the majority of the compute cost for RL and it can&#8217;t be generations from another model. For this reason, I expected this story to die down a bit.
It&#8217;s clear from their <a href="https://arxiv.org/abs/2501.12948">open</a> <a href="https://arxiv.org/abs/2506.13585">research</a> <a href="https://arxiv.org/abs/2507.20534">that</a> <a href="https://arxiv.org/abs/2602.15763">Chinese</a> <a href="https://arxiv.org/abs/2512.02556">labs</a> have excellent RL infrastructure, despite the compute shortages.</p><p>The reason I expected it to fade is that distilling models for &#8220;competitive purposes&#8221; has violated the terms of service of API models for quite some time. Academics and open model builders in the US used to greatly worry about and debate this (and I&#8217;ve written about it multiple times in <a href="https://www.interconnects.ai/p/ml-moats">2022</a> and <a href="https://www.interconnects.ai/p/llm-synthetic-data">2023</a>). Only later in 2024 did that worry die down in the community (and no action has been taken against any smaller model builders).</p><p>This action from Anthropic represents another step ratcheting up the AI geopolitical tension. Kneecapping model distillation will be far harder than restricting the shipments of physical goods like GPUs. In many ways, fully restricting distillation through distributed access methods seems almost impossible; restricting GPU sales would be far more impactful.</p><p>Anthropic and the AI industry should choose their battles. When API endpoints are available for the best models, other entities will use them to train variants of said model. This is a natural evolution of AI models. If AI models are so precious that distillation is an extreme risk, then the models will be restricted to first-party products. Anthropic has a choice to do this with their latest models.
The market for API-based model alternatives may be so competitive that some companies go down this path &#8212; likely in part due to Chinese models undercutting on price &#8212; but an API is a fundamental offering that no leading lab will risk walking back from anytime soon.</p>]]></content:encoded></item><item><title><![CDATA[Open models in perpetual catch-up]]></title><description><![CDATA[The open-closed gap, distillation, innovation timescales, how open models win, specialized models, what&#8217;s missing, etc.]]></description><link>https://www.interconnects.ai/p/open-models-in-perpetual-catch-up</link><guid isPermaLink="false">https://www.interconnects.ai/p/open-models-in-perpetual-catch-up</guid><dc:creator><![CDATA[Nathan Lambert]]></dc:creator><pubDate>Tue, 17 Feb 2026 17:27:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!oyTU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2cad176-f718-4046-8486-161c1111435e_2680x1366.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Every 4-6 months a new open-weights model comes out that causes a clamor of discussion on how open models are closer than they ever have been to the best closed, frontier models. The most recent is Z.ai&#8217;s <a href="https://huggingface.co/zai-org/GLM-5">GLM 5</a> model, the latest leading open-weights model from a Chinese company. In the last 12 months the new part of this story is that all of the open models of discussion are coming from China, where previously they were almost always Meta&#8217;s Llamas. These moments of discussion are always reflective for me: despite being one of open models&#8217; biggest advocates, I always find the narrative overblown &#8212; open models are not meaningfully accelerating towards matching the best closed models in absolute performance.
The ~6-month gap is holding steady.</p><p>At the same time, it&#8217;s worth discussing what happens as open models keep getting way better. Open models are staying far closer on the heels of the best closed models than I, and many other experts following the ecosystem, would expect. On paper, the top three American labs &#8212; Anthropic, OpenAI, and Google &#8212; have vastly more resources at play for training and research. In this world, many would have expected a more obviously growing margin between the best open and closed models. Raw research compute, data purchases, user data, etc. are all providing relatively fine margins. Maybe it&#8217;s the scaling laws&#8217; log-linear relationship from compute to performance coming into play?</p><p>The plot of the day is the ArtificialAnalysis Intelligence Index for <a href="https://artificialanalysis.ai/models/open-source">open vs. closed models over time</a>. The point of this post isn&#8217;t to nitpick this index&#8217;s many limitations, or any other, but to reflect on what this chart doesn&#8217;t represent and what it means for the AI world for open weights to keep pace year in and year out.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oyTU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2cad176-f718-4046-8486-161c1111435e_2680x1366.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oyTU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2cad176-f718-4046-8486-161c1111435e_2680x1366.png 424w, 
https://substackcdn.com/image/fetch/$s_!oyTU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2cad176-f718-4046-8486-161c1111435e_2680x1366.png 848w, https://substackcdn.com/image/fetch/$s_!oyTU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2cad176-f718-4046-8486-161c1111435e_2680x1366.png 1272w, https://substackcdn.com/image/fetch/$s_!oyTU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2cad176-f718-4046-8486-161c1111435e_2680x1366.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oyTU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2cad176-f718-4046-8486-161c1111435e_2680x1366.png" width="1456" height="742" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c2cad176-f718-4046-8486-161c1111435e_2680x1366.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:742,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:592365,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.interconnects.ai/i/188211391?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2cad176-f718-4046-8486-161c1111435e_2680x1366.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!oyTU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2cad176-f718-4046-8486-161c1111435e_2680x1366.png 424w, https://substackcdn.com/image/fetch/$s_!oyTU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2cad176-f718-4046-8486-161c1111435e_2680x1366.png 848w, https://substackcdn.com/image/fetch/$s_!oyTU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2cad176-f718-4046-8486-161c1111435e_2680x1366.png 1272w, https://substackcdn.com/image/fetch/$s_!oyTU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2cad176-f718-4046-8486-161c1111435e_2680x1366.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The benchmark mixes a ton of factors into one score that judges model &#8220;quality.&#8221; This compresses far too many error bars, stories, and weaknesses into one metric. These metrics will always be used to inform policy and help more people understand the high-level trends of AI, but they do a poor job of capturing the <em>frontier</em> of AI progress. </p><p>The frontier of AI has <a href="https://www.interconnects.ai/p/opus-46-vs-codex-53">never been harder to capture in public benchmarks</a>. Building benchmarks is now super expensive and requires deep knowledge of the latest models and what they do and do not excel at. Well-known issues like SWE-Bench being almost 3/4 Django or Terminal Bench 2 being crowdsourced and a bit noisy will never be captured here. </p><p>Time and time again it has been shown that the leading frontier labs in the U.S. have a better read on the capabilities that actually matter, and the public benchmarks tend to be a bit easier to overfit to. Qwen&#8217;s recent flagship v3.5 model has again been plagued with numerous complaints of benchmaxing (while some out-of-distribution weirdness is debatably implementation errors, on Alibaba&#8217;s own API).</p><p>The combination of all these factors has pushed me to advocate for &#8220;no averaging across our evaluation suite&#8221; when communicating the value of our latest Olmo models at Ai2 (see my <a href="https://youtu.be/uaZ3yRdYg8A?si=31zxbDFqqqXHwJIR&amp;t=2465">recent talk</a> on evals). 
The best models are indeed very close together, but averages can completely hide from a careless reader that a single eval is dramatically different.</p><p>Altogether, I&#8217;d bet that the current Artificial Analysis Intelligence Index is a bit unrepresentative of the true frontier, rather than open models being closer to the closed models than ever before (yes, I know, it&#8217;s not like I am offering any obvious ways to improve it). The one domain where I foresee open models staying close behind is coding, where public GitHub data and clever verifiable rewards present a ton of potential performance gains.</p><p>The overall balance in the ecosystem is between the value of the most intelligent model &#8212; which many people like myself still pay for despite open models&#8217; improvements &#8212; and the incredible cost reductions that come once a given task is achievable by a permissively licensed open model. The best closed models keep unlocking even more valuable tasks, keeping open models in a state of perpetual catch-up. The industry continues to reinvent itself at a blistering pace.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.interconnects.ai/p/open-models-in-perpetual-catch-up?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.interconnects.ai/p/open-models-in-perpetual-catch-up?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><p>On to the seven other big trends in open models.</p><h3>1. The open model frontier is brutally competitive</h3><p>2025 witnessed a sort of &#8220;Cambrian Explosion&#8221; of open weight models with very impressive benchmark scores. 
This market is far more populated than that of closed, API-based models (where there are four substantive providers), yet open model adoption is brutally concentrated. Only the most successful models ever get any adoption. This is going to push many small and mid-sized model builders across the ecosystem to shift to a specific niche or a different business plan over the coming months or years.</p><p>As a model builder, this hits super close to home. Even though models are fairly sticky (at least more sticky than the general coverage would indicate) &#8212; many open models are set up once if performance is good enough, and never replaced &#8212; the likelihood that most models even get tried once goes down month over month as the ecosystem gets more competitive.</p><p>In my <a href="https://www.interconnects.ai/p/8-plots-that-explain-the-state-of">post</a> on the state of open models earlier this year, I even learned that Qwen gets dominated on adoption metrics at the biggest scale of models. 
This continues to surprise me!</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!L-lz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a16175-d9e6-4ca9-ae31-46c84f25d693_1872x1156.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!L-lz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a16175-d9e6-4ca9-ae31-46c84f25d693_1872x1156.png 424w, https://substackcdn.com/image/fetch/$s_!L-lz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a16175-d9e6-4ca9-ae31-46c84f25d693_1872x1156.png 848w, https://substackcdn.com/image/fetch/$s_!L-lz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a16175-d9e6-4ca9-ae31-46c84f25d693_1872x1156.png 1272w, https://substackcdn.com/image/fetch/$s_!L-lz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a16175-d9e6-4ca9-ae31-46c84f25d693_1872x1156.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!L-lz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a16175-d9e6-4ca9-ae31-46c84f25d693_1872x1156.png" width="1456" height="899" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/26a16175-d9e6-4ca9-ae31-46c84f25d693_1872x1156.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:899,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!L-lz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a16175-d9e6-4ca9-ae31-46c84f25d693_1872x1156.png 424w, https://substackcdn.com/image/fetch/$s_!L-lz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a16175-d9e6-4ca9-ae31-46c84f25d693_1872x1156.png 848w, https://substackcdn.com/image/fetch/$s_!L-lz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a16175-d9e6-4ca9-ae31-46c84f25d693_1872x1156.png 1272w, https://substackcdn.com/image/fetch/$s_!L-lz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F26a16175-d9e6-4ca9-ae31-46c84f25d693_1872x1156.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The upshot is that competition at the frontier of performance for models is most concentrated in the popular benchmarks of the day, especially with large MoE models &#8212; this will drive exploration and innovation towards other cases where open models can actually win on overall business value.</p><h3>2. Specialized, small, fast, and cheap open models are missing</h3><p>There&#8217;s a large underserved market in specialized models for the enterprise, particularly with tools (maybe GPT OSS&#8217;s success is somewhat related to this). Generally, the idea would be to either release the weights, or the method for creating them, that are excellent in valuable, repetitive tasks. With agents becoming more prominent, these models should be able to perform repetitive, agent sub-tasks at small percentages of the cost of large frontier models, while being faster, private, and directly owned. 
For example, what if one open weight model were deployed with multiple PEFT adapters, one per skill, allowing high utilization and extensibility?</p><p>I&#8217;ve specifically heard this request from multiple enterprises building agents. While the Qwen models are fantastic at small sizes, open models tend to be very jagged in performance, so multiple options would likely be needed to get this off the ground. It&#8217;s also limited by a general lack of frontier-quality post-training recipes, especially when it comes to adapting a model to a specific domain or set of tasks not covered in academic benchmarks. In this view, most of the domain-specific models of today, like math or biology models, are actually not specialized enough.</p><p>This is one of many issues I see repeatedly that show the open model ecosystem&#8217;s major blind spots. The biggest reason the open model ecosystem seems a bit misunderstood externally, or confused in itself, is that open models take a long time to figure out and get into the world.</p><h3>3. Understanding open models is massively under-indexed</h3><p>There should be more research organizations fully dedicated to understanding how open models work technically and geopolitically. There could be entire think-tanks in DC informing the public on what is happening, and uncovering information buried in hackathons and new research labs in San Francisco. For Interconnects and <a href="https://atomproject.ai/">The ATOM Project</a> I&#8217;m at the frontier of this work, which often entails <em>uncovering new raw data</em> on how open models are used. This data is always messy and imperfect, and often flat out confusing. 
Understanding open models is how we keep track of the direction of global diffusion for the most important technology in decades, and it feels like there is almost no public work doing so.</p><p>Here&#8217;s some new data on open model <em>usage</em> courtesy of <a href="http://openrouter.ai/">OpenRouter</a>, which largely mirrors the adoption trends we&#8217;ve been seeing. While HuggingFace downloads are obviously very noisy, almost every other adoption metric over time looks strongly correlated with them, especially on U.S. vs. China issues.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!P0Nw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2b13721-7a67-46c9-b83f-1ea16f4cff7c_1806x1218.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!P0Nw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2b13721-7a67-46c9-b83f-1ea16f4cff7c_1806x1218.jpeg 424w, https://substackcdn.com/image/fetch/$s_!P0Nw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2b13721-7a67-46c9-b83f-1ea16f4cff7c_1806x1218.jpeg 848w, https://substackcdn.com/image/fetch/$s_!P0Nw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2b13721-7a67-46c9-b83f-1ea16f4cff7c_1806x1218.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!P0Nw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2b13721-7a67-46c9-b83f-1ea16f4cff7c_1806x1218.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!P0Nw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2b13721-7a67-46c9-b83f-1ea16f4cff7c_1806x1218.jpeg" width="1456" height="982" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d2b13721-7a67-46c9-b83f-1ea16f4cff7c_1806x1218.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:982,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!P0Nw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2b13721-7a67-46c9-b83f-1ea16f4cff7c_1806x1218.jpeg 424w, https://substackcdn.com/image/fetch/$s_!P0Nw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2b13721-7a67-46c9-b83f-1ea16f4cff7c_1806x1218.jpeg 848w, https://substackcdn.com/image/fetch/$s_!P0Nw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2b13721-7a67-46c9-b83f-1ea16f4cff7c_1806x1218.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!P0Nw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd2b13721-7a67-46c9-b83f-1ea16f4cff7c_1806x1218.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" 
type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>As an aside, if this work monitoring the open ecosystem sounds appealing to you, please reach out or leave a comment &#8212; I&#8217;m thinking about how to scale up our impact in this area!</em></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.interconnects.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Interconnects AI is a reader-supported publication. 
Consider becoming a subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h3>4. Nations will turn to open models as the only way to get an initial foothold in sovereign AI (and sovereign AI is the real deal)</h3><p>Sovereign AI has largely been unfolding slowly in the background of frontier AI discussions and the U.S.-China arms race, but it&#8217;ll only become more prevalent as AI becomes more deeply embedded in our technological <em>reality</em>. Every wealthy nation will see AI as a direction for influence in addition to a necessity for national security. Open models will likely be the only way to get this off the ground as a real effort, in order to have the local AI community and economy seamlessly integrate with it.</p><h3>5. Futures where open-source wins the frontier are still possible, but seemingly less likely</h3><p>The most likely (by far) outcome is for the status quo to continue and for the best open models to lag the best closed models by 6-9 months. A large portion of the perpetual catch-up is likely due to the best open model builders constantly distilling their models on the strongest, currently available closed API models, but this direction seems less relevant with the rise of RL. Post-training today is more about the model undergoing <em>experience</em> rather than directly learning from the smartest teacher you can find. The paths to open models winning come through fundamental innovation. This looks like the ability to merge, rotate, and share expert models, a dramatic (100X+) reduction in the cost of training, etc. 
Predicting this before it happens is more of a sci-fi story than faithful science; if I could predict it, I&#8217;d just go build the damn thing.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.interconnects.ai/p/open-models-in-perpetual-catch-up?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.interconnects.ai/p/open-models-in-perpetual-catch-up?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><h3>6. China&#8217;s open model &#8220;ecosystem&#8221; makes it the most likely place for a discovery around who wins</h3><p>China has <a href="https://www.interconnects.ai/p/2025-open-models-year-in-review">many labs</a> building models on top of their peers&#8217; innovations. This intentional sharing of ideas provides immense benefits relative to Silicon Valley&#8217;s quid pro quo, where it&#8217;s accepted that people go home at the end of the day and chat with some of their friends about the latest technical secrets of their models. The sort of sharing the Chinese companies do, especially considering more of them have closer ties to the nation&#8217;s scientific and academic institutions, is the sort of setup that lets new standards converge much faster and breakthroughs be shared. This is another unknown factor, like the potential innovations where open models &#8220;win,&#8221; but it&#8217;s important because China has created its own conditions for potential, massive success, and the U.S. has no answer. This divergence in how the ecosystems operate could be nothing in the long term, but U.S. AI companies cannot do much to compete with it if it takes off.</p><h3>7. 
Open models dictate science and diffusion &#8212; slower trends than the frontier of AI</h3><p>The biggest impact in AI in terms of transforming day-to-day life, and even the world&#8217;s power structures, will obviously come from the most powerful and intelligent models. It is fairly obvious then that the open models that end up in closest proximity to this capture the headlines &#8212; if an open-weights model does, somehow, happen to claim the title of &#8220;the world&#8217;s most powerful model,&#8221; there will be extreme economic consequences.</p><p>In the real world, the one with the highest probability of occurring, open models&#8217; biggest influence will be in two very slow-moving sectors: 1) fundamental research/innovation and 2) global technological diffusion. I&#8217;ve personally realized how much of my excitement for open models is a bit misguided &#8212; I&#8217;m trying to understand the frontier of AI through the lens of these models, missing the bigger story of how technology slowly reshapes the world&#8217;s biggest companies.</p><p>Consider when Llama was the open SOTA model: everyone in the U.S. and China did science on Llama, which then impacted subsequent models &#8212; even if we didn&#8217;t hear directly from Meta on how. Now the default is Qwen. Qwen is the anchor of the Chinese ecosystem. Language model research is proceeding extremely fast, which could make the fundamental improvements made in research labs impact the frontier of the technology much faster than usual.</p><p>At the same time, the global default for using AI outside of the wealthiest few nations will be to use either free applications like ChatGPT or open weight models. ChatGPT doesn&#8217;t fit a lot of business use-cases, so open weight models are a melting pot for innovation that we largely have no visibility into. 
When we zoom out to a timeline closer to decades, open models&#8217; global adoption seems like a top trend to follow in AI.</p><h2>Conclusion</h2>
      <p>
          <a href="https://www.interconnects.ai/p/open-models-in-perpetual-catch-up">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Opus 4.6, Codex 5.3, and the post-benchmark era]]></title><description><![CDATA[On comparing models in 2026.]]></description><link>https://www.interconnects.ai/p/opus-46-vs-codex-53</link><guid isPermaLink="false">https://www.interconnects.ai/p/opus-46-vs-codex-53</guid><dc:creator><![CDATA[Nathan Lambert]]></dc:creator><pubDate>Mon, 09 Feb 2026 14:03:12 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/93f85dee-802e-496f-909b-a0f579a994eb_3182x1790.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Last Thursday, February 5th, both OpenAI and Anthropic unveiled the next iterations of their models designed as coding assistants, <a href="https://openai.com/index/introducing-gpt-5-3-codex/">GPT-5.3-Codex</a> and <a href="https://www.anthropic.com/news/claude-opus-4-6">Claude Opus 4.6</a>, respectively. Ahead of this, Anthropic had a firm grasp of the mindshare as everyone collectively <a href="https://www.interconnects.ai/p/get-good-at-agents">grappled with the new world of agents</a>, primarily driven by a <a href="https://www.interconnects.ai/p/claude-code-hits-different">Claude Code with Opus 4.5</a>-induced step change in performance. This post doesn&#8217;t unpack how software is changing forever, how <a href="https://thezvi.substack.com/p/welcome-to-moltbook">Moltbook</a> is showcasing the future, how ML research is accelerating, or the many broader implications; rather, it covers how to assess, live with, and prepare for new models. 
The fine margins between Opus 4.6 and Codex 5.3 will be felt in many model versions this year, with Opus ahead in this matchup on usability.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.interconnects.ai/p/opus-46-vs-codex-53?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.interconnects.ai/p/opus-46-vs-codex-53?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><p>Going into these releases I&#8217;d been using Claude Code extensively as a general computer agent, with some software engineering and a lot of data analysis, automation, etc. I had dabbled with Codex 5.2 (usually on xhigh, maximum thinking effort), but found it didn&#8217;t quite work for me among my broad, horizontal set of tasks.</p><p>For the last few days, I&#8217;ve been using both of the models much more evenly. I mean this as a great compliment, but Codex 5.3 feels much more Claude-like, where it&#8217;s much faster in its feedback and much more capable in a broad suite of tasks from git to data analysis (previous versions of Codex, including up to 5.2, regularly failed basic git operations like creating a fresh branch). Codex 5.3 takes a very important step towards Claude&#8217;s territory by having better product-market fit. This is a big move for OpenAI, and between the two models, Codex 5.3 feels far more different from its predecessors.</p><p>OpenAI&#8217;s latest GPT, with this context, keeps an edge as a better <em>coding model</em>. 
It&#8217;s hard to make this general claim precise, and a lot of it is based on reading others&#8217; work, but it seems to be a bit better at finding bugs and fixing things in codebases, such as the <a href="https://github.com/natolambert/rlhf-book/pull/243">minimal algorithmic examples</a> for my RLHF Book. In my experience, this is a minor edge, and the community thinks that this is most apparent in complex situations (i.e., not most vibe-coded apps). </p><p>As users become better at supervising these new agents, having the best top-end ability in software understanding and creation could become a meaningful edge for Codex 5.3, but it is not an obvious advantage today. Many of my most trusted friends in the AI space swear by Codex because it can be just this tiny bit better. I haven&#8217;t been able to unlock it.</p><p>Switching from Opus 4.6 to Codex 5.3 feels like I need to babysit the model with more detailed descriptions when doing somewhat mundane tasks like &#8220;clean up this branch and push the PR.&#8221; I can trust Claude to understand the context of the fix and generally get it right, where Codex can skip files, put stuff in weird places, etc.</p><p>Both of these releases feel like the companies are pushing for capabilities and speed of execution in the models, but at the cost of some ease of use. I&#8217;ve found both Opus 4.6 and Codex 5.3 ignoring an instruction if I queue up multiple things to do &#8212; they&#8217;re really best when given well-scoped, clear problems (especially Codex). Claude Code&#8217;s harness has a terrible bug that makes subagents brick the terminal, where new messages say you must compact or clear, but compaction fails. </p><p>Despite Codex&#8217;s massive step, OpenAI still has a large gap to close to Claude on the product side. Opus 4.6 is another step in the right direction, where Claude Code feels like a great experience. 
It&#8217;s approachable, it tends to work in the wide range of tasks I throw at it, and this&#8217;ll help them gain much broader adoption than Codex. If I&#8217;m going to recommend a coding agent to an audience who has limited-to-no software experience, it&#8217;s certainly going to be Claude. At a time when agents are just emerging into general use, this is a massive advantage, both in mindshare and feedback in terms of usage data.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><p>In the meantime, there&#8217;s no cut-and-dried guideline on which agent you need to use for any use-case; you need to <a href="https://www.interconnects.ai/p/use-multiple-models">use multiple models</a> all the time and keep up with the skill that is managing agents. </p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.interconnects.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Interconnects AI is a reader-supported publication. Consider becoming a subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h2>Assessing models in 2026</h2><p>There have been many hints through 2025 that we were heading toward an AI world where benchmarks associated with model releases no longer convey meaningful signal to users. 
Back in the time of the GPT-4 or Gemini 2.5 Pro releases, the benchmark deltas could be easily felt within the chatbot form factor of the day &#8212; models were more reliable, could do more tasks, etc. This continued through models like OpenAI&#8217;s o3. During this phase of AI&#8217;s buildout, roughly from 2023 to 2025, we were assembling the core functionality of modern language models: tool-use, extended reasoning, basic scaling, etc. The gains were obvious.</p><p>It should be clear with the releases of both Opus 4.6 and Codex 5.3 that benchmark-based release reactions barely matter. For this release, I barely looked at the evaluation scores. I saw that Opus 4.6 had a bit better search scores and Codex 5.3 used far fewer tokens per answer, but neither of these was going to convince me they were much better models. </p><p>Each of the AI laboratories, and the media ecosystems covering them, has been making this transition away from standard evaluations at its own pace. The most telling example is the Gemini 3 Pro release in November of 2025. The collective vibe was that Google was back in the lead. Kevin Roose, self-proclaimed &#8220;<a href="https://x.com/kevinroose/status/1900535165874827379">AGI-pilled</a>&#8221; NYTimes reporter in SF <a href="https://www.infoq.com/news/2025/11/google-gemini-3/">said</a>:</p><blockquote><p>There's sort of this feeling that Google, which kind of struggled in AI for a couple of years there &#8212; they had the launch of Bard and the first versions of Gemini, which had some issues &#8212; and I think they were seen as sort of catching up to the state of the art. 
And now the question is: <strong>is this them taking their crown back?</strong></p></blockquote><p>We don&#8217;t need to dwell on the depths of Gemini&#8217;s current crisis, but they have effectively no impact at the frontier of coding agents, the area that feels most likely to see dramatic strides in performance &#8212; and, dare I say, to meet many commonly accepted definitions of AGI that center around the notion of a &#8220;remote worker.&#8221; The timeline has left them behind 2 months after their coronation, showing Gemini 3 was a false king.</p><p>On the other end of the spectrum is Anthropic. With Anthropic&#8217;s release of Claude 4 in May of 2025, I was <a href="https://www.interconnects.ai/p/claude-4-and-anthropics-bet-on-code">skeptical of their bet on code</a> &#8212; I was distracted by the glitz of OpenAI and Gemini trading blows with announcements like models achieving <a href="https://deepmind.google/blog/advanced-version-of-gemini-with-deep-think-officially-achieves-gold-medal-standard-at-the-international-mathematical-olympiad/">IMO Gold medals</a> in mathematics or other evaluation breakthroughs.</p><p>Anthropic deserves serious credit for the focus of its vision. They were likely not the only AI lab to note the coming role of agents, but they were by far the first to shift their messaging and prioritization towards this. In my <a href="https://www.interconnects.ai/p/summertime-outlook-o3s-novelty-coming">post in June of 2025</a>, a month after Claude 4 was released, I was coming around to them being right to deprioritize standard benchmarks:</p><blockquote><p>This is a different path for the industry and will take a different form of messaging than we&#8217;re used to. More releases are going to look like <a href="https://www.interconnects.ai/p/claude-4-and-anthropics-bet-on-code">Anthropic&#8217;s Claude 4</a>, where the benchmark gains are minor and the real world gains are a big step. 
There are plenty more implications for policy, evaluation, and transparency that come with this. It is going to take much more nuance to understand if the pace of progress is continuing, especially as critics of AI are going to seize the opportunity of evaluations flatlining to say that AI is no longer working.</p></blockquote><p>This leaves me reflecting on the role of Interconnects&#8217; model reviews in 2026. 2025 was characterized by many dramatic, day-of model release blog posts, with the entry of many new Chinese open model builders, OpenAI&#8217;s first open language model since GPT-2, and of course the infinitely hyped GPT-5. These timely release posts still have great value &#8212; they center the conversation around the current snapshot of a company vis-a-vis the broader industry, but if models remain similar, they&#8217;ll do little to disentangle the complexity in mapping the current frontier of AI. </p><p>In order to serve my role as an independent voice tracking the frontier models, I need to keep providing regular updates on how I&#8217;m using models, why, and why not. Over time, the industry is going to develop better ways of articulating the differences in agentic models. For the next few months, maybe even years, I expect the pace of progress to be so fast and uneven in agentic capabilities that consistent testing and clear articulation will be the only way to monitor it.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>The emerging frontier of coding agents is in the use of subagents (or &#8220;<a href="https://code.claude.com/docs/en/agent-teams">agent teams</a>&#8221;, which are subagents that can work together), where the primary orchestration agent sends off copies of itself to work on pieces of the problem. 
Claude is slightly ahead here with more polished features, but the space will evolve quickly, and maybe OpenAI can take their experiences with products like GPT-Pro to make a Pro agent.</p><p>The GPT-Pro line of models is a major advantage OpenAI has over Anthropic. I use them all the time. As we learn to use these agents for more complex, long-term tasks, harnessing more compute on a single problem will be a crucial differentiator. </p></div></div>]]></content:encoded></item><item><title><![CDATA[Why Nvidia builds open models with Bryan Catanzaro]]></title><description><![CDATA[Interconnects interview #17 on the past, present, and future of the Nemotron project.]]></description><link>https://www.interconnects.ai/p/why-nvidia-builds-open-models-with</link><guid isPermaLink="false">https://www.interconnects.ai/p/why-nvidia-builds-open-models-with</guid><dc:creator><![CDATA[Nathan Lambert]]></dc:creator><pubDate>Wed, 04 Feb 2026 18:00:28 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/186564530/75b1646178b010213d8417a20d5768ef.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>One of the big stories of 2025 for me was how Nvidia massively stepped up their open model program &#8212; more releases, higher quality models, joining a small handful of companies releasing datasets, etc. In this interview, I sat down with one of the 3 VPs leading the effort of 500+ technical staff, Bryan Catanzaro, to discuss:</p><ul><li><p>Their very impressive Nemotron 3 Nano model released in Dec. 
2025, and the bigger Super and Ultra variants coming soon,</p></li><li><p>Why Nvidia&#8217;s business clearly benefits from them building open models,</p></li><li><p>How the Nemotron team culture was crafted in pursuit of better models,</p></li><li><p>Megatron-LM and the current state of open-source training software,</p></li><li><p>Career reflections and paths into AI research,</p></li><li><p>And other topics.</p></li></ul><p>The biggest takeaway I had from this interview is how Nvidia understands their unique role as a company that can both build and directly capture the value they get from building open language models, giving them a uniquely sustainable advantage. </p><p>Bryan has a beautiful analogy for open models this early in AI&#8217;s development, and how they are a process of creating &#8220;potential energy&#8221; for AI&#8217;s future applications.</p><p>I hope you enjoy it!</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.interconnects.ai/p/why-nvidia-builds-open-models-with?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.interconnects.ai/p/why-nvidia-builds-open-models-with?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><p>Guest: <strong>Bryan Catanzaro</strong>, VP Applied Deep Learning Research (ADLR), NVIDIA. 
X:<a href="https://x.com/ctnzr"> @ctnzr</a>, <a href="https://www.linkedin.com/in/bryancatanzaro/">LinkedIn</a>, <a href="https://scholar.google.com/citations?user=UZ6kI2AAAAAJ">Google Scholar</a>.</p><p>Listen on <a href="https://podcasts.apple.com/us/podcast/interconnects-audio/id1719552353">Apple Podcasts</a>, <a href="https://open.spotify.com/show/6XNzfJULeVxR7SneeesDUs">Spotify</a>, <a href="https://www.youtube.com/@interconnects">YouTube</a>, and <a href="https://www.interconnects.ai/podcast">wherever you get your podcasts</a>. For other Interconnects interviews, <a href="https://www.interconnects.ai/t/interviews">go here</a>.</p><div id="youtube2-Y3Vb6ecvfpU" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;Y3Vb6ecvfpU&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/Y3Vb6ecvfpU?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h3>Nemotron Model Timeline</h3><p><strong>2019&#8211;2022 &#8212; Foundational Work</strong></p><ul><li><p><a href="https://arxiv.org/abs/1909.08053">Megatron-LM</a> (model parallelism framework that has become very popular again recently; alternatives:<a href="https://github.com/deepspeedai/DeepSpeed"> DeepSpeed</a>, PyTorch FSDP).</p></li><li><p><a href="https://github.com/NVIDIA/NeMo">NeMo Framework</a> (NVIDIA&#8217;s end-to-end LLM stack: training recipes, data pipelines, evaluation, deployment).</p></li></ul><p><strong>Nov 2023 &#8212; Nemotron-3 8B:</strong> Enterprise-ready NeMo models<em>.</em> Models:<a href="https://huggingface.co/nvidia/nemotron-3-8b-base-4k"> base</a>,<a href="https://huggingface.co/nvidia/nemotron-3-8b-chat-4k-sft"> chat-sft</a>,<a 
href="https://huggingface.co/nvidia/nemotron-3-8b-chat-4k-rlhf"> chat-rlhf</a>, <a href="https://huggingface.co/collections/nvidia/nemotron-3-8b">collection</a>.<a href="https://developer.nvidia.com/blog/nvidia-ai-foundation-models-build-custom-enterprise-chatbots-and-co-pilots-with-production-ready-llms/"> Blog</a>.</p><p><strong>Feb 2024 &#8212; Nemotron-4 15B:</strong> Multilingual LLM trained to 8T tokens. <a href="https://arxiv.org/abs/2402.16819">Paper</a>.</p><p><strong>Jun 2024 &#8212; Nemotron-4 340B:</strong> Major open release detailing their synthetic data pipeline. <a href="https://arxiv.org/abs/2406.11704">Paper</a>,<a href="https://blogs.nvidia.com/blog/nemotron-4-synthetic-data-generation-llm-training/"> blog</a>. Models:<a href="https://huggingface.co/nvidia/Nemotron-4-340B-Instruct"> Instruct</a>,<a href="https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/nemotron-4-340b-reward"> Reward</a>. </p><p><strong>Jul&#8211;Sep 2024 &#8212; Minitron / Nemotron-Mini:</strong> First of their pruned models, pruned from 15B<em>.</em><a href="https://huggingface.co/nvidia/Minitron-4B-Base"> Minitron-4B</a> (base model), <a href="https://huggingface.co/nvidia/Nemotron-Mini-4B-Instruct">Nemotron-Mini-4B-Instruct</a>.<a href="https://arxiv.org/abs/2407.14679"> Paper</a>,<a href="https://github.com/NVlabs/Minitron"> code</a>.</p><p><strong>Oct 2024 &#8212; Llama-3.1-Nemotron-70B:</strong> Strong post-training on Llama 3.1 70B.<a href="https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"> Model</a>,<a href="https://huggingface.co/collections/nvidia/llama-31-nemotron-70b"> collection</a>. 
Key dataset &#8212; <a href="https://huggingface.co/datasets/nvidia/HelpSteer2">HelpSteer2</a>,<a href="https://arxiv.org/abs/2410.01257"> paper</a>.</p><p><strong>Mar&#8211;Jun 2025 &#8212; Nemotron-H:</strong> First hybrid Mamba-Transformer models for inference efficiency<em>.</em><a href="https://arxiv.org/abs/2504.03624"> Paper</a>,<a href="https://research.nvidia.com/labs/adlr/nemotronh/"> research page</a>,<a href="https://developer.nvidia.com/blog/nemotron-h-reasoning-enabling-throughput-gains-with-no-compromises/"> blog</a>. Models:<a href="https://huggingface.co/nvidia/Nemotron-H-8B-Base-8K"> 8B</a>,<a href="https://huggingface.co/nvidia/Nemotron-H-47B-Base-8K"> 47B</a>,<a href="https://huggingface.co/nvidia/Nemotron-H-4B-Instruct-128K"> 4B-128K</a>.</p><p><strong>May 2025 &#8212; Llama-Nemotron:</strong> Efficient reasoning models built on top of Llama (<em>still!</em>). <a href="https://arxiv.org/abs/2505.00949">Paper</a>.</p><p><strong>Sep 2025 &#8212; Nemotron Nano 2:</strong> 9B hybrid for reasoning, continuing to improve in performance<em>.</em> 12B base on 20T tokens (FP8 training) pruned to 9B for post-training.<a href="https://research.nvidia.com/labs/adlr/files/NVIDIA-Nemotron-Nano-2-Technical-Report.pdf"> Report</a>, <a href="https://huggingface.co/collections/nvidia/nvidia-nemotron-v2">V2 collection</a>.</p><p><strong>Nov 2025 &#8212; Nemotron Nano V2 VL:</strong> 12B VLM<em>. </em><a href="https://research.nvidia.com/labs/adlr/files/NVIDIA-Nemotron-Nano-V2-VL-report.pdf">Report</a>.</p><p><strong>Dec 2025 &#8212; Nemotron 3:</strong> Nano/Super/Ultra family, hybrid MoE, up to 1M context. Super/Ultra H1 2026. <br>Nano: 25T tokens, 31.6B total / ~3.2B active, releases recipes + code + datasets. <a href="https://nvidianews.nvidia.com/news/nvidia-debuts-nemotron-3-family-of-open-models">Blog</a>. Papers:<a href="https://arxiv.org/abs/2512.20856"> White Paper</a>,<a href="https://arxiv.org/abs/2512.20848"> Technical Report</a>. 
Models:<a href="https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"> Nano-30B-BF16</a>,<a href="https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16"> Base</a>,<a href="https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8"> FP8</a>.</p><h3>Nemotron&#8217;s Recent Datasets</h3><p>NVIDIA began releasing substantially more data in 2025, including pretraining datasets &#8212; making them one of few organizations releasing high-quality pretraining data at scale (which comes with non-negligible legal risk).</p><h4>Pretraining Data</h4><p><a href="https://huggingface.co/collections/nvidia/nemotron-pre-training-datasets">Collection</a> &#8212;<a href="https://huggingface.co/datasets/nvidia/Nemotron-CC-v2"> CC-v2</a>,<a href="https://huggingface.co/datasets/nvidia/Nemotron-CC-v2.1"> CC-v2.1</a>,<a href="https://huggingface.co/datasets/nvidia/Nemotron-CC-Code-v1"> CC-Code-v1</a>,<a href="https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-Code-v2"> Code-v2</a>,<a href="https://huggingface.co/datasets/nvidia/Nemotron-Pretraining-Specialized-v1"> Specialized-v1</a>,<a href="https://huggingface.co/datasets/nvidia/Nemotron-CC-Math-v1"> CC-Math-v1</a>. 
Math paper:<a href="https://arxiv.org/abs/2508.15096"> arXiv:2508.15096</a>.</p><h4>Post-Training Data</h4><p><strong>Core post-training dumps (SFT/RL blends):</strong></p><ul><li><p><a href="https://huggingface.co/datasets/nvidia/Llama-Nemotron-Post-Training-Dataset">Llama Nemotron Post-Training v1.1</a> (Apr 2025)</p></li><li><p><a href="https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v1">Nemotron Post-Training v1</a> (Jul 2025)</p></li><li><p><a href="https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v2">Nemotron Post-Training v2</a> (Aug 2025)</p></li></ul><p><strong>2025 reasoning/code SFT corpora:</strong></p><ul><li><p><a href="https://huggingface.co/datasets/nvidia/OpenMathReasoning">OpenMathReasoning</a> (Apr 2025)</p></li><li><p><a href="https://huggingface.co/datasets/nvidia/OpenCodeReasoning">OpenCodeReasoning</a> (Apr 2025),<a href="https://huggingface.co/datasets/nvidia/OpenCodeReasoning-2"> OpenCodeReasoning-2</a> (May 2025)</p></li><li><p><a href="https://huggingface.co/datasets/nvidia/AceReason-1.1-SFT">AceReason-1.1-SFT</a> (Jun 2025)</p></li><li><p><a href="https://huggingface.co/datasets/nvidia/Nemotron-Math-HumanReasoning">Nemotron-Math-HumanReasoning</a> (Jun 2025),<a href="https://huggingface.co/datasets/nvidia/Nemotron-PrismMath"> Nemotron-PrismMath</a> (Apr 2025)</p></li></ul><p><strong>NeMo Gym RLVR datasets:</strong><a href="https://huggingface.co/collections/nvidia/nemo-gym"> Collection</a></p><p><strong>Nemotron v3 post-training (Dec 2025):</strong><a href="https://huggingface.co/collections/nvidia/nemotron-post-training-v3"> Collection</a></p><p><strong>HelpSteer (human feedback/preference):</strong></p><ul><li><p><a href="https://huggingface.co/datasets/nvidia/HelpSteer">HelpSteer</a> (Nov 2023)</p></li><li><p><a href="https://huggingface.co/datasets/nvidia/HelpSteer2">HelpSteer2</a> (Jun 2024)</p></li><li><p><a href="https://huggingface.co/datasets/nvidia/HelpSteer3">HelpSteer3</a> (Mar 
2025)</p></li></ul><p>And others, not linked here.</p><h2>Chapters</h2><ul><li><p>00:00:00 Intro &amp; Why NVIDIA Releases Open Models</p></li><li><p>00:05:17 Nemotron&#8217;s two jobs: systems R&amp;D + ecosystem support</p></li><li><p>00:15:23 Releasing datasets, not just models</p></li><li><p>00:22:25 Organizing 500+ people with &#8220;invitation, not control&#8221;</p></li><li><p>00:37:29 Scaling Nemotron &amp; The Evolution of Megatron</p></li><li><p>00:48:26 Career Reflections: From SVMs to DLSS</p></li><li><p>00:54:12 Lessons from the Baidu Silicon Valley AI Lab</p></li><li><p>00:57:25 Building an Applied Research Lab with Jensen Huang </p></li><li><p>01:00:44 Advice for Researchers &amp; Predictions for 2026</p></li></ul><h2>Transcript</h2><p><strong>00:00:06 Nathan Lambert:</strong> Okay. Hey, Bryan. I&#8217;m very excited to talk about Nemotron. I think low-key, one of the biggest evolving stories in twenty-five of open models, outside the obvious things in China that everybody talks about, that gets a ton of attention. So th- thanks for coming on the pod.</p><p><strong>00:00:22 Bryan Catanzaro:</strong> Oh, yeah, it&#8217;s my honor.</p><p><strong>00:00:23 Nathan Lambert:</strong> So I wanted to start, and some of these questions are honestly fulfilling my curiosity as a fan. As like, why does NVIDIA, at a basic level, release Nemotron as open models?</p><p><strong>00:00:39 Bryan Catanzaro:</strong> Well, we know that it&#8217;s an opportunity for NVIDIA to grow our market whenever AI grows, and we know that having access to open AI models is really important for a lot of developers and researchers that are trying to push AI forward. you know, we were really excited by efforts from some other companies around the industry to push openly developed AI forward. You know, Meta did some amazing work, obviously, with Llama and you know OpenAI released GPT OSS, which was exciting. 
And the Allen Institute, of course, has been, you know, really leading the charge for research, open research and, you know, also things like the Marin Project and OpenAthena. You know, like there&#8217;s, there&#8217;s a bunch of things that we&#8217;re always excited to see develop.</p><p>And, you know, as we think about where AI is gonna go, you know, NVIDIA believes that AI is a form of infrastructure. it&#8217;s.. AI is a very useful technology when it&#8217;s applied, but on its own you know, it&#8217;s kind of a foundation and infrastructure. We think that technology generally works better when there&#8217;s openness to the infrastructure so that people can build things in different ways. You know, you think about the way that the internet transformed every aspect of the world economy is pretty profound, and we&#8217;re not done yet.</p><p>But the way that, for example, retail uses the internet is different from the way that healthcare uses the internet. And the fact that you know, different sectors of the economy were able to figure out how to incorporate the internet into the beating heart of their businesses in different ways was possible because the internet was built on open technologies that, you know, allowed people to try different things. And we think AI is gonna evolve in a similar way, that organizations across every sector of the world economy are gonna find new and surprising and fun, and important things to do with AI, and they&#8217;ll be able to do that better if they have the ability to customize AI and incorporate it directly into the work that they do. and so -- and by the way, this is not to detract from any of the you know, more closed approaches to AI, you know, the APIs that we see from a number of leading labs that, you know, are just extraordinary and have amazing capabilities. 
We&#8217;re excited about those, too.</p><p>You know, NVIDIA loves to support AI in all of its manifestations, but we feel like right now the sort of closed approaches to deploying AI are doing pretty well but we, you know, could use some more energy in the openly developed AI ecosystem, and so that&#8217;s why we&#8217;ve been putting more effort into it this past year.</p><p><strong>00:03:42 Nathan Lambert:</strong> Yeah. So I&#8217;m definitely gonna dig into this a lot &#8216;cause I have seen this. We&#8217;re sitting here recording in January twenty-six, which is in the midst of the rollout of these Nemotron three models. There&#8217;s the-- I think the Nano has released in the fall, which was probably one of the biggest splashes the org has made, and everybody&#8217;s eagerly awaiting these super and ultra-larger variants.</p><p>And it&#8217;s like how far are you, how far are you willing to push this Nemotron platform? Like, is it just depending on the users and the uptake and the ecosystem? Like, like, what is the-- is there a North Star in this? Or you hear a lot of.. if you listen to a lot of other open labs, they&#8217;re like: &#8220;We want to build open AGI,&#8221; which is like, I don&#8217;t necessarily think grounded, but there&#8217;s like a very unifying vision.</p><p>Is there something that you try to set the tone for it that goes through the organization? I mean, AI2, it&#8217;s like-</p><p><strong>00:04:31 Bryan Catanzaro:</strong> You know, my North-</p><p><strong>00:04:32 Nathan Lambert:</strong> .. academics is so-</p><p><strong>00:04:34 Bryan Catanzaro:</strong> For Nemotron.</p><p><strong>00:04:36 Nathan Lambert:</strong> Okay, go ahead.</p><p><strong>00:04:37 Bryan Catanzaro:</strong> Oh, sorry. 
Go ahead.</p><p><strong>00:04:39 Nathan Lambert:</strong> I was just, like, gonna compare to, like, AI2, where we can have such a-- like, we have a very specific vision, being so open that it&#8217;s like, I think, like, research is so needed, and there&#8217;s so little recipes to build on, like, with really credible research. So there&#8217;s, like, a research infrastructure, and then when you have something like Llama, it was, like, built on Zuckerberg&#8217;s vision, and he changed his mind, which I actually thought his vision was ex- was excellent, the way he articulated the need for open models, and it kind of faded. So it&#8217;s like, is there a way to set a vision for an org that, like, permeates every- everyone and is really compelling and exciting?</p><p><strong>00:05:17 Bryan Catanzaro:</strong> Right. Well, we built Nemotron for two main reasons. The first is because we need to for our main product line. So what I mean by that?</p><p>Well, accelerated computing, what NVIDIA does, we build fast computers, right? But the point of building fast computers is to help people do new things. and actually every fast computer is also a slow computer. you know, the observation that it would be nice if computers were faster and could do more things isn&#8217;t new. that&#8217;s been around since the beginning of computing. So what makes accelerated computing different from standard computing is that we&#8217;re prioritizing, you know, we&#8217;re focusing, we&#8217;re deciding we&#8217;re gonna accelerate this workload. This other workload, which is like ninety-nine percent of all of the workloads, we&#8217;re gonna let somebody else do that, right?</p><p>So, like, you do not buy NVIDIA systems to do any general purpose computation. You buy them for a purpose, right? Which is these days, all about AI. 
But when you think about the workload, the compute workloads involved in AI there&#8217;s a, there&#8217;s a lot of diversity and there&#8217;s a lot of really important -.. parameters, hyperparameters, or algorithmic approaches that all have enormous imp- impacts on the systems that we need to build for AI.</p><p>So things like numeric precision MoE architecture, which of course, influence net-- it influences network design. you know, we&#8217;re dreaming about sparsity. We, you know, we&#8217;ve had, we&#8217;ve had sparse neural network acceleration in the GPU since Ampere. I don&#8217;t think that it&#8217;s being used enough. you know, so how do we, how do we figure out how to use that? These, these sorts of things have an enormous impact on the future of NVIDIA&#8217;s main product line, and we have to understand the answers to those questions deeply ourselves in order to know what we&#8217;re going to build.</p><p>We can&#8217;t just go to our customers and do a survey and say, &#8220;Hey &#8220; you know, Meta, for example, since we were just talking about them, &#8220;what would you like to see in a future product line from NVIDIA?&#8221; Of course, Meta&#8217;s always trying to help us as much as they can, but there&#8217;s limits to what they can tell us because, you know a lot of the information that influences the design of these systems, it&#8217;s very expensive to derive, and so therefore, it&#8217;s, it&#8217;s very closely held. And so we need to be able to understand these questions very deeply in order to understand what kind of systems to build, in order to understand what we&#8217;re accelerating in AI and what we&#8217;re not gonna worry about. and so that&#8217;s kind of the first job for Nemotron models, is to make it possible for NVIDIA to continue to exist as a company. 
And I think it&#8217;s important that the community knows that because that&#8217;s the reason why NVIDIA is making the investments in Nemotron, is because we believe it&#8217;s essential for the future of our company. and so this isn&#8217;t-- and although as much, as much as it feels good to say, you know, NVIDIA believes in open openly developed AI because you know, we&#8217;re so charitable, but actually, that&#8217;s not the case. This is actually a business decision-</p><p><strong>00:08:34 Nathan Lambert:</strong> It&#8217;s smart</p><p><strong>00:08:34 Bryan Catanzaro:</strong> .. like, for NVIDIA, our business needs us to know about AI very deeply. And and so, you know, the amount of investment that is justified to carry on NVIDIA&#8217;s ongoing business, I think, is large. and so that&#8217;s that&#8217;s job number one for Nemotron. Now job number two for Nemotron is to support the ecosystem more broadly outside of NVIDIA. and, you know, NVIDIA has a special position in the AI landscape. of all of the big AI companies I think we&#8217;re the one that works with the most other companies. We support every company small and large, AI native company to old established enterprise.</p><p>We work with hyperscalers, we work with tiny little startups, we work with countries around the world. so we have this unique position and I think also a uni- unique responsibility and al- maybe also a unique opportunity, that whenever AI is able to grow in any sort of direction, in any capability, then you know, that&#8217;s an opportunity for us to grow our business. Obviously, it&#8217;s not automatic, right? you know, the AI market is diverse, and it&#8217;s getting more diverse, and it should be, &#8216;cause it&#8217;s the most important market in the history of humanity. So so we acknowledge that, and at the same time, we know that it&#8217;s in our interest to develop the AI ecosystem. 
The more people that are building, inventing, and deploying AI, the more opportunity that we have as a company.</p><p>So that&#8217;s job number two for Nemotron.</p><p><strong>00:10:17 Nathan Lambert:</strong> Yeah. I really appreciate you saying it so directly &#8216;cause it&#8217;s like we&#8217;ve worked.. We- I launched this thing, the ATOM Project, last summer, which is trying to get more investment in the US open models, and it&#8217;s like the only company that has an obvious business model for open models is something like NVIDIA, where you need to make sure that the open models and the research ecosystem plays nicely on CUDA, because then you&#8217;re gonna be able to be one-- You&#8217;re so many steps closer to research that&#8217;s happening. If not, like, if it like- There&#8217;s such an advantage to have research happen mostly on GPUs relative to AMD or anything like this, so.</p><p><strong>00:10:49 Bryan Catanzaro:</strong> Well, you know, we are-- we&#8217;re, we&#8217;re not thinking about how to prevent competition. You know, we welcome competition. There&#8217;s lots of competition. There should be more competition in this space, but we are very self-interested in staying engaged with the community.</p><p>You know, it&#8217;s very important. You know, CUDA not many people remember this because it happened so long ago, but you know, CUDA started out with a lot of outreach from NVIDIA to the academic and industrial community saying, &#8220;Hey, we have this new way of doing computing. we&#8217;d love to see what you can do with it.&#8221; In fact, you know, I started using CUDA in 2006 when I was a grad student at Berkeley because David Kirk, who was the chief scientist of NVIDIA at the time, came over to Berkeley and said, &#8220;Hey we just released this new GPU, and it has this new programming model called CUDA. 
You should give it a try.&#8221; At the time, I was working on machine learning on FPGAs, and I had been working on one particular piece of support vector machine training on the FPGA. I decided to take that little piece and write it in CUDA, and it took me like fifteen minutes, and then I ran it, and it was like two hundred times faster than my single-threaded CPU code. And I was like: &#8220;Whoa, that was way easier than what I was doing before. I&#8217;m just gonna go do that,&#8221; right?</p><p>So my own personal involvement with CUDA and NVIDIA came about because of this outreach that NVIDIA conducted right from the beginning of CUDA. Of course, that led to a lot of great things for NVIDIA, including AlexNet, which was another academic project, where Alex Krizhevsky and Ilya Sutskever were thinking about: &#8220;How do we train larger neural networks on more data? We&#8217;re gonna go write a bunch of GPU code that uses the GPU in a kinda new and clever way, so that we can train a better image classification model.&#8221; And that had such astonishing results, it kicked off the deep learning era for the whole community. Again, that&#8217;s not something that could have been done top-down. That was very much a result of NVIDIA supporting open development and research in parallel computing and artificial intelligence. So we remember that, and we&#8217;re thinking about, in 2026, what does it look like to help the Alex Krizhevsky of the future, who&#8217;s a grad student in a lab somewhere, invent the next technology that changes the world? It seems really difficult to do that without something like Nemotron or the other openly developed AI projects out there. I also wanna say, in regards to this: Nemotron is not trying to be the only project out there.</p><p>We&#8217;re part of the community.
We love other people doing great work in openly developed AI. We learn from things that other people do, so we&#8217;re trying to support the community because it&#8217;s in our interest, but we&#8217;re very happy to see other people contributing as well.</p><p><strong>00:13:57 Nathan Lambert:</strong> Yeah, I can transition into something I wanted to ask about. In 2025, Nemotron-- I don&#8217;t wanna use the word maturing, &#8216;cause I wanna ask you about how it feels in the org-- but the output reached levels that were more noticed by the community and by people building with models. There&#8217;s a lot of ways that can happen, but one of them is, in my niche community, I&#8217;ve been using Nemotron datasets a lot. When we redo our post-training recipe, one of the only places we look is, okay, NVIDIA&#8217;s Nemotron has released a lot of high-quality, openly licensed post-training data. This year, you also started releasing some pre-training data, which got a lot of notice at AI2. Like, what is that? Is that a distinct shift within Nemotron?</p><p>Is that something that you&#8217;ve wanted to do for a while and finally just did? It&#8217;s just like a zero-to-one moment, where releasing pre-training data comes with legal risk for any company, but so few people do it. On my side of the world, it&#8217;s normally pretty easy to say what the best pre-training dataset is, and it had, for a long time, oscillated between like Hugging Face, AI2, DCLM, and there were literally only two or three options. So in terms of fundamental research, I think that&#8217;s a big step for an org, to support the community and take on some risk. So if you have any story you can tell, and or just say, like, I appreciate it, that&#8217;s all..
that&#8217;s all I got.</p><p><strong>00:15:23 Bryan Catanzaro:</strong> Well, yeah. I think it&#8217;d be great if more people could understand that Nemotron is not just a model, right? What we&#8217;re trying to do with Nemotron is to support openly developed AI, because, again, that&#8217;s our big opportunity, right? Now, there&#8217;s a lot of organizations that are incentivized to build a model, and the model is maybe the thing that runs their business, right?</p><p>But at NVIDIA, the model is not the thing that runs our business; it&#8217;s the systems. So when we&#8217;re thinking about how we support the ecosystem, it&#8217;s clear to us that the ecosystem needs more than just a model. There&#8217;s a lot of models out there already, you know? And of course, we want Nemotron to be awesome, but if Nemotron can convince other people to work on AI because of a dataset or a technique, we&#8217;re trying to be very open with all of the things we learn, including..</p><p>I mean, we do a lot of expensive experiments in order to figure out how to do blending for our datasets, or to optimize our settings, these sorts of things. We&#8217;re very happy for other people to pick that up and run with it if it&#8217;s useful to them. And so that makes Nemotron a different kind of AI effort. Of course, there is a model component, and that&#8217;s a tangible thing, and it&#8217;s easy to focus on that, but we see Nemotron as an effort that includes models, but also datasets, techniques, all of the research that goes into Nemotron.
And again, we&#8217;re a unique kind of AI organization. Because of the way that we work with AI companies around the industry, and because of the way that our business works, we can afford to be more open with some of these things than maybe some other organizations could be.</p><p>Now, to your question about, like, does it take some courage in order to be open? Yeah, absolutely it does. One of the things that&#8217;s happened in 2025 is that there&#8217;s been an evolving understanding within NVIDIA about the benefits of openness, and that has really enabled the company to make some investments that perhaps it was a little gun-shy to make in the past. And so that&#8217;s really encouraging for me. It&#8217;s something that I&#8217;ve advocated for a while, and so it&#8217;s great to see the company kind of lining up behind it. Also, to your point about 2025 being a year where Nemotron really made some strides, I want to say thank you for noticing that, and then maybe tell you a little bit about how that happened, because I think it&#8217;s instructive about how I think the work is gonna go forward in the future.</p><p>NVIDIA is a very decentralized company with a lot of volunteers. Everybody that works at NVIDIA is a volunteer. And what do I mean by that? Well, look, the industry is moving quick.</p><p>People can always move from one job to the next. So the way that we think about the work that we do is very decentralized; it&#8217;s very much let smart people figure out what they should be doing and then kind of self-organize. Now, one of the challenges of self-organization in a field that&#8217;s moving quickly is that sometimes a whole bunch of people decide to do similar, kind of overlapping things, but aren&#8217;t really coordinated.
And that&#8217;s okay at the beginning, because in a place like NVIDIA, it&#8217;s just great to have some energy. It took us a while, I think, as a company to figure out that Nemotron was better together.</p><p>Rather than having this group with a model and that group with a dataset, where we end up publishing papers that don&#8217;t really acknowledge each other and aren&#8217;t really coordinated-- and then, of course, along with that, we need to have k times the GPUs, where k is the number of independent efforts-- we realized that, building AI, you really do need to figure out how to collaborate. The AI efforts that are built from teams of people focused on the overall effort succeeding, rather than their own particular piece of the project succeeding, those are the ones that really change the world. And, of course, NVIDIA works that way for the systems that we build, right? The people working on the memory controller on the GPU know that they also have to work with the people working on the SM that does the math, right?</p><p>You can&#8217;t make a GPU where it&#8217;s just like, &#8220;Well, we&#8217;ve got an awesome memory controller,&#8221; if the math doesn&#8217;t work, right? It all has to kinda work together. And that coordination, I think, in the field of AI, took us a little bit longer than you might imagine it could have, and I think that slowed the progress for Nemotron. So I give a lot of credit to the Nemotron team for realizing, over the past year and a half or so, that it was really time to join up and build one thing and make it awesome, and for deeply understanding that the success of the Nemotron project was more important than the success of any individual piece of that project.
And the reason why I&#8217;m telling you all of this is because I think that&#8217;s actually true more broadly than just inside NVIDIA, and I think it&#8217;s difficult. Researchers like those of us with PhDs, for example, we are taught how to be independent, how to build up our Google Scholar profile, and there&#8217;s an incentive to go ahead and focus on that.</p><p>And a lot of successful academics and researchers manage to push that pretty far and get some pretty amazing results. But I do believe that in the 2020s, the best research is done as part of a larger team. So how do we figure out how to work together? How do we figure out how to put the success of the team first? That is a thing that is challenging to do, but if we can achieve it, I think it yields significant results.</p><p>And to the extent that we made progress in that part of the organization, I think we also saw progress in the technology. That gives me great hope for 2026 for Nemotron, because the way the team is working together, I think, is pretty extraordinary. There&#8217;s just an enormous number of brilliant people that have decided that they&#8217;re gonna volunteer to make Nemotron awesome, and we&#8217;re starting to see some pretty great things come together.</p><p><strong>00:22:25 Nathan Lambert:</strong> I agree with everything you said. Do you have any advice for making the orgs come together? I&#8217;ve seen two classes of AI companies right now. One is the startup that does everything, and you have a model in six months, but you&#8217;re building from zero, and everybody agrees when they start that they do this. And then you have Google&#8217;s famous long-winded reorgs, which they actually eventually got right.
Like, they got it very right with what&#8217;s going on with Gemini and Google DeepMind right now. Do you have any advice on doing this? I&#8217;m at AI2, also advocating for this, but it&#8217;s very hard. I think personally-</p><p><strong>00:22:58 Bryan Catanzaro:</strong> It&#8217;s-</p><p><strong>00:22:58 Nathan Lambert:</strong> .. I&#8217;m a special case &#8216;cause I&#8217;m also visible, where it&#8217;s very easy for me to turn internet activity into, like, reputation points because of algorithms and size. But it&#8217;s very hard to do bottom-up technical work and get all of this and get all the culture alignment. So do you have any advice on what actually works in this domain?</p><p><strong>00:23:20 Bryan Catanzaro:</strong> What&#8217;s worked for us is invitation and not control. One way that, for a while, I kinda wanted to try to implement was: nobody gets to publish any papers in AI unless they&#8217;re clearly part of Nemotron. So this is kind of a top-down, we&#8217;re-gonna-make-you-do-it approach, right? I came to the realization-- we never implemented this, by the way-- that this was a bad idea, because it would just breed resentment, and NVIDIA is a company of volunteers. Everybody here is a volunteer.</p><p>So what we need to do is create the conditions by which it makes sense for people to volunteer to be part of Nemotron. The way that we went about doing that, first of all, involved some top-level agreements between me and some of the other leaders of Nemotron, for example, John Cohen and Kari Briski. I work very closely with the two of them. And that hadn&#8217;t always been the case.</p><p>Like, we kind of had all come to this place independently.
But we realized, like, Nemotron, better together, all three of us, and then we started telling our teams: &#8220;You know, we really think Nemotron is gonna be better together.&#8221; So that top-down alignment, I think, was really helpful. Again, we weren&#8217;t telling people exactly what to do, but we were just sending a constant message, like, &#8220;Nemotron&#8217;s better together.&#8221; And then we built some structures that facilitated collaboration. So in the past, decisions in the Nemotron project tended to be made in kind of an opaque way. The reason for that is just that it&#8217;s hard to tell everybody about the middle of the sausage-making process. It&#8217;s messy and difficult, and so it&#8217;s natural.</p><p>Like, researchers, we&#8217;re used to doing this, right? It&#8217;s a fait accompli. Like, &#8220;Here&#8217;s my ICML paper,&#8221; and the fact that you spent, like, two years failing at that task before you finally succeeded, and then you tied a bow around it and gave it to the ICML committee, you don&#8217;t really talk about that, right? And so it&#8217;s difficult for researchers to be open about the middle of the process of research.</p><p>There&#8217;s a lot of failure, and it&#8217;s hard for people to feel like they&#8217;re not looking amazing. But what we decided to do is structure the project into about twenty different areas. Each of them has a clear leader, what we call a pilot in command.</p><p>The job of the pilot in command is to land the airplane. You just want the airplane to land, okay? If you&#8217;re landing an airplane, there might be multiple pilots on board, but only one of them is gonna land the airplane at any time, right?
Because it would be chaos if two of them tried to land at the same time; people would die.</p><p>So this is not a committee structure; it is a delineated responsibility structure. The purpose of that pilot in command for each of these sections is to gather together all the best ideas, help the group of people that are interested in working on that space come up with data-driven answers to what we should do and what technical decisions we should make, and then document that in a way that other people can review. The thing that&#8217;s been really great about that is that it is inviting to people, because when they see, okay, here&#8217;s the group of volunteers that are working on this area of Nemotron, and they want to contribute, it&#8217;s much clearer how they could go about doing that, and it&#8217;s also clearer what the group needs, because these meetings are being held in the open. We actually have a website where all of the ideas are submitted. They each get a unique identifier, and then they get engaged with: the PIC is trying to understand what the implications are, what kinds of experiments need to be run in order to prove or disprove the idea, and how we do what I call integration studies. Integration studies are so key for bringing researchers together, and they&#8217;re so opposite of what we are taught when we&#8217;re learning how to do ablations as a graduate student. Rather than isolating the particular contribution of one idea, integration studies are about putting a hundred ideas together and seeing if they&#8217;re better than what we had before.
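The contrast Bryan draws between ablations and integration studies can be sketched as a toy harness. Everything here is hypothetical-- the eval function, the idea names, and the scores are invented stand-ins, not NVIDIA's tooling-- but it shows why the two measurements can disagree.

```python
# Toy contrast between ablations and an integration study.
# The eval function and "ideas" are hypothetical stand-ins.

def evaluate(config: set) -> float:
    # Pretend eval score: each idea helps a little, and two ideas
    # only pay off when combined (a common reason ablations mislead).
    score = 50.0
    if "new_data_blend" in config:
        score += 2.0
    if "lr_schedule_tweak" in config:
        score += 1.0
    if {"fp4_numerics", "loss_scaling"} <= config:
        score += 3.0  # interaction effect: only shows up together
    return score

baseline = set()
ideas = ["new_data_blend", "lr_schedule_tweak", "fp4_numerics", "loss_scaling"]

# Ablation-style: measure each idea in isolation against the baseline.
for idea in ideas:
    delta = evaluate(baseline | {idea}) - evaluate(baseline)
    print(f"ablation {idea}: {delta:+.1f}")

# Integration study: put all the ideas together and compare once.
combined = evaluate(baseline | set(ideas)) - evaluate(baseline)
print(f"integration study (all ideas): {combined:+.1f}")
```

In this toy setup, the per-idea ablations sum to less than the integrated gain, because two of the ideas only help in combination-- the kind of interaction an isolated ablation cannot see.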
So doing that in a structured way, and in an open way internally, has made it possible for more people to volunteer, and that has generally raised the rigor of the experiments and also, I think, the outcome of the work.</p><p><strong>00:28:15 Nathan Lambert:</strong> Yeah, this is great. I think that over the last few years, there&#8217;s been more consensus on things that work for research. We also do integration tests very regularly, of like, is this feature gonna land for the model?</p><p>It&#8217;s a nice mirror to ablations, where we know research is changing so much. There&#8217;s a lot of turmoil in the academic research community, and it&#8217;s nice to have things that are tangible, as ways of working that are a little bit different when you&#8217;re doing these large-scale projects. Like, you still need to do ablations, but then it needs to survive an additional test in order to land in the model.</p><p>So it&#8217;s an additional type of work that needs to be done, and I just like to have words to describe what is actually happening. On the Nemotron-3 Nano front, I do a lot of analysis on just looking at basic adoption metrics, and for Nemotron we created what we called a relative adoption metric, which is essentially looking at downloads over time for models, because it&#8217;s easy to know which models released a while ago have a ton of downloads, but harder to look at the trajectory of downloads changing over time. This is a mouthful, and it&#8217;s kind of an aside, but Nemotron Nano 3, in the 30B size range, was on track to be one of the top ten models downloaded of all time.</p><p>The point that I bring this up, other than to just flatter you, is: do you think last-mile adoption takes a substantial amount of work, other than making a very functional model?
Or does adoption-- do you need to change the recipe that you&#8217;re making, put a lot of focus on evaluation, and change this over time so that you actually get people to really use the model, rather than, like, &#8220;Oh, the benchmarks are good, look at NVIDIA flying high&#8221;?</p><p><strong>00:30:03 Bryan Catanzaro:</strong> Right. Yeah, wow, it has taken the whole company coming together in order to make Nano V3 have more of an impact than the models that we released before, and there&#8217;s so many different aspects to that. Obviously, there&#8217;s a lot of technical aspects where, frankly, I think we have more work to do. Making sure that on day zero, when we release something, all the quantizations-- the best quantizations-- are out there; that the speed on all of the important inference frameworks is there; that it runs flawlessly on all of the edge devices that we care about; that the install experience is great. This kind of work is extraordinarily important, because it&#8217;s a crowded world.</p><p>There&#8217;s so many different things that people could choose to work with, and any amount of friction that gets in the way of people even evaluating something that you do is gonna blunt the results, no matter how good that technology is. I don&#8217;t think that we&#8217;re amazing at this yet, so this is something that I anticipate we&#8217;re gonna see a lot more investment in, as more people at NVIDIA from all over the company-- from marketing, from developer relations, from software engineering-- come together in support of this effort. So yeah, it does take an enormous amount of work. And then something that I&#8217;m particularly interested in is: how do we engage with the community in a new way to make future Nemotron models even stronger?
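The &#8220;relative adoption metric&#8221; Nathan describes-- download totals adjusted for model age, so trajectories are comparable-- could be sketched roughly like this. The model names and download counts are invented, and this is one plausible reading of the idea, not Interconnects&#8217; actual methodology.

```python
from datetime import date

# Hypothetical cumulative download snapshots, Hugging Face-style:
# model -> (release_date, {snapshot_date: cumulative_downloads}).
models = {
    "old-model-a": (date(2024, 3, 1), {date(2026, 1, 1): 9_000_000}),
    "new-model-b": (date(2025, 11, 1), {date(2026, 1, 1): 2_400_000}),
}

def relative_adoption(release: date, snapshots: dict) -> float:
    """Average downloads per day since release, so a young model on a
    steep trajectory can rank above an old model with a bigger total."""
    snap_date, total = max(snapshots.items())  # latest snapshot
    days = (snap_date - release).days
    return total / days

for name, (release, snaps) in models.items():
    print(name, round(relative_adoption(release, snaps)))
```

Here the newer model wins despite the smaller absolute total, which is the point of normalizing: raw all-time downloads systematically favor whatever was released earliest.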
If the only things that we were to optimize for with a Nemotron model were the kind of academic benchmarks that are highly cited, it&#8217;s likely the case that the model wouldn&#8217;t be general enough to really be useful. What we&#8217;re trying to build is a technology that other people can extend and deploy, and that means we need to have other ways of understanding the strength of a model besides a handful of academic benchmarks.</p><p>I think we have a lot of room to grow here. I&#8217;m hoping over time that we develop the muscle of being able to engage with the community and learn from them. Like, okay, this particular thing that I tried to do with Nemotron didn&#8217;t work, or it did this other thing that I wasn&#8217;t expecting, it was wrong. Well, that can become feedback that is then used to make the next version better.</p><p>I think we&#8217;ve got a lot of work to do in that regard.</p><p><strong>00:33:10 Nathan Lambert:</strong> Do you think there&#8217;s any magic to it? I&#8217;m blown away by how successful OpenAI&#8217;s two open-source models are. Yes, they&#8217;re obviously the number one name brand in AI, but on the same metric where I see you guys overperforming what I would expect-- like, &#8220;Wow, great job, NVIDIA&#8221;-- they&#8217;re totally off the charts, on track to beat Llama&#8217;s most-downloaded numbers ever with these two GPT-OSS models.</p><p>And even on release, they had hiccups where people were pretty negative on it. But for whatever reason, people figured it out, and it just clicked, and then, like, for a company to say so little about it.
Meta put so much effort into Llama being adopted, and you obviously are putting a lot of effort into this.</p><p>I&#8217;m just like, did OpenAI just crack the code, or is there sometimes a bit of luck?</p><p><strong>00:33:59 Bryan Catanzaro:</strong> Well, I don&#8217;t think about OpenAI as a lucky company. I think of them as a visionary company that works incredibly hard, and I think their success is well deserved. I love the GPT-OSS models. They&#8217;re definitely an inspiration for us here at Nemotron. OpenAI also has some other ways of engaging with the community, just because of the large number of people that use their services, and that helps them learn things about what people are trying to do with AI that they can then address when they&#8217;re building models. Obviously, people talk about that as a flywheel, and I think that&#8217;s really interesting and really important.</p><p>NVIDIA is never going to have the same kind of flywheel as OpenAI does. We&#8217;re not trying to build a service like ChatGPT. What we&#8217;re trying to do is help the ecosystem be strong and enduring. We think it&#8217;s important for there to be this openly developed AI ecosystem, and also we&#8217;re trying to build our next generation of systems, so we have our own reasons for doing this. But we&#8217;re never going to have the same exact user base or flywheel that OpenAI does.</p><p>On the other hand, we are able to work with institutions around the world in our own way, and I think that offers us different opportunities and, hopefully, helps us make things that are useful, too.</p><p><strong>00:35:38 Nathan Lambert:</strong> Yeah, this makes me realize, I&#8217;m having a lot of conversations on..
There are many open model efforts, even among people that are fully open, and it&#8217;s like, how do we better coordinate? Especially at the smaller scale, with AI2 and Hugging Face, they&#8217;re not big teams.</p><p>How do we make sure we&#8217;re not doing the same data project-- the same exact thing-- at the same time? And I wonder if there&#8217;s opportunities for open companies; like, LM Arena has historically released a lot of user data to better help us close this kind of what-are-people-using-models-for flywheel. But it&#8217;s just very hard to build cross-organizational model improvement pipelines, is what I think. Models become pretty vertical in terms of somebody at NVIDIA getting the feedback and making the model better.</p><p>So that would be something I would like to see this year, but I don&#8217;t have ideas for doing it well.</p><p><strong>00:36:28 Bryan Catanzaro:</strong> Yeah. At NVIDIA, we have a tradition of working really closely with organizations that use our technology. We have teams of engineers whose job is to enable success for our customers. In fact, there&#8217;s more people at NVIDIA that care about the success of people outside of NVIDIA than, I feel like sometimes, there are people that care about the success of things inside NVIDIA. So sometimes I&#8217;m like: &#8220;Hey, could we use a little bit of that energy to support Nemotron?&#8221; And the answer is yes, and NVIDIA is doing that. But I think as Nemotron matures, we&#8217;re gonna find that the organizations that work with NVIDIA to make Nemotron awesome for their business, for their use case, are gonna have a say in how Nemotron evolves, and hopefully that helps Nemotron address their needs.</p><p><strong>00:37:29 Nathan Lambert:</strong> ..
Yeah, a basic question: how many employees does it take to build all the different versions of Nemotron? I haven&#8217;t brought this up, but you also have other great types of models. I think our open model analyst, Florian, is obsessed with the Parakeet model, &#8216;cause he&#8217;s much faster at speaking than typing.</p><p>I don&#8217;t have the full list of other NVIDIA models off the top of my head, but you are releasing a lot of varieties of models. So there&#8217;s more context to my original question, which is that I think about language models because I just think AI&#8217;s progress is gonna continue to go very fast, so I focus on that as the engine. But how many people is putting this kind of movement into place?</p><p><strong>00:38:16 Bryan Catanzaro:</strong> Yeah. Well, it&#8217;s hard to know exactly, and as I said, NVIDIA is a company of volunteers. And also, these days, things are changing, right? The Parakeet team, which is an excellent team, by the way, I would say a year ago wouldn&#8217;t have really considered themselves so much part of the core Nemotron effort, but these days they absolutely are, for the obvious reason that LLMs these days need to be able to consume all sorts of data, right?</p><p>Including audio data. And as the characteristics and capabilities of Nemotron models expand, obviously the number of people contributing is gonna expand. I&#8217;d say right now there&#8217;s about five hundred people that are working pretty much full-time on Nemotron technologies in different ways. This is everything from numerics and quantization recipes to speech recognition or image understanding or pre-training, post-training, RL systems, inference software.
There&#8217;s a whole bunch of different dimensions, right?</p><p>So I&#8217;d say it&#8217;s about five hundred people. But also, we&#8217;re having our Nemotron all-hands meeting this week, and I took a look to see how many people were invited to that all-hands meeting, and it was about two thousand. Those are people around the company that are interested in working with Nemotron and either expanding its capabilities or helping its adoption. So I think the number is somewhere in between, and it&#8217;s hopefully gonna keep growing as Nemotron matures.</p><p><strong>00:40:07 Nathan Lambert:</strong> Yeah, that&#8217;s one of the greatest attestations to what you&#8217;re saying: if the interest inside the company is four times as big as the people doing it, you&#8217;re gonna keep scaling up, it seems. People are gonna find ways to help. One of the other things I&#8217;m interested in-- on the point of five hundred, it sounds like a lot of people, but with how many things you have going on, it seems also very few-- is the long-standing open-source software that you&#8217;ve had, NeMo and, I think, Megatron. They&#8217;ve been around for a long time. I think Megatron has gone through many eras. I have a note here.</p><p>These software projects have been around since, like, 2019 in some form. And it&#8217;s-</p><p><strong>00:40:51 Bryan Catanzaro:</strong> Publicly. We had our first public release in 2019, but we started earlier.</p><p><strong>00:40:56 Nathan Lambert:</strong> And something that I found when I started doing language models-- I was a late bloomer, at Hugging Face, and we&#8217;ll transition to some career talk in a few minutes-- is this:
Megatron had, like, a bad rap of being very hard to use. But now, like three years later, I hear from anyone that&#8217;s founding a new language modeling startup: &#8220;Just use Megatron.&#8221; Do you pick up on things like this? Is it just, like, random-</p><p><strong>00:41:22 Bryan Catanzaro:</strong> Well, we-</p><p><strong>00:41:22 Nathan Lambert:</strong> .. but it&#8217;s like-</p><p><strong>00:41:22 Bryan Catanzaro:</strong> We work hard on it. We&#8217;re trying really hard to make Megatron easier to use. It&#8217;s difficult. Megatron is a complicated piece of technology, and when we originally started Megatron, the point was to show the community that you could make state-of-the-art large transformer language models with NVIDIA.</p><p>I don&#8217;t know if you recall, but there were some assertions by some other companies back in 2017, when the transformer was invented, that they could only be made without NVIDIA. In fact, there were statements to that effect on official blog posts, which I think got redacted later on. But it was important for NVIDIA to show up and say, &#8220;We love language models. We love transformers. Let&#8217;s see what we could do: if we partitioned the work properly on lots of GPUs with an amazing interconnect, what kinds of models could we train?&#8221; And so that&#8217;s where the Megatron project started.</p><p>I actually came up with the name Megatron-- one of my proudest moments, I suppose. I was thinking about it, and I was like: This is a really big transformer. What&#8217;s the biggest and baddest transformer? Oh, it&#8217;s Megatron.</p><p>So that&#8217;s where the name came from. But if you think about it, that had nothing to do with usability, right? I wasn&#8217;t thinking about how to make a platform that&#8217;s really easy for other people to use.
I was just trying to show the world that, like, NVIDIA systems could be awesome for transformers. You know, that was, that was my goal.</p><p>Over the years, you know, it has evolved. We have a lot more people trying to use Megatron. We got a lot of complaints about how hard it was to use, and then we did a lot of work to try to improve the software engineering around Megatron. You know, these days Megatron software engineering is actually shared between about four different teams at NVIDIA. and we have to coordinate that work very closely.</p><p>That has also not been easy. There have been times when, you know, people wanted to fork Megatron, and then there were times when we, like, had to bring it back together, and it&#8217;s like: Look, I know forking things is always tempting, but look, better together. It&#8217;s better for all of us to keep working together.. and so I feel like Megatron the-- and especially Megatron Core, which is like a subset of Megatron that&#8217;s, like, especially protected, and we try to put more software engineering into that -- that has gotten dramatically better since we started paying more attention to it as a company. are we done yet? No, there&#8217;s a lot, a lot, a lot more work.</p><p><strong>00:43:52 Nathan Lambert:</strong> a ba-- a basic question: Is Megatron, or Megatron Core, like, what Nemotron is trained on? And also-- And it&#8217;s also something that many of the hottest, like, AI startups are training their models on. I would guess that there&#8217;s nothing else that does that. So, like, could you summarize why it&#8217;s so hard?</p><p><strong>00:44:11 Bryan Catanzaro:</strong> Well, you know, there&#8217;s a, there&#8217;s a lot of other great frameworks out there. Megatron&#8217;s not the only one. and you know, we&#8217;re happy about that. NVIDIA doesn&#8217;t need to control the space. 
What we, what we do wanna do is make sure that we&#8217;re putting our products forward in the best light, you know, and it&#8217;s a challenging problem.</p><p>We&#8217;ve got so many things going on with precision and you know, the networking. Like, those questions, like, the software is so complicated. these days, you know, we&#8217;re pre-training our Nemotron-3 Super and Ultra models using FP4 which is a thing that, you know, hasn&#8217;t been done publicly anyway and something that, you know, we&#8217;re pretty excited about because our GPUs have really awesome FP4 throughput. But obviously, the numerical challenges of, like, trying to train a state-of-the-art language model using four bits is non-trivial. So, like, you know, all of that work has to go into Megatron, into Transformer Engine which is a, another open-source project that Megatron relies on and, you know coordinating all of that making sure that, you know, we can actually deliver the benefits of NVIDIA systems to people that are trying to make state-of-the-art models, that&#8217;s really important to us.</p><p>And, you know, of the five hundred or so people working on Megatron, like, a pretty good fraction.. or on Nemotron, a pretty good fraction of them are working on these kinds of systems issues, right? Because NVIDIA at its core, is a systems company. and Megatron, you know, Nemotron&#8217;s first job really is about systems, you know, and so we, we care, we care deeply about that.</p><p><strong>00:45:51 Nathan Lambert:</strong> Yeah. I mean, from my perspective, I was at Hugging Face before AI2, and Hugging Face is, like, the best company at doing public work. But also, and switching to AI2 and focusing on, like, we&#8217;re focused on the output artifact the most. 
Seeing the different type-- Like, it&#8217;s such a different type of work, going from building a tool that&#8217;s good for training models to building a tool that&#8217;s good for everybody else and whatever the heck use case they have.</p><p><strong>00:46:13 Bryan Catanzaro:</strong> It&#8217;s different.</p><p><strong>00:46:13 Nathan Lambert:</strong> So I think-</p><p><strong>00:46:13 Bryan Catanzaro:</strong> Yeah. Different work.</p><p><strong>00:46:14 Nathan Lambert:</strong> To do both is like.. I&#8217;m, I&#8217;m happy that AI2&#8217;s repos aren&#8217;t that popular in terms-</p><p><strong>00:46:21 Bryan Catanzaro:</strong> Oh,</p><p><strong>00:46:21 Nathan Lambert:</strong> .. of open-source adoption because, like, we can&#8217;t handle it. We just can&#8217;t. It&#8217;s, like, so hard because it&#8217;s people-- it&#8217;s, like, it ends up being researchers that are supporting it, and we don&#8217;t have the ability to scale the organization structure. So I just think, like, that&#8217;s a, that&#8217;s a very fun turnaround for me to think of all these things happening at once.</p><p><strong>00:46:39 Bryan Catanzaro:</strong> Yeah. Well, thanks for noticing we&#8217;re putting effort in. I would say Megatron is still not nearly as user-friendly as Hugging Face libraries. Like.. Hugging Face libraries are legendary, and I admire the work they&#8217;ve done to make the community so productive. People, you know, are able to get so much research done thanks to the work that, you know, Hugging Face has put into their library. So you know, my hat&#8217;s off to them as well.</p><p><strong>00:47:06 Nathan Lambert:</strong> Yeah. 
One of my hot takes, you don&#8217;t have to reply, is that Hugging Face and NVIDIA have been very good partners.</p><p><strong>00:47:10 Bryan Catanzaro:</strong> Oh, absolutely.</p><p><strong>00:47:10 Nathan Lambert:</strong> And it&#8217;s like bringing that Hugging Face culture to the NVIDIA stuff would be so good. It&#8217;s just so hard, so I don&#8217;t know how that would work, but-</p><p><strong>00:47:17 Bryan Catanzaro:</strong> We&#8217;re trying, you know, and you know, it is, it is challenging. NVIDIA is always a company that is gonna prioritize speed like hardware speed, above really anything else, &#8216;cause that&#8217;s, like, who we are. I am always trying to make the case that developer speed is important, too, right? It&#8217;s like there&#8217;s different ways of thinking about speed. and it is definitely the case that a lot of NVIDIA&#8217;s software is so cumbersome to use that you know people can&#8217;t get the actual hardware speed as fast as it should be because they just give up.</p><p>You know, they just don&#8217;t, don&#8217;t even figure out how to use that. So I think NVIDIA&#8217;s making strides there. I think the, the company is understanding more deeply how important developer experience is, and I hope we continue to push that, so that the benefits of all of the systems technology that NVIDIA works so hard on can be more widely used. but at the same time, you know, there is gonna be a tension between those things. It&#8217;s, it&#8217;s not gonna go away, and you know, to a certain extent, I think that&#8217;s just life on planet Earth.</p><p><strong>00:48:26 Nathan Lambert:</strong> It is. I think you&#8217;re do- you&#8217;re doing a good job, and I&#8217;m gonna kind of shift gears in this interview. So I&#8217;ve.. 
In becoming more back in language- in becoming a person that works in language models, I&#8217;ve seen your name more and more times.</p><p>I was like, &#8220;Bryan Catanzaro, like, where have I seen this?&#8221; And then I went and did the research of the Berkeley PhD in, like.. It says April of 2021, you gave a Berkeley EECS Colloquium titled &#8220;Applications of Deep Learning and Graphics, Conversational AI, and Systems Design.&#8221; I&#8217;m not even gonna posit that I actually went, but that&#8217;s definitely where I remembered the name from in grad school. And we both have backgrounds that aren&#8217;t traditionally in AI and end up working in language models. I just wanted to, like-- what have you learned from your path th- through NVIDIA into what, like, people should be thinking about with AI or open models today?</p><p>This could be career reflections, like technical reflections. I just think that there&#8217;s-- there are actually a lot of people that come from all over the, like, STEM field to work in AI, so giving it-</p><p><strong>00:49:29 Bryan Catanzaro:</strong> Sure</p><p><strong>00:49:29 Nathan Lambert:</strong> .. space to think about is-</p><p><strong>00:49:31 Bryan Catanzaro:</strong> .. useful, even if it&#8217;s just like, it was the big problem, and I wanted to go solve it. Well, I think, you know I&#8217;ve, I&#8217;ve had a lot of opportunity and a lot of luck in my career. I think in hindsight, it seems like an extraordinarily lucky thing that, you know, I did my first internship at NVIDIA in 2008, and I was, like, building machine learning models on the GPU, and I went to NVIDIA, and nobody else was really doing that. And I was like, &#8220;Hey, like, we should have more people doing machine learning on the GPU.</p><p>I think this could be an opportunity.&#8221; And you know, it took a few years for me to make any headway. NVIDIA didn&#8217;t really wanna listen to me. I was a brand-new PhD. 
I was in the research organization, which is very independent, but, you know, sometimes struggles to change the way that the, you know, the bigger company thinks about things.</p><p>And, and yet, I just had this conviction, you know, I just was following my heart about what I think is gonna be important, what do I think could really change the world? And that has been, I think, the thread that has taken me through my whole career, is that I&#8217;m constantly trying to refine my beliefs about what matters and then hold to them. And that.. I don&#8217;t know how helpful it is to say that, but I feel like sometimes people you know, tend to follow the, whatever the thing is that people are talking about on Twitter.</p><p>And like I&#8217;ve- I&#8217;ve done a lot of unpopular things during my career because I believed in them, you know? I remember I published my first paper in 2008 on, at ICML, on training support vector machines on the GPU, and I actually had somebody at the conference, it was in Helsinki at dinner, you know, we were all telling each other what we&#8217;re doing, and, and I was like: Yeah, I wanna help people train bigger models on bigger data sets with GPUs. And, and I had you know, a couple of people just say, &#8220;Well, why are you here at ICML? That just doesn&#8217;t really feel like a good thing for us.&#8221; And in 2008, ICML was mainly about new mathematical frameworks for thinking about data, and you know, maybe if you trained a model at all, you would train one on your laptop.</p><p>You know, that was the state of machine learning in 2008. So for somebody to come in and say, &#8220;I think I want to focus on, like, parallel computing, new kinds of hardware for machine learning, programming frameworks for machine learning, so that, you know, we- more people can try inventing new models on complicated machines with a lot more compute throughput on bigger data sets,&#8221; that was like a, an unpopular thing. 
At least it felt very unpopular. I felt very marginalized at the time by the community.</p><p>But I believed in it, you know? I just felt like, look, technology.. Like I have this sense of, like, where do I think technology is going? I knew that traditional computing was running out of steam.</p><p>You know, I had, I had done a few internships at Intel, and I was trying to help Intel make processors that ran at, like, ten gigahertz back in 2001, and, you know, it was, like, clear that th- they were running into a wall. And I was thinking: Okay, so if the compute hardware is gonna have to be different, it&#8217;s gonna be more restricted. It&#8217;s not gonna be able to be so general-purpose in order to get speed. What kinds of applications are gonna have, like, an infinite need for more computing?</p><p>And I thought, well, machine learning and AI, that could really change the world if it ever actually worked. But, you know, but, you know, back then it, back then, it kinda worked inside of Google. outside of Google, it kind of didn&#8217;t work. and so I had kinda these signals, like it was possible, but it was hard. It was a little weird. It was a little niche.</p><p>I was a little bit caught in between different fields, like the systems people didn&#8217;t think I was systems enough, and the machine learning people didn&#8217;t think I was machine learning enough. But, but I believed in what I was doing, and I found a way to keep following that belief. And, you know, ultimately it was very rewarding when all of a sudden NVIDIA decided, &#8220;Hey deep learning is changing the world. What do we know about deep learning?&#8221; And then it was like: Oh, well, Bryan&#8217;s been doing that for several years, and he&#8217;s written some libraries that we could turn into a product.</p><p>Let&#8217;s go do that. And, you know, so that all happened really quickly after many years of nothing happening, you know? 
And that was really obviously an amazing opportunity for me. you know, an- another thing that was important to me, I left NVIDIA in 2014 to go work at the Silicon Valley AI Lab at Baidu with a group of really talented people, including Andrew Ng and Dario Amodei and Awni Hannun and Adam Coates, and you know, this was a, a really once-in-a-lifetime opportunity, I think for me, to learn some things that would have been hard for me to learn on my own. you know, I felt at the time at NVIDIA that although I had this great opportunity to help NVIDIA become an AI company, and I was doing that, and I was succeeding at that back in 2013, 2014, I also felt like I really wanted to learn from a broader community of people applying machine learning and AI to solve really important business problems. And so going to work at Baidu really gave me that chance. and I was there for a couple of years, learned a ton. very grateful to the team there, especially to Andrew Ng, who, who encouraged me to, to join with him on that. and then, you know, I ran into limits of what I could do in California, working for a Chinese company.</p><p>I was thinking about, you know, what should I do next? And Jensen asked me to come back and build an applied research lab at NVIDIA in 2016. and.. I wasn&#8217;t sure, like, if that was a good idea. I thought NVIDIA&#8217;s already grown so much, you know.</p><p>The, the years from twenty fourteen to twenty sixteen, NVIDIA actually grew a lot. these days you look back at it, and you&#8217;re like: It was still really tiny. But, but back then, I was like: I don&#8217;t know, maybe NVIDIA&#8217;s already tapped out. I don&#8217;t know if you recall, in twenty sixteen, there was already, like, ten different companies making GPU competitors, right? The TPU had already been out for a while and you know, it, it wasn&#8217;t clear that NVIDIA was gonna become as large as it, as it has.</p><p>But I believed in the opportunity. I believed in the people. 
you know, one of the things I loved about NVIDIA was that it&#8217;s a very stable organization. So Jensen, he&#8217;s been running it since he founded it in nineteen ninety-three. my boss, Jonah Alben, who&#8217;s an absolutely extraordinary person has been here for you know quite a, quite a long time, almost since the very beginning of NVIDIA. And these people a lot of the leadership at NVIDIA they love the work.</p><p>Their heart is in the work. Jensen and Jonah and many other leaders at NVIDIA, they don&#8217;t need to be doing this, right? They, they have earned the right to go sit on a beach and drink mai tais all day, but their heart is in the work, and they work incredibly hard. you know, the.. I feel like if there was an Olympics for email, you know Jensen would get the gold medal.</p><p>You know, like it&#8217;s, it&#8217;s unfathomable to me, like, how much information he&#8217;s able to process. and it&#8217;s a skill that he&#8217;s built up over a long time running this company, but it&#8217;s also a reflection of his commitment to the work. And I felt like working at a place where we&#8217;ve got this very stable organization that loves the work, that really wants to change the world. You know, why does, why does Jensen get up in the morning? Well, it&#8217;s-- this is his chance to do something meaningful.</p><p>I thought, associating with these people, you know, I could do worse. I could-- I think I could learn from this as well. And so I came to NVIDIA, and back then it was really hard to explain to people why I was trying to build an AI lab inside of NVIDIA. At, at the time, NVIDIA wasn&#8217;t doing very much AI, and so I had to kind of develop a vision for that and then explain it to people. that&#8217;s ended up being a really good idea for me as well.</p><p>You know, the lab, I think, has really helped NVIDIA. 
you know, Megatron, I think, has really shown the industry, like, how valuable NVIDIA systems can be for language modeling, which is, which is awesome. DLSS, you know I&#8217;m continuing to, to push DLSS forward. Very excited about making graphics, you know more efficient with AI. These days, you know, fifteen out of every sixteen pixels a gamer sees are rendered by AI models that, you know, my team developed, and that then makes the GPU ten times more power efficient.</p><p>This is a really exciting you know, thing for me to be involved with, something that I&#8217;ve, you know, dreamed about for years. So, so that&#8217;s the kind of thing that continues to push me forward, is that I have strong beliefs about what I think is possible, where I think technology&#8217;s going, and I&#8217;m willing to do things that are we- weird and unpopular but, you know, basically following my convictions. I&#8217;m very much always thinking about the people I&#8217;m working with, the tribe. You know, I think tribes matter enormously. like you know if I..</p><p>So, so back when I was a grad student, I was working on programming models for machine learning. I joined the Python tribe. There are other people that were in the Scala tribe, and the people that did their work in the Scala tribe, trying to make programming models for machine learning in, like, two thousand and ten you know, that work, although a lot of it was technically excellent, didn&#8217;t matter to the community as much as the people who were in the Python tribe. It ended up.. and, you know, it kind of sucks sometimes that the world is tribal like this, but it&#8217;s just the case.</p><p>You know, that like the people that you work with, the community that you work with has a big impact on the problems you think about and then the impact that your work has. So I think a lot about the people and the tribes that I&#8217;m collaborating with or that I&#8217;m part of. 
and you know, that&#8217;s, that&#8217;s kind of been the thread that has carried me through my career.</p><p><strong>00:59:56 Nathan Lambert:</strong> Yeah. Than- thanks for sharing this full arc. I think you&#8217;ve said things that I tell people but in different languages, and the first one, the early days, it seems like there can be space in between fields, where people-- two fields will have their way of describing things, but both of them are probably incomplete, and there can be space there, which is a lot of what I was doing transitioning from novel robots to model-based RL, where I, like, didn&#8217;t sit and bear in the actual AI lab, but I started doing AI with my, like, total electrical engineering friends. And then the second thing is, like, I&#8217;d wholeheartedly recommend this to people, is, like, choose your work based on the people and people that sincerely are in it for-.. the, what they want to do, and a lot of-</p><p><strong>01:00:41 Bryan Catanzaro:</strong> And follow your beliefs. You know, think about it. What do you believe in? And it&#8217;s okay to change your mind, you know, but, like, figure out what is it that you believe in.</p><p>Ask yourself every day: Do I still believe in that? If I do, what next? You know. If I don&#8217;t, well, what do I believe in?</p><p>You know, that&#8217;s been really important to me. I think too many people end up kind of just following trends. That&#8217;s not usually helpful because the trends are too late. So if you wanna, if you wanna change the world, you need to be ahead of the trends, and you need to know, you know, it-- trends-- I don&#8217;t think trends in computing are just fashion.</p><p>I think there&#8217;s truth that drives those trends. Not always, but often. You know, it&#8217;s just-- this is, it&#8217;s there&#8217;s kind of an inevitable force of gravity. 
It just can be really hard to par- parse out the noise and figure out what is the truth that is gonna push the industry forward, and how can you push that with it.</p><p>You know, if you can join with that, you can accomplish great things.</p><p><strong>01:01:36 Nathan Lambert:</strong> Yeah, I agree. I think in building language models, it&#8217;s like you want to build a model that the community wants in six months. I think if you&#8217;re building a model to compete.. with the models that are already out, you&#8217;re not gonna keep up. And I think that, like, what is the right thing to build in open language models in six months, and, like, where do you need to try to steer things, is one of the hardest problems that I think about. So I don&#8217;t-- if you want to close with any predictions where you see, like, open models, like, if we&#8217;re-- if you&#8217;re gonna be here at the end of twenty-six, if there&#8217;s anything you think will be far more obvious than it is today, or any bets that you want to make, I think it&#8217;s kind of a good place to wrap.</p><p><strong>01:02:18 Bryan Catanzaro:</strong> Well, predictions are always hard, and I don&#8217;t feel like I&#8217;m very good at making predictions. But I am-- I feel like I am good at identifying what I believe in, and what I believe in right now is that compute remains one of the fundamental challenges behind AI. It has been that way for a very long time and I think it continues to be. I think as we find new ways to apply compute to AI, we discover new forms of scaling laws that help AI become more useful and therefore, it becomes more widespread.</p><p>So I&#8217;m gonna keep thinking about compute. I continue to believe that the fastest-- that, you know, the way to think about AI is not just in terms of absolute intelligence, but rather intelligence per second. 
You know, there&#8217;s some sort of normalization in there that relates to how fast a model can think, how fast a model can be trained or post-trained. You know, that models that kind of incorporate this compute acceleration characteristic, where they&#8217;re thinking about intelligence per unit time, those are gonna end up winning because they end up getting trained on more data, they end up getting post-trained with more cycles, they end up with more iterations during thinking when they&#8217;re deployed. and you know, of course, if they happen to fit the hardware really well whatever hardware that is then, you know, that can have a pretty non-trivial effect on the intelligence as well.</p><p>So that&#8217;s something that I really believe in. I really believe in AI as an infrastructure. You know, there&#8217;s, there&#8217;s different ways of thinking about AI. I think some people believe AI is more like the singularity, like once AGI has been declared, then the whole world is different forever, and all humans have lost their jobs and, you know, there&#8217;s a lot of like-- there&#8217;s a lot of things about AI that people believe that I personally don&#8217;t believe.</p><p>You know, I believe, first of all, that intelligence is very multifaceted that it is not easy to pin down, that as soon as we try to pin down intelligence, we find that there&#8217;s very many more forms of intelligence that aren&#8217;t covered by that. So, for example, a model that achieves gold medal status on the International Math Olympiad, that&#8217;s an extraordinary achievement, but it doesn&#8217;t make me have no job, right? Like, I&#8217;m actually not solving math problems all day, even though, like, having the ability to solve math problems is clearly very useful. 
And you know, it&#8217;s also the case that intelligence is, you know, is kind of like a potential energy, not a kinetic energy, right?</p><p>In order to transform intelligence into kinetic energy, it needs to have a platform. It needs to be applied in the proper way. and you know, that is why I believe in open models and open- openly developed and deployed intelligence. I believe every company, every organization, has secrets that only they know. They have special data, they have special ways of thinking about their problems, their customers, their solutions, and they&#8217;re gonna know how to apply AI better than anyone else.</p><p>And so AI as infrastructure that transforms companies, turbocharges them, allows them to take the things they know and multiply their impact, that&#8217;s something that I believe in more than AI as an event, that one day, when it happens, makes everyone obsolete. I don&#8217;t.. I just don&#8217;t believe in that. you know, I often joke that, like if, for example, the CEO were to retire at some point, and we needed to find a replacement, you know, handing out an IQ test or asking, you know, who has the highest SAT score, that would not be a very good way of finding a replacement, you know? intelligence is just far too complex for that. And so you know, so this, these beliefs, you know, you can disagree with me about anything that I just said, and I&#8217;m not offended by that.</p><p>I have a lot of friends that do. but you know, I&#8217;m asking myself, well, if I believe that intelligence has these characteristics and that AI is gonna change the world by turbocharging institutions that exist a-and also creating new applications that we haven&#8217;t even dreamed of yet rather than replacing all humans, then, you know, how do I go about building that, you know? And so that&#8217;s, that&#8217;s kind of the direction that I&#8217;m on right now.</p><p><strong>01:07:00 Nathan Lambert:</strong> Yeah, I love it. 
I agree, I agree that we&#8217;re entering an interesting era where the open models are taking so many different shapes and sizes and have so many different strengths and trade-offs, that there can start to be interesting interplay as an ecosystem, where there&#8217;s just so many different things going on. And I think I like your idea of potential energy, and you have to build things that are kind of unclear of what-- It&#8217;s like you have to build the energy in a way, and you don&#8217;t really know what the goal is, but you have to do.. try to build these good models. So I appreciate it, and-</p><p><strong>01:07:30 Bryan Catanzaro:</strong> Yeah, and then let people apply it. Let it-- let them make the kinetic energy happen.</p><p><strong>01:07:35 Nathan Lambert:</strong> I agree. Thanks for coming on.</p><p><strong>01:07:37 Bryan Catanzaro:</strong> Thanks so much for inviting me. It&#8217;s been a great conversation.</p>]]></content:encoded></item><item><title><![CDATA[Latest open artifacts (#18): Arcee's 400B MoE, LiquidAI's underrated 1B model, new Kimi, and anticipation of a busy month]]></title><description><![CDATA[Tons of useful "niche" models and anticipation of big releases coming soon.]]></description><link>https://www.interconnects.ai/p/latest-open-artifacts-18-arcees-big</link><guid isPermaLink="false">https://www.interconnects.ai/p/latest-open-artifacts-18-arcees-big</guid><dc:creator><![CDATA[Florian Brand]]></dc:creator><pubDate>Mon, 02 Feb 2026 13:03:33 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!pZTQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b3fda65-f514-481a-b859-bf6107499630_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>January was on the slower side of open model releases compared to the record-setting year that was 2025. 
While there were still plenty of very strong and noteworthy models, most of the AI industry is looking ahead to models coming soon. There have <a href="https://www.bloomberg.com/news/articles/2026-01-27/china-s-moonshot-unveils-new-ai-model-ahead-of-deepseek-release?utm_source=chatgpt.com">been</a> <a href="https://www.theinformation.com/articles/deepseek-release-next-flagship-ai-model-strong-coding-ability?utm_source=chatgpt.com">countless</a> <a href="https://www.reuters.com/technology/deepseek-launch-new-ai-model-focused-coding-february-information-reports-2026-01-09/?utm_source=chatgpt.com">rumors</a> of DeepSeek V4&#8217;s looming release and impressive capabilities alongside a far more competitive open model ecosystem.</p><p>In the general AI world, <a href="https://www.reddit.com/r/singularity/comments/1qtc4jg/sonnet_5_next_week/">rumors</a> for Claude Sonnet 5&#8217;s release potentially being <em>tomorrow</em> have been under debate all weekend. We&#8217;re excited for what comes next &#8212; for now, plenty of new open models to tinker with.</p><h3><strong>Our Picks</strong></h3><ul><li><p><strong><a href="https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct">LFM2.5-1.2B-Instruct</a></strong> by <a href="https://huggingface.co/LiquidAI">LiquidAI</a>: Liquid continued pretraining from 10T (of their 2.0 series) to 28T tokens and it shows! 
This model update really surprised us: In our vibe testing, it came very close to Qwen3 4B 2507 Instruct, which we use every day. And this model is over 3 times smaller! In a direct comparison against the (still bigger) Qwen3 1.6B, we preferred LFM2.5 basically every time. And this time, they released all the other variants at once, i.e., a <a href="https://huggingface.co/LiquidAI/LFM2.5-1.2B-JP">Japanese</a> version, a <a href="https://huggingface.co/LiquidAI/LFM2.5-VL-1.6B">vision</a> and an <a href="https://huggingface.co/LiquidAI/LFM2.5-Audio-1.5B">audio model</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jthG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff45dd0ef-1d07-4c9e-9150-55d0524114f1_1896x1334.png"><img src="https://substackcdn.com/image/fetch/$s_!jthG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff45dd0ef-1d07-4c9e-9150-55d0524114f1_1896x1334.png" width="1456" height="1024" alt="image"></a></figure></div></li><li><p><strong><a href="https://huggingface.co/arcee-ai/Trinity-Large-Preview">Trinity-Large-Preview</a></strong> by <a href="https://huggingface.co/arcee-ai">arcee-ai</a>: An ultra-sparse MoE with 400B total and 13B active parameters, trained by an American company. They also released <a href="https://github.com/arcee-ai/trinity-large-tech-report/blob/main/Arcee%20Trinity%20Large.pdf">a tech report</a> and two base models, one &#8220;true&#8221; <a href="https://huggingface.co/arcee-ai/Trinity-Large-TrueBase">base</a> model pre-annealing and the <a href="https://huggingface.co/arcee-ai/Trinity-Large-Base">base model</a> after the pre-training phase. 
Many more insights, including technical details and their motivation, can be found in our interview with the founders and pre-training lead:</p><div class="digest-post-embed" data-attrs="{&quot;nodeId&quot;:&quot;1c639511-dba8-49c4-b6d0-0fdcb7f8ce34&quot;,&quot;caption&quot;:&quot;Arcee AI is a the startup I&#8217;ve found to be taking the most real approach to monetizing their open models. With a bunch of experience (and revenue) in the past in post-training open models for specific customer domains, they realized they needed to both prove themselves and fill a niche by pretraining larger, higher performance open models built in the U&#8230;&quot;,&quot;cta&quot;:&quot;Watch now&quot;,&quot;showBylines&quot;:true,&quot;size&quot;:&quot;sm&quot;,&quot;isEditorNode&quot;:true,&quot;title&quot;:&quot;Arcee AI goes all-in on open models built in the U.S.&quot;,&quot;publishedBylines&quot;:[{&quot;id&quot;:10472909,&quot;name&quot;:&quot;Nathan Lambert&quot;,&quot;bio&quot;:&quot;ML researcher making sense of AI research, products, and the uncertain technological future. PhD from Berkeley AI. 
Experience at Meta, DeepMind, HuggingFace.&quot;,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!RihO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8fedcdfb-e137-4f6a-9089-a46add6c6242_500x500.jpeg&quot;,&quot;is_guest&quot;:false,&quot;bestseller_tier&quot;:100}],&quot;post_date&quot;:&quot;2026-01-27T22:47:24.860Z&quot;,&quot;cover_image&quot;:&quot;https://substack-video.s3.amazonaws.com/video_upload/post/184831986/a905c029-5a44-42db-981c-72f2de48d1d1/transcoded-1768858768.png&quot;,&quot;cover_image_alt&quot;:null,&quot;canonical_url&quot;:&quot;https://www.interconnects.ai/p/arcee-ai-goes-all-in-on-open-models&quot;,&quot;section_name&quot;:null,&quot;video_upload_id&quot;:&quot;a905c029-5a44-42db-981c-72f2de48d1d1&quot;,&quot;id&quot;:184831986,&quot;type&quot;:&quot;podcast&quot;,&quot;reaction_count&quot;:31,&quot;comment_count&quot;:1,&quot;publication_id&quot;:48206,&quot;publication_name&quot;:&quot;Interconnects AI&quot;,&quot;publication_logo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!djof!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc52e8097-8f3d-4f7e-808b-2f4ad37f3b52_720x720.png&quot;,&quot;belowTheFold&quot;:false,&quot;youtube_url&quot;:null,&quot;show_links&quot;:null,&quot;feed_url&quot;:null}"></div></li><li><p><strong><a href="https://huggingface.co/moonshotai/Kimi-K2.5">Kimi-K2.5</a></strong> by <a href="https://huggingface.co/moonshotai">moonshotai</a>: A continual pre-train on 15T tokens. Furthermore, this model is also multimodal! People on Twitter have <a href="https://x.com/thdxr/status/2017756481559339221?s=20">replaced Claude 4.5 Opus with K2.5</a> for tasks that need a less capable but cheaper model. 
However, the writing capabilities that K2 and its successor were known for have suffered in favor of coding and agentic abilities.</p></li><li><p><strong><a href="https://huggingface.co/zai-org/GLM-4.7-Flash">GLM-4.7-Flash</a></strong> by <a href="https://huggingface.co/zai-org">zai-org</a>: A smaller version of GLM-4.7, matching the size of the small Qwen3 MoE at 30B total and 3B active parameters.</p></li><li><p><strong><a href="https://huggingface.co/LLM360/K2-Think-V2">K2-Think-V2</a></strong> by <a href="https://huggingface.co/LLM360">LLM360</a>: A truly open reasoning model building on top of their previous line of models.</p></li></ul><h3><strong>Models</strong></h3><p>Reading through the rest of this issue, we were impressed by the quality of the &#8220;niche&#8221; small models across the ecosystem. From OCR to embeddings and song generation, this issue has some of everything, and there really tend to be open models that excel at any modality needed today &#8212; they can just be hard to find!</p>
      <p>
          <a href="https://www.interconnects.ai/p/latest-open-artifacts-18-arcees-big">
              Read more
          </a>
      </p>
   ]]></content:encoded></item><item><title><![CDATA[Thoughts on the job market in the age of LLMs]]></title><description><![CDATA[On standing out and finding gems.]]></description><link>https://www.interconnects.ai/p/thoughts-on-the-hiring-market-in</link><guid isPermaLink="false">https://www.interconnects.ai/p/thoughts-on-the-hiring-market-in</guid><dc:creator><![CDATA[Nathan Lambert]]></dc:creator><pubDate>Fri, 30 Jan 2026 15:49:25 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/f886cce3-cb8f-425e-8d9b-9a2857fa3ca8_3182x1790.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>There&#8217;s a pervasive, mutual challenge in the job market today for people working in (or wanting to work in) the cutting edge of AI. On the hiring side, it often feels impossible to close, or even get interest from, the candidates you want. On the individual side, it quite often feels like the opportunity cost of your current job is extremely high &#8212; even if on paper the actual work and life you&#8217;re living is extremely good &#8212; due to the crazy compensation figures.</p><p>For established tech workers, the hiring process in AI can feel like a bit of a constant fog. For junior employees, it can feel like a wall.</p><p>In my role as a hybrid research lead, individual contributor, and mentor, I spend a lot of time thinking about how to get the right people for me to work with and the right jobs for my mentees.</p><p>The advice here is shaped by the urgency of the current moment in LLMs. These are hiring practices optimized for a timeline of relevance that may need revisiting every 1-2 years as the core technology changes &#8212; which may not be best for long-term investment in people, the industry, or yourself. 
I&#8217;ve <a href="https://www.interconnects.ai/p/burning-out">written separately</a> about the costs of this pace, and don&#8217;t intend to carry this on indefinitely.</p><p>The most defining feature of hiring in this era is the complexity and pace of progress in language models. This creates two categories of talent. For one, senior employees are much more coveted because they have more context on how to work within and steer complex systems over time. It takes a lot of perspective to understand the right direction for a library when your team can make vastly more progress on incremental features given AI agents. Without vision, the repositories can get clogged with too many small additions. With powerful AI tools, I expect the impact of senior employees to grow faster than adding junior members to the team would. </p><p>This view on the importance of key senior talent has been a recent swing, <a href="https://www.interconnects.ai/p/get-good-at-agents">given my experiences and expectations for current and future AI agents</a>, respectively:</p><blockquote><p>Every engineer needs to learn how to design systems. Every researcher needs to learn how to run a lab. Agents push the humans up the org chart.</p></blockquote><p>On the other side, junior employees have to prove themselves in a different way. The number one defining trait I look for in a junior engineering employee is an almost fanatical obsession with making progress, both in personal understanding and in modeling performance. The only way to learn how the sausage gets made is to do it, and catching up takes a lot of hard work in a narrow area to cultivate ownership. With sufficient motivation, a junior employee can scale to impact quickly, but without it, they&#8217;re almost replaceable with coding agents (or will be soon). This is very hard work and hard to recruit for. 
The best advice I have on finding these people is &#8220;vibes,&#8221; so I am looking for advice on how to find them too!<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><p>For one, when I brought <a href="https://substack.com/@xeophon">Florian Brand</a> on to help follow open models for Interconnects, in our first chat he literally said &#8220;since ChatGPT came out I&#8217;ve been fully obsessed with LLMs.&#8221; You don&#8217;t need to reinvent the wheel here &#8212; if it&#8217;s honest, people notice.</p><p>For junior researchers, there&#8217;s much more grace, but that&#8217;s due to them working in an educational institution first and foremost, instead of the understatedly brutal tech economy. A defining feature that creates success here is an obsession with backing up claims. So a new idea improves models, why? So our evaluation scores are higher, what does this look like in our harness? Speed of iteration follows from executing on this practice. Too many early-career researchers try to build breadth of impact (e.g. collecting contributions on many projects) before clearly demonstrating, to themselves and their advisors, depth. The best researchers then bring both clarity of results and velocity in trying new ideas.</p><p>Working in academia today is therefore likely to be a more nurturing environment for junior talent, but it comes with even greater opportunity costs financially. I&#8217;m regularly asked if one should leave a Ph.D. to get an actual job, and my decision criterion is fairly simple. If you&#8217;re not looking to become a professor and have an offer to do <em>modeling</em> research at a frontier lab (Gemini, Anthropic, OpenAI is my list), then there&#8217;s little reason to stick around and finish your Ph.D.</p><p>The little reason that keeps people often ends up being personal pride in doing something hard, which I respect. 
It&#8217;s difficult to square these rather direct pieces of career advice with my other recommendation of choosing jobs based on the people, as you&#8217;ll spend a ton of your life with them, more than with the content of what you&#8217;ll be doing. Choosing jobs based on people is one of the best ways to choose your job based on the so-called &#8220;vibes.&#8221;</p><p>Working in a frontier lab in <em>product</em> as an alternative to doing a Ph.D. is a path to getting absorbed in the corporate machine and not standing out, reducing yourself to the standard tech career ladder. Part of what I feel like <a href="https://www.interconnects.ai/p/my-path-into-ai">works so well for me</a>, and for other people at Ai2, is having the winning combination of responsibility, public visibility, and execution in your work. There is something special for career progression that comes from working publicly, especially when the industry is so closed, where people often overestimate your technical abilities and output. Maybe this is just the goodwill that comes from open-source contributions paying you back.</p><p>If you go to a closed lab, visibility is rarely possible, so you rely on responsibility and execution. It doesn&#8217;t matter if you execute if you&#8217;re doing great work on a product or model that no one ever touches. 
Being in the core group matters.</p><p>This all then comes back to the pipeline for finding these people.</p><p>There are many imperfect signals out there, both positive and negative. For individuals building their portfolio, it&#8217;s imperative to avoid negative signals because the competition for hiring is so high. A small but clear negative signal is a junior researcher being a middle author on too many papers. Just say no; it helps you.</p><p>The positive signals are messier, but still doable. It&#8217;s been said that you can tell someone is a genius by reading one Tweet from them, and I agree with this. The written word is still an incredibly effective and underutilized communication form. One excellent blog post can signify real, rare understanding. The opposite holds true for AI slop. One AI slop blog post will kill your application.</p><p>The other paths I often advise for people who reach out asking how to establish a career in AI are open-source code contributions or open research groups (e.g. EleutherAI). I&#8217;ve seen many more success cases on the former, in open-source code. Still, it&#8217;s remarkably rare, because A) most people don&#8217;t have the hardware to add meaningful code to these popular LLM repositories and B) most people don&#8217;t stick with it long enough. Getting to the point of making meaningful contributions historically has been very hard.</p><p>Doing open-source AI contributions could be a bit easier in the age of coding agents, as a lot of the limiting factors today are just bandwidth in implementing long to-do lists of features, but standing out amid the sea of AI slop PRs and Issues will be hard. That&#8217;ll take class, creativity, humanity, and patience. 
So, being able to run some tiny models on a $4000 DGX Spark is an investment, but it at least makes it somewhat doable to iterate on meaningful code contributions to things like HuggingFace&#8217;s ML libraries (I&#8217;ve been <a href="https://x.com/natolambert/status/2015473455530225939?s=20">writing</a> and <a href="https://github.com/natolambert/dgx-spark-setup">sharing</a> a lot about how I&#8217;m using the DGX Spark to iterate on our codebases at Ai2).</p><p>Back to the arc of hiring: the above focused on traits, but the final piece of the puzzle is alignment. The first question to ask is &#8220;is this person good?&#8221; The second question is, &#8220;will this person thrive here?&#8221; Every organization has different constraints, but especially in small teams, the second question defines your culture. In a startup, if you grow too fast you definitely lose control of your culture. This isn&#8217;t to say that the company won&#8217;t have a strong or useful culture, it&#8217;s to say you can&#8217;t steer it. The culture of an organization is the byproduct of how all the individuals interact. You do not want to roll the dice here.</p><p>Personally, I&#8217;m working on building out a few more spots in a core post-training methods team at Ai2. Post-training recipes have gotten very complicated, and we&#8217;re working on making them easier to run while doing research on fundamentals such as post-training data mixing and scaling laws. To be a little vague, getting the post-training recipes done for both Olmo 3 and Olmo 2 was... very hard on the team. At the same time, post-training hasn&#8217;t gotten much more open, so hiring through it and doing the hard work is the only way.</p><p>Ideally I would hire one engineer and one researcher, both fairly senior, meaning at least having a Ph.D. or a similar number of years working in technology. Junior engineers with some experience and the aforementioned obsession would definitely work.</p><p>This callout serves as a good lesson for hiring. It is intentional that people should self-filter for this; no one likes it when you way overreach on selling yourself for a job. I also intentionally make people find my email for this as an exercise. The art of cold emailing and approaching people in the correct pipelines is essential to getting hired. Many people you look up to in AI read their emails; the reason you don&#8217;t get a response is that you didn&#8217;t format your email correctly. The best cold emails show the recipient that they learned from it or obviously benefitted from getting it. 
Platitudes and compliments are of course nice to receive, but the best cold emails inspire action.</p><p>Two of the most recent people I helped hire at Ai2, I learned of through these side-door job applications (i.e. not found through the pile of careers page applications). I learned of <a href="https://finbarr.ca/">Finbarr</a> through his blogs and online reputation. <a href="https://www.tylerromero.com/">Tyler</a> sent me an excellent cold email with high-quality blog posts relating to my obvious, current areas of interest, and had meaningful open-source LLM contributions. Both have been excellent teammates (and friends), so I&#8217;m always happy to say the system works; it&#8217;s just intimidating.</p><p>Altogether, I&#8217;m very torn on the AI job market. It&#8217;s obviously brutal for junior members of our industry, it obviously feels short-sighted, it obviously comes with tons of opportunity costs, and so on. At the same time, it&#8217;s such a privilege to be able to contribute to such a meaningful and exciting technology. My grounding for hiring is still going to be a reliance on my instincts and humanity, and not to get too tied down with all the noise. Like most things, it just takes time and effort.</p><div><hr></div><p>Other posts in my &#8220;<a href="https://www.interconnects.ai/t/life">life thoughts</a>&#8221; series include the following. 
I send these to people when they ask me for career advice generally, as I don&#8217;t have time to give great individual responses:</p><ul><li><p>Apr 05, 2023: <a href="https://www.interconnects.ai/p/behind-the-curtain-ai">Behind the curtain: what it feels like to work in AI right now</a></p></li><li><p>Oct 11, 2023: <a href="https://www.interconnects.ai/p/ai-research-job-market">The AI research job market shit show (and my experience)</a></p></li><li><p>Oct 30, 2024: <a href="https://www.interconnects.ai/p/why-i-build-open-language-models">Why I build open language models</a></p></li><li><p>May 14, 2025: <a href="https://www.interconnects.ai/p/my-path-into-ai">My path into AI</a></p></li><li><p>Jun 06, 2025: <a href="https://www.interconnects.ai/p/how-i-write">How I Write</a></p></li><li><p>Oct 25, 2025: <a href="https://www.interconnects.ai/p/burning-out">Burning out</a></p></li></ul><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Some companies hire heavily out of Twitter, some hire from communities such as GPU Mode or NanoGPT speedrunning.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Arcee AI goes all-in on open models built in the U.S.]]></title><description><![CDATA[Interconnects interview #16 to celebrate the release of Trinity Large.]]></description><link>https://www.interconnects.ai/p/arcee-ai-goes-all-in-on-open-models</link><guid isPermaLink="false">https://www.interconnects.ai/p/arcee-ai-goes-all-in-on-open-models</guid><dc:creator><![CDATA[Nathan Lambert]]></dc:creator><pubDate>Tue, 27 Jan 2026 22:47:24 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/184831986/e14bed2a5ec52e9d904e9aef0757ac11.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p><a href="https://www.arcee.ai/">Arcee AI</a> is the startup I&#8217;ve found to be taking the most real 
approach to monetizing their open models. With a bunch of past experience (and revenue) in post-training open models for specific customer domains, they realized they needed to both prove themselves and fill a niche by pretraining larger, higher-performance open models built in the U.S.A. They&#8217;re a group of people who are most eagerly answering my call to action for <a href="https://atomproject.ai/">The ATOM Project</a>, and I&#8217;ve quickly become friends with them.</p><p>Today, they&#8217;re releasing their flagship model &#8212; <a href="https://www.arcee.ai/blog/trinity-large">Trinity Large</a> &#8212; as the culmination of this pivot. In anticipation of this release, I sat down with their CEO Mark McQuade, CTO Lucas Atkins, and pretraining lead Varun Singh for a wide-ranging conversation on:</p><ul><li><p>The state (and future) of open vs. closed models,</p></li><li><p>The business of selling open models for on-prem deployments,</p></li><li><p>The story of Arcee AI &amp; going &#8220;all-in&#8221; on this training run,</p></li><li><p>The ATOM project,</p></li><li><p>Building frontier model training teams in 6 months,</p></li><li><p>and other great topics. I really loved this one, and think you will too.</p></li></ul><p>The blog post linked above and <a href="https://github.com/arcee-ai/trinity-large-tech-report/blob/main/Arcee%20Trinity%20Large.pdf">technical report</a> have many great details on training the model that I&#8217;m still digging into. One of the great things Arcee has been doing is releasing &#8220;true base models,&#8221; which don&#8217;t contain any SFT data or learning rate annealing. The Trinity Large model, an MoE with 400B total and 13B active parameters trained on 17 trillion tokens, is the first publicly shared training run at this scale on B300 Nvidia Blackwell machines. </p><p>As a preview, they shared the scores for the in-progress reasoning model relative to the who&#8217;s-who of today&#8217;s open models. 
It&#8217;s a big step for open models built in the U.S. to scale up like this. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LJha!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3394c20a-e9b8-4dd4-8acd-406871eb9a02_1465x961.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LJha!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3394c20a-e9b8-4dd4-8acd-406871eb9a02_1465x961.png 424w, https://substackcdn.com/image/fetch/$s_!LJha!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3394c20a-e9b8-4dd4-8acd-406871eb9a02_1465x961.png 848w, https://substackcdn.com/image/fetch/$s_!LJha!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3394c20a-e9b8-4dd4-8acd-406871eb9a02_1465x961.png 1272w, https://substackcdn.com/image/fetch/$s_!LJha!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3394c20a-e9b8-4dd4-8acd-406871eb9a02_1465x961.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LJha!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3394c20a-e9b8-4dd4-8acd-406871eb9a02_1465x961.png" width="1456" height="955" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3394c20a-e9b8-4dd4-8acd-406871eb9a02_1465x961.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:955,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:207382,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.interconnects.ai/i/184831986?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3394c20a-e9b8-4dd4-8acd-406871eb9a02_1465x961.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LJha!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3394c20a-e9b8-4dd4-8acd-406871eb9a02_1465x961.png 424w, https://substackcdn.com/image/fetch/$s_!LJha!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3394c20a-e9b8-4dd4-8acd-406871eb9a02_1465x961.png 848w, https://substackcdn.com/image/fetch/$s_!LJha!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3394c20a-e9b8-4dd4-8acd-406871eb9a02_1465x961.png 1272w, https://substackcdn.com/image/fetch/$s_!LJha!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3394c20a-e9b8-4dd4-8acd-406871eb9a02_1465x961.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>I won&#8217;t spoil all the details, so you can still listen to the podcast, but their section of the blog post on cost sets the tone well for the podcast, which is a very frank discussion on how and why to build open models:</p><blockquote><p>When we started this run, we had never pretrained anything remotely like this before.</p><p>There was no guarantee this would work. Not the modeling, not the data, not the training itself, not the operational part where you wake up, and a job that costs real money is in a bad state, and you have to decide whether to restart or try to rescue it.</p><p>All in&#8212;compute, salaries, data, storage, ops&#8212;we pulled off this entire effort for <strong>$20 million</strong>. 4 Models got us here in 6 months.</p><p>That number is big for us. 
It&#8217;s also small compared to what frontier labs spend just to keep the lights on. We don&#8217;t have infinite retries.</p></blockquote><p>Once I post this, I&#8217;m going to dive right into trying the model, and I&#8217;m curious what you find too.</p><p>Listen on <a href="https://podcasts.apple.com/us/podcast/interconnects-audio/id1719552353">Apple Podcasts</a>, <a href="https://open.spotify.com/show/2UE6s7wZC4kiXYOnWRuxGv">Spotify</a>, <a href="https://www.youtube.com/@interconnects">YouTube</a>, and <a href="https://www.interconnects.ai/podcast">wherever you get your podcasts</a>. 
For other Interconnects interviews, <a href="https://www.interconnects.ai/t/interviews">go here</a>.</p><div id="youtube2-H23MG_Iym58" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;H23MG_Iym58&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/H23MG_Iym58?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2>Guests</h2><p><strong>Lucas Atkins</strong> &#8212; <a href="https://x.com/latkins">X</a>, <a href="https://www.linkedin.com/in/lucas-atkins-2892482b6">LinkedIn</a> &#8212; CTO; leads pretraining/architecture, wrote the Trinity Manifesto.</p><p><strong>Mark McQuade</strong> &#8212; <a href="https://x.com/MarkMcQuade">X</a>, <a href="https://www.linkedin.com/in/mark-mcquade/">LinkedIn</a> &#8212; Founder/CEO; previously at Hugging Face (monetization), Roboflow. Focused on shipping enterprise-grade open-weight models + tooling.</p><p><strong>Varun Singh</strong> &#8212; <a href="https://www.linkedin.com/in/varun-singh-cs/">LinkedIn</a> &#8212; pretraining lead.</p><p>Most of this interview is conducted with Lucas, but Mark and Varun make great additions at the right times.</p><h2>Links</h2><p>Core:</p><ul><li><p>Trinity Large (400B total, 13B active) <a href="https://huggingface.co/collections/arcee-ai/trinity-large">collection</a>, <a href="https://www.arcee.ai/blog/trinity-large">blog post</a>. 
Instruct model today, reasoning models soon.</p></li><li><p><a href="https://huggingface.co/arcee-ai/Trinity-Mini">Trinity Mini</a>, 26B total, 3B active (<a href="https://huggingface.co/arcee-ai/Trinity-Mini-Base">base</a>, including a released <a href="https://huggingface.co/arcee-ai/Trinity-Mini-Base-Pre-Anneal">pre-anneal checkpoint</a>)</p></li><li><p><a href="https://huggingface.co/arcee-ai/Trinity-Nano-Preview">Trinity Nano Preview</a>, 6B total, 1B active (<a href="https://huggingface.co/arcee-ai/Trinity-Nano-Base">base</a>)</p></li><li><p>Open Source Catalog: <a href="https://www.arcee.ai/open-source-catalog">https://www.arcee.ai/open-source-catalog</a></p></li><li><p>API <a href="https://docs.arcee.ai/">Docs</a> and <a href="https://chat.arcee.ai/">Playground</a> (demo)</p></li><li><p>Socials: <a href="https://github.com/arcee-ai">GitHub</a>, <a href="https://huggingface.co/arcee-ai">Hugging Face</a>, <a href="https://x.com/arcee_ai">X</a>, <a href="https://www.linkedin.com/company/arcee-ai">LinkedIn</a>, <a href="https://www.youtube.com/channel/UCVtYdqkprxxKZyh7jlp9dFQ">YouTube</a></p></li></ul><p>Trinity Models:</p><ul><li><p>Trinity models page: <a href="https://www.arcee.ai/trinity">https://www.arcee.ai/trinity</a></p></li><li><p><strong>The Trinity Manifesto</strong> (<em>I recommend you read it</em>): <a href="https://www.arcee.ai/blog/the-trinity-manifesto">https://www.arcee.ai/blog/the-trinity-manifesto</a></p></li><li><p><a href="https://huggingface.co/collections/arcee-ai/trinity">Trinity HF collection</a> &#8212; (<a href="https://huggingface.co/arcee-ai/Trinity-Mini">Trinity Mini</a> &amp; <a href="https://huggingface.co/arcee-ai/Trinity-Nano-Preview">Trinity Nano Preview</a>)</p></li></ul><p>Older models:</p><ul><li><p><a href="https://huggingface.co/arcee-ai/AFM-4.5B">AFM-4.5B</a> (and <a href="https://huggingface.co/arcee-ai/AFM-4.5B-Base">base model</a>) &#8212; their first open model, pretrained in-house (<a 
href="https://www.arcee.ai/blog/announcing-the-arcee-foundation-model-family">blog post</a>).</p></li><li><p>Five open-weights models (<a href="https://www.arcee.ai/blog/releasing-five-new-open-weights-models">blog</a>): three production models previously exclusive to their SaaS platform plus two research models, released as they shifted focus to AFM &#8212; <a href="https://huggingface.co/arcee-ai/Arcee-SuperNova-v1">Arcee-SuperNova-v1</a>, <a href="https://huggingface.co/arcee-ai/Virtuoso-Large">Virtuoso-Large</a>, <a href="https://huggingface.co/arcee-ai/Caller">Caller</a>, <a href="https://huggingface.co/arcee-ai/GLM-4-32B-Base-32K">GLM-4-32B-Base-32K</a>, <a href="https://huggingface.co/arcee-ai/Homunculus">Homunculus</a></p></li></ul><p>Open source tools:</p><ul><li><p><a href="https://github.com/arcee-ai/mergekit">MergeKit</a> &#8212; model merging toolkit (<a href="https://www.arcee.ai/blog/mergekit-returns-to-its-roots">LGPL license return</a>)</p></li><li><p><a href="https://github.com/arcee-ai/DistillKit">DistillKit</a> &#8212; knowledge distillation library</p></li><li><p><a href="https://github.com/arcee-ai/EvolKit">EvolKit</a> &#8212; synthetic data generation via evolutionary methods</p></li></ul><p>Related:</p><ul><li><p><a href="https://www.datologyai.com/blog/arcee-case-study">Datology case study w/ Arcee</a></p></li></ul><h2>Chapters</h2><ul><li><p>00:00:00 Intro: Arcee AI, Trinity Models &amp; Trinity Large</p></li><li><p>00:08:26 Transitioning a Company to Pre-training</p></li><li><p>00:13:00 Technical Decisions: Muon and MoE</p></li><li><p>00:18:41 Scaling and MoE Training Pain</p></li><li><p>00:23:14 Post-training and RL Strategies</p></li><li><p>00:28:09 Team Structure and Data Scaling</p></li><li><p>00:31:31 The Trinity Manifesto: US Open Weights</p></li><li><p>00:42:31 Specialized Models and Distillation</p></li><li><p>00:47:12 Infrastructure and Hosting 400B</p></li><li><p>00:50:53 Open Source as a 
Business Moat</p></li><li><p>00:56:31 Predictions: Best Model in 2026</p></li><li><p>01:02:29 Lightning Round &amp; Conclusions</p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.interconnects.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.interconnects.ai/subscribe?"><span>Subscribe now</span></a></p><h2>Transcript</h2><p><em>Transcript generated with ElevenLabs Scribe v2 and cleaned with Claude Code using Opus 4.5.</em></p><p><strong>00:00:06 Nathan Lambert:</strong> I&#8217;m here with the Arcee AI team. I personally have become a bit of a fan of Arcee, &#8216;cause I think what they&#8217;re doing in trying to build a company around building open models is a valiant and very reasonable way to do this, &#8216;cause nobody really has a good business plan for open models, and you just gotta try to figure it out, and you gotta build better models over time. And like open-source software, building in public, I think, is the best way to do this. So this kind of gives you the wheels to get the, um... You get to hit the ground running on whatever you&#8217;re doing. And this week, they&#8217;re launching their biggest model to date, which I&#8217;m very excited to see more kind of large-scale MoE open models. I think we&#8217;ve seen, I don&#8217;t know, at least ten of these from different providers from China last year, and it&#8217;s obviously a thing that&#8217;s gonna be international, and a lot of people building models, and the US kind of, for whatever reason, has fewer people building, um, open models here. And I think that wherever people are building models, they can stand on the quality of the work. But whatever. I&#8217;ll stop rambling. I&#8217;ve got Lucas, Mark, um, Varun on the, on the phone here. I&#8217;ve known some of them, and I consider us friends. 
We&#8217;re gonna kind of talk through this model, talk through building open models in the US, so thanks for hopping on the pod.</p><p><strong>00:01:16 Mark McQuade:</strong> Thanks for having us.</p><p><strong>00:01:18 Lucas Atkins:</strong> Yeah, yeah. Thanks for having us. Excited.</p><p><strong>00:01:20 Varun Singh:</strong> Nice to be here.</p><p><strong>00:01:20 Nathan Lambert:</strong> What- what should people know about this Trinity Large? What&#8217;s the actual name of this model? Like, how stoked are you?</p><p><strong>00:01:29 Lucas Atkins:</strong> So to- yeah.</p><p><strong>00:01:29 Nathan Lambert:</strong> Like, are you, like, finally made it?</p><p><strong>00:01:32 Lucas Atkins:</strong> Uh, you know, we&#8217;re recording this a little bit before release, so it&#8217;s still like, you know, getting everything buttoned up, and inference going at that size is always a challenge, but we&#8217;re-- This has been, like, a six-month sprint since we released our first dense model, which is 4.5B, uh, in, in July of last year, 2025. So, um, it&#8217;s always been in service of releasing large. I- it&#8217;s a 400B, um, thirteen billion active sparse MoE, and, uh, yeah, we&#8217;re, we&#8217;re super excited. This has just been the entire thing the company&#8217;s focused on the last six months, so really nice to have kind of the fruits of that, uh, start to, start to be used by the people that you&#8217;re building it for.</p><p><strong>00:02:16 Nathan Lambert:</strong> Yeah, I would say, like, the realistic question: do you think this is landing in the ballpark of the models in the last six months? Like, that has to be what you shop for, is there&#8217;s a high bar- ... of open models out there and, like, on what you&#8217;re targeting. Do you feel like these hit these, and somebody that&#8217;s familiar, or like MiniMax is, like, two thirty total, something less. I, I don&#8217;t know what it is. It&#8217;s like ten to twenty B active, probably. 
Um, you have DeepSeeks in the six hundred range, and then you have Kimi at the one trillion range. So this is still, like, actually on the smaller side of some of the big MoEs- ... that people know, which is, like, freaking crazy, especially you said 13B active. It&#8217;s, like- ... very high on the sparsity side. So I don&#8217;t actually know how you think about comparing it among those. I was realizing that MiniMax is smaller, doing some data analysis. So I think that it&#8217;s like, actually, the comparison might be a little bit too forced, where you just have to make something that is good and figure out if people use it.</p><p><strong>00:03:06 Lucas Atkins:</strong> Yeah, I mean, if, if from raw compute, we&#8217;re, we&#8217;re roughly in the middle of MiniMax and then GLM 4.5, as far as, like, size. Right, GLM&#8217;s, like, three eighty, I believe, and, and thirty-four active. Um, so it-- you know, we go a little bit higher on the total, but we, we cut the, uh, the active in half. Um, it was definitely tricky when we decided we wanted to do this. Again, it was July when... It, it was July when we released, uh, the dense model, and then we immediately knew we wanted to kind of go, go for a really big one, and the, the tricky thing with that is knowing that it&#8217;s gonna take six months. You, you can&#8217;t really be tr-- you can&#8217;t be building the model to be competitive when you started designing it, because, you know, that, obviously, a lot happens in this industry in six months. So, um, when we threw out pre-training and, and a lot of our targets were the GLM 4.5 base model, um, because 4.6 and 4.7 have been, you know, post-training on top of that. Um, and, like, in performance-wise, it&#8217;s well within where we want it to be. Um, it&#8217;s gonna be... Technically, we&#8217;re calling it Trinity Large Preview because we just have a whole month of extra RL that we want to do. 
Um- But-</p><p><strong>00:04:29 Nathan Lambert:</strong> I&#8217;ve been, I&#8217;ve been there.</p><p><strong>00:04:31 Lucas Atkins:</strong> Yeah, yeah. But i- you know, we&#8217;re, we&#8217;re in the, um, you know, mid-eighties on AIME 2025, uh, GPQA Diamonds, uh, seventy-five, um, at least with the checkpoint we&#8217;re working with right now. We&#8217;re still doing more RL on it, but, um, you know, MMLU Pro, uh, eighty-two. So we&#8217;re, we&#8217;re, we&#8217;re happy. We&#8217;re really-- Like, for it being our first big run, like, just getting it trained was, was an extreme accomplishment, but then for it to actually be, like, a, a genuinely useful model is a, a cherry on top.</p><p><strong>00:05:03 Nathan Lambert:</strong> Yeah, let&#8217;s go big picture. Uh, like, let&#8217;s recap. We have all of the... We have this full trinity of models. I think that there&#8217;s a fun note. Uh, did I put it in this doc? Yeah, on Nano Preview, which was the smallest- ... you&#8217;re, like, charming and unstable. The model card&#8217;s really funny. Um, ChatGPT, doing deep research on this, I was like, ChatGPT Pro just tagged next to it, &#8220;charming and unstable.&#8221; And I was like: Is this a hallucination? And then in the model card, you have, like: &#8220;This is a chat-tuned model with a delightful personality and charm we think users will love. Uh, we think- ... it&#8217;s pushing the boundaries, eight hundred million, um, active parameter, and as such, may be unstable in certain use cases.&#8221; This is at the smallest scale- ... which is like, I appreciate saying it as it is, and that&#8217;ll come up multiple times in the conversation. And then you have Mini, which is like, um, I think it was, like, 1B active, 6B total type thing. In my-- I, I don&#8217;t have it, the numbers right in front of me. I have it somewhere else. 
Um-</p><p><strong>00:05:52 Lucas Atkins:</strong> Yeah, Nano was, Nano was the 6B, uh, 1 active.</p><p><strong>00:05:55 Nathan Lambert:</strong> Oh, yeah, yeah.</p><p><strong>00:05:55 Lucas Atkins:</strong> And then, and the Mini was twenty-six, 3B active.</p><p><strong>00:05:58 Nathan Lambert:</strong> Yeah. So, like-</p><p><strong>00:06:00 Lucas Atkins:</strong> Um, yeah.</p><p><strong>00:06:00 Nathan Lambert:</strong> -are these based on more of, like, you need to build out your training chops, or are you trying to fill needs that you&#8217;ve-... heard from community, and like, I think for context, previously, your first open model was a base and post-trained model, which was Arcee 4.5B, which was a dense model- -which people like. And prior to that, you had, like, a long list of, like, post-training fine tunes that you had released. So before that, it was like a post-training shop, and I think that kind of history is i- important to fill in, &#8216;cause I think most people-- a lot of people are gonna meet you for the first time listening to this.</p><p><strong>00:06:34 Lucas Atkins:</strong> Yeah, it, it, um, we chose those sizes for Mini and Nano, uh, specifically Mini, um, the 26B, 3B Active, because we wanted to de-risk, uh, large. Like, th- this has all been in service of getting to a model of, of, you know, the 400B class. So, um, we, you know, learned from doing the original 4.5B, that you might have everything on paper that you need to train a model, but i- inevitably, there&#8217;s tremendous, you know, difficulties that come up, and, um, it, it&#8217;s-- we, we definitely knew we wanted to make sure that we, you know, solved some of... E- especially when it came to just doing an MoE model performance, uh, you know, like a, like an efficient, fast train of an MoE. So, um, we thought that that was a good ground where we could, you know, it wasn&#8217;t crazy expensive, uh, but gave us a lot of data, uh, going into large. 
And then Nano just came about because we had some extra compute time, and we really want to do more research on, like, smaller models that are very deep. Um, and we hadn&#8217;t really seen that in an MoE before, so that one was very much we started training it, and then it, you know, early benchmarks were good, so we said, &#8220;Well, we&#8217;ll just do the whole dataset.&#8221; Um, and, uh, but most of the love for those releases went into, to Mini. So I, I definitely think that long term, uh, from an ROI perspective, the smaller models are going to be where we shine, just because there&#8217;s a tremendous amount of, of cost savings a company can get from, from optimizing on a, on a smaller model. Um, but, but we, uh, w- we&#8217;re definitely gonna be trying to push the, the large frontier, too.</p><p><strong>00:08:26 Nathan Lambert:</strong> Yeah. Um, I&#8217;d like to kind of double-click on training before going back to the small model that&#8217;s useful for companies, &#8216;cause we&#8217;re gonna have-- we&#8217;re gonna end up talking for, like, twenty minutes plus about open ecosystem. So I kind of am curious, like, philosophically, how your company feels about, like, sharing scientific details. So if I ask you, like, what are the things you&#8217;re technically most excited about in the model, or, like, what are the pain points? Like, uh, like, are you willing to talk about these things? Like, I- Do you feel like it&#8217;s kind of orthogonal to the company? Like, I feel like a lot of it is just, like, things that happen. I think your framing of all of this is in service of getting the big model going. And particularly, of, like, you have to be thinking about your model as landing in six months, is probably... Like, for people not training models, it&#8217;s hard to think about, &#8216;cause even I- ... 
like, I&#8217;m thinking about trying to refresh our post-training stack for OLMo 3, and I&#8217;m like, the thinking model, the, um, we are pretty SFT heavy right now, and it makes it not very dynamic in terms of the thinking time. But it&#8217;s just like, I can&#8217;t see people deploying this model, or probably will have a hard time fine-tuning it. And it&#8217;s like to think about where tool use models are going in six months, like, seems pretty hard. Um, it&#8217;s a very hard task to do, so it takes a lot of gumption to actually set out and do it. So I, I would just appreciate the framing, kind of self-reflecting on what I go through. So if you have anything that you think was, like, particularly hard to actually land the six-month outlook, because you use Muon as an optimizer, or is it Muon? And some of these things. I think the data, it&#8217;s well known that Datology is cranking a lot of this, and you probably provide-- I think of it as like you&#8217;re kind of driving and working with these partners, and I&#8217;m sure you provide a lot of feedback on what&#8217;s working and what&#8217;s not. So- ... anything you&#8217;re willing to share, I think it&#8217;s useful.</p><p><strong>00:10:08 Lucas Atkins:</strong> Uh, I, I think, um, I mean, on the data side, like Datology, I-- at least for these models, that, that partnership has very much been almost an extension of our own research team. Like, we&#8217;ve worked very closely with them, and, um, obviously, our model&#8217;s doing well, you know, i- is, is, is good for them. So, um, but it, it-- there was definitely, you know, and you know this better than most, like, small-scale ablations, when you throw them at scale, sometimes, you know, uh, the-- i- it doesn&#8217;t always turn out how you want. So there was quite a lot of iterating there to at least get the dataset we used for Large. Um, I, I would say that as far as looking out six months and then figuring out how we wanted to... 
Obviously, the big one was compute. We don&#8217;t, um, you know, we, we never raised as, like, a foundation model company, so we&#8217;ve ne- we haven&#8217;t signed massive commits for, you know, thousands of GPUs before. Um, we didn&#8217;t have a, a, a massive cluster that was always active, uh, for a lot of our post-training. So, for what came before, um, you know, we had sixty-four, uh, H100s, that was pretty sufficient for that kind of work, but obviously, this necessitated quite a bit more. Um, but the first thing was-</p><p><strong>00:11:29 Nathan Lambert:</strong> That&#8217;s still less than people would guess. Like, you&#8217;re releasing models- ... that weren&#8217;t like, your models weren&#8217;t catching national news, but people in the community knew about them. And, like, uh, i- I think of, like, Moondream when I think about that. Like, vik has- ... such little compute, and he puts it to such good use. Like, you, like, see how successful he is? And he tells you that he has, I don&#8217;t know, thirty... Like, l- it might be, like, sixty-four GPUs. Like, uh- ... there&#8217;s, uh, uh, that&#8217;s a whole separate conversation on building- ... actual good ML output on little compute. I, I should ta- I should chat with vik about this, but aside</p><p><strong>00:12:03 Lucas Atkins:</strong> No, it&#8217;s, it is-- I think it was... Yeah, it, it, it was very much a gift going into the pre-training side because-... we were kind of already thinking, &#8220;All right, how do we do the mu- you know, the most with the, the least amount of compute?&#8221; But, um, you know, we-- it took us quite a while to get the cluster that we have been training large on, which is two thousand forty-eight B300s. Um, and once we figured out when we were going to get that, get access to that cluster, everything else kind of became clear as far as, like, timelines for Mini and Nano and, and when we wanted to do that. 
Uh, obviously, you know, five hundred and twelve H100s was easier to come across, um, for Mini and Nano. So once we figured that out, um, it really became, uh, this game of, okay, how can we find, like, the best research on the topic of, of pre-training, and what is kind of... What are the, the, the papers and publications that are coming out, um, that have enough potential and enough precedent, either because, uh, another lab used them, it comes from a reputable team, uh, the ablations and the, the evaluation setup, like in the paper, was sufficient enough to give us confidence. Uh, and then we basically spent, I don&#8217;t know, it was probably about two months just figuring out what we wanted our architecture to be for the MoE, then figuring out, okay, now that that&#8217;s what we want to do, how do we implement all of that in the actual training pipeline? Uh, how can we-- you know, at that time, there had been many people who&#8217;d done Muon, but, um, for post-training, and, and then other-- some Chinese labs had used it, but there wasn&#8217;t, like, a widely available distributed Muon, um, to do it at that scale.</p><p><strong>00:13:54 Nathan Lambert:</strong> What do you think that, like, looks like in decision-making? &#8216;Cause that seems like a risky decision, if you ask me. I think for one, the ti-</p><p><strong>00:14:00 Lucas Atkins:</strong> Muon?</p><p><strong>00:14:00 Nathan Lambert:</strong> ... the timing, the, the, like, timing sharing that you&#8217;re saying is good. Like, you said this for two months, and then, like... But, like, even Muon is like, that&#8217;s a bet that would even take-- like, somewhere like AI2, that would take some serious evidence to go with it. We would want to ablate it. So like- ... 
on a single track, it&#8217;s like y- you had probably had a process for becoming fairly confident in it then.</p><p><strong>00:14:24 Lucas Atkins:</strong> It- yes, but it, it was also, like, Kimi had, had just come out, and we knew that that one used Muon, and so we knew that it, at least, if implemented correctly, could deliver a good model. There weren&#8217;t outstanding ablations done around like... You know, there wasn&#8217;t a Kimi scale model done with Adam, and then compared to Muon and see the difference. But, um, that at least gave us enough confidence that if-</p><p><strong>00:14:50 Nathan Lambert:</strong> What does Muon give you? Does it give you, like, memory saving, uh, in-</p><p><strong>00:14:55 Lucas Atkins:</strong> No, it&#8217;s actually a little bit more memory. It&#8217;s, it&#8217;s, it&#8217;s mostly-</p><p><strong>00:14:58 Varun Singh:</strong> It&#8217;s, uh-</p><p><strong>00:14:58 Lucas Atkins:</strong> ... like the loss converges a bit quicker.</p><p><strong>00:15:00 Varun Singh:</strong> It&#8217;s, it&#8217;s less memory, actually. It&#8217;s, uh, uh, only one momentum buffer instead of Adam&#8217;s two, uh, beta buffers, and then it&#8217;s also better convergence.</p><p><strong>00:15:10 Nathan Lambert:</strong> Okay. So it&#8217;s, like, mostly designed around convergence, and then I know the math is different, which is where this momentum term changes.</p><p><strong>00:15:15 Lucas Atkins:</strong> Well, it, it kind of came out... I mean, it had its, its, its big, you know, uh, explosion of popularity in the kind of nanoGPT speedrunning community. So it was kind of all built around converging to a certain, you know, validation loss faster, and, uh, that, that, that was, um... As for why we chose it as opposed to Adam, we&#8217;d used Adam for 4.5b, uh, but we also knew that if we wanted to move this fast, that we were going to have to make some pretty big bets, educated. 
Um, but, but still, we would have to make some, some, some risky decisions, um, beyond just, you know, training in general. So, um, there were a few, and Muon, which we went with, uh, I think was, was one of our bigger bets. Uh, we ended up not doing, like, multi-token prediction or, or, or FP8 because we were throwing so many new things into the run at once, um, that-</p><p><strong>00:16:12 Nathan Lambert:</strong> Do these apply for-</p><p><strong>00:16:12 Lucas Atkins:</strong> ... if something were to go wrong-</p><p><strong>00:16:13 Nathan Lambert:</strong> um, Mini and Nano? Are those also Muon, or are those- ... Adam as well? Okay, so then you- ... you get some de-risk from that. Do you know off the top of your head how many days it took to train each of those? Like, a, a good-</p><p><strong>00:16:25 Lucas Atkins:</strong> Uh-</p><p><strong>00:16:25 Nathan Lambert:</strong> ... ballpark for people, before-</p><p><strong>00:16:27 Lucas Atkins:</strong> Yeah, so-</p><p><strong>00:16:28 Nathan Lambert:</strong> going into the bigger run.</p><p><strong>00:16:29 Lucas Atkins:</strong> So, so Mini, uh, so Nano, it was five hundred and twelve H200s, uh, took a little over thirty days. Um, and then Mini was about forty-five days.</p><p><strong>00:16:45 Nathan Lambert:</strong> Okay. I think another thing- ... off the top of my head is I know that, like, an OLMo 1B dense would take us, like, eleven days on a hundred and twenty-eight H100s for a dense model. So, like, sixteen. So, like, the numbers- ... just go up from there. &#8216;Cause then it&#8217;s like the question is like, I&#8217;m guessing i- if those are forty-five days, and then you have-- you up the number of GPUs, it&#8217;s gonna be like a similar amount of time, or forty days for the big model, but much more stressful.</p><p><strong>00:17:16 Lucas Atkins:</strong> Yeah, the big model was... 
But again, that was- we knew that we, we wanted- we felt confident that we could deliver a competitive and exciting model in January 2026. Like, we knew that it would-- we could... Who knows kind of where the research and what, what class and, and, and, and skill and performance of model is gonna come out in the next three months? Um, so we also knew that we really wanted to land sometime in January, and that&#8217;s also why we also took- we went with B300s, even though definitely the largest public train of that size on B300s and, and the, um, you know, a lot of the software was not-- did not have, like, out-of-the-box B300 support. It was the only way we were gonna be able to train a model of this size in-</p><p><strong>00:18:06 Nathan Lambert:</strong> Did you have to do this? Did you have to implement the... like, help solve version issues or other issues on B300s? &#8216;Cause I&#8217;ve heard that-</p><p><strong>00:18:13 Lucas Atkins:</strong> W-</p><p><strong>00:18:14 Nathan Lambert:</strong> ... the rollout has been rough.</p><p><strong>00:18:16 Lucas Atkins:</strong> We had to add-... a, a bit. There, there were a couple days where the, the data center had to take it offline to implement some bug fixes. It was, it was definitely, like, a very cool experience being on the bleeding edge, but, um, also, like, a little frightening &#8216;cause you just know, like, &#8220;Oh, we&#8217;re not getting the most out of these that we possibly could.&#8221; So, um, a little bit of both.</p><p><strong>00:18:40 Nathan Lambert:</strong> Uh, was your final training run stable, or did you have to do interventions through it?</p><p><strong>00:18:46 Lucas Atkins:</strong> Uh, it was very stable, actually. Uh, it took-- the beginning of it was not. The, the, the first ten days were absolute, um... 
It, it would start very well and, and looked, you know, uh, the dynamics and the logs, and the graphs looked very similar to Mini and Nano, and then after, uh, around a trillion tokens, it- the- we- you know, you&#8217;d get collapsing, experts would start to go crazy. Uh, part of this is just, again, we are very sparse compared to what you, you, you have. So, um, you know, four hundred billion total, um, thirteen billion active, two hundred and fifty-six experts. Like, it was, it was-</p><p><strong>00:19:26 Nathan Lambert:</strong> Did you do a, uh, expert routing loss or some sort of balancing loss?</p><p><strong>00:19:30 Lucas Atkins:</strong> Yeah. Yeah, yeah. Yeah.</p><p><strong>00:19:32 Varun Singh:</strong> We did, um, we used DeepSeek&#8217;s, uh... We, we modified DeepSeek&#8217;s auxiliary-loss-free, um, uh, load balancing with our own, like, uh, with some tweaks, and then we also added a sequence loss like they, uh, did as well.</p><p><strong>00:19:47 Nathan Lambert:</strong> Uh, was the auxiliary-loss-free one from DeepSeek V3, or was that a later model?</p><p><strong>00:19:51 Varun Singh:</strong> That was V3.</p><p><strong>00:19:52 Lucas Atkins:</strong> It was V3.</p><p><strong>00:19:52 Varun Singh:</strong> They did a separate paper on it as well. Yeah.</p><p><strong>00:19:55 Nathan Lambert:</strong> Yeah. Yeah, that makes sense. I think a lot of people have derived from there. Um, have you- ... had issues on post-training as well? So I have a theory that the new algorithms we&#8217;re getting from the Chinese labs, like GSPO and SysPO, are primarily for problems that you solve when you have big MoEs and you have expert problems when trying to do the RL. And that&#8217;s the whole reason that, like, I think our very serious AI2 RL setup, like, we&#8217;re doing it on dense models, and we&#8217;re just like, &#8220;It&#8217;s fine. 
We don&#8217;t have this big clipping problem, and as much like we don&#8217;t have as much of a need to get the batch size as big to ac- activate all the experts.&#8221; So you&#8217;re saying you have so many experts and so much sparsity, that potentially sounds like you&#8217;re making RL harder.</p><p><strong>00:20:36 Lucas Atkins:</strong> Um, yes. I will also... I will say that from just, like, a purely post-training side, we added as much as we po- we used- we... So our code base started from TorchTitan. We&#8217;ve had to make a ton of modifications to it to get it where we need it to be, but that was an excellent base. And from one of the bigger learnings from Mini and Nano was treating, uh, at least the SFT side of it, as a s- as a separate phase. Um, &#8216;cause with, with Mini and Nano, we finished the pre-training, we did context extension, then we took those and then ran those on, like, the sixty-four H100s we usually would do post-training on. Um, that presented a lot of challenges, uh, with the MoEs. They, they really... And that&#8217;s kind of been a thing in the open space, is post-training MoEs, like, really, um, can be frustrating, even for SFT. So for Large, we added, uh, like, fine-tuning directly to TorchTitan, um, and did it all on the same cluster. So, um, from a performance standpoint, like, SFT was very, um... actually ended up being totally different.</p><p><strong>00:21:42 Nathan Lambert:</strong> What is the actual difference between the q- the, the implementations then? Is it just kinda like you end up with different batch sizes and parallelism and stuff? Like why-</p><p><strong>00:21:50 Lucas Atkins:</strong> Uh, I mean, we ended up, we... Yeah, we ended up needing to get it to do really, like, to get context parallelism really well, really good, &#8216;cause we&#8217;re obviously going at a higher sequence length, and then, um, just adding the proper loss masking. 
It ended up being a relatively easy implementation, especially &#8216;cause we did all the pre-processing outside of TorchTitan.</p><p><strong>00:22:13 Nathan Lambert:</strong> Interesting.</p><p><strong>00:22:14 Lucas Atkins:</strong> And then on the RL side, I would say it didn&#8217;t present itself as significantly harder than Mini and Nano. However, that many GPUs does, so we didn&#8217;t end up using two thousand of the B300s for that. That ended up being a thousand; we just split the nodes in half.</p><p><strong>00:22:39 Nathan Lambert:</strong> Yeah. That makes sense.</p><p><strong>00:22:40 Varun Singh:</strong> On the dense model side of things, you mentioned that you didn&#8217;t need to use all the tricks and stuff. I think MoEs are just, in general, harder to RL, but I think it&#8217;s also because of the KL mismatch between trainer and inference engine, right? Where sometimes the inference engine can pick different experts compared to the trainer when you do a forward pass on the same tokens. So I think there is definitely some inherent instability with RL on MoEs.</p><p><strong>00:23:13 Nathan Lambert:</strong> Yeah, that makes sense. Okay, another question of, like, how much do you want to say? How do you feel about the state of public post-training recipes? I feel like there&#8217;s so little out there, and there&#8217;s an opportunity to be seen as technical leaders by sharing more of what you&#8217;re doing. &#8216;Cause I feel like we&#8217;ve seen for years how complicated things can be; we see this from the likes of Llama, which has these really complicated recipes. 
But at the same time, I feel like just executing on a simpler recipe can get pretty close. I just feel currently unsatisfied with how much I know about what the actual core trade-offs of doing post-training well are. And I think you could do a lot with SFT, but there&#8217;s definitely, in this RL regime, more trepidation about kind of narrowing your model to either downstream use or being able to do this multi-week RL run where you get the most performance.</p><p><strong>00:24:06 Lucas Atkins:</strong> Yeah, I mean, since RL has become such a pivotal part of the process beyond what, you know, DPO and kind of your typical RLHF were in the past, like, we used to get quite sophisticated with how we would do SFT and even our RL. We obviously make MergeKit, so we utilized merging, and we used to do a lot of distillation to eke out as much performance as we could. Now that RL is such a massive part of the entire post-training stack, I have almost reverted us to just really solid but simple SFT. Like, our post-training data set for Trinity Large is two hundred and thirty billion tokens. It&#8217;s just a really, really large-</p><p><strong>00:25:09 Nathan Lambert:</strong> That&#8217;s ten X what we did. At least in SFT.</p><p><strong>00:25:10 Lucas Atkins:</strong> And even your ten X, like, that was before going at this scale and even kind of thinking about reasoning models. Like, our largest SFT before that was five billion tokens; we&#8217;d do, like, three epochs, but it was, like, five billion tokens, so-</p><p><strong>00:25:28 Nathan Lambert:</strong> Our non-reasoning model is, like, another ten X. 
So, like, our latest instruct model is, like, two billion.</p><p><strong>00:25:34 Lucas Atkins:</strong> Yeah, which is already a lot, you know. So I&#8217;ve definitely... You know, simplicity&#8217;s key because it also makes debugging anything easier, and then we&#8217;re devoting a lot of that sophistication to the RL. Our RL part is really important. I do think the next phase of reinforcement learning for models of this scale is just scale. Okay, we went from, you know, twenty billion SFT tokens to two hundred and thirty; now we&#8217;re going from ten environments to a hundred. I think that really is where you&#8217;re gonna get the biggest benefit. I also think that&#8217;s why MiniMax and other players like GLM are so performant and just have that extra bit of usefulness that goes beyond what you see in the benchmarks: they&#8217;ve really embraced long-form RL. And so, to be quite frank, our RL pipeline&#8217;s rather... immature might be the wrong word. There&#8217;s definitely a lot more work we could do and a lot more work we need to do, but-</p><p><strong>00:26:43 Nathan Lambert:</strong> Have you started the tool use side of RL?</p><p><strong>00:26:46 Lucas Atkins:</strong> That-</p><p><strong>00:26:46 Nathan Lambert:</strong> Or are you mostly... Well, beyond, like, if you&#8217;re training on code, just verifying the code answer, which I don&#8217;t count yet as tool use. I would say search- and code-integrated reasoning is what I think is gonna be, like, minimum table stakes, but to do it well is really hard. That&#8217;s what I want to do. I want all of our models to have that this year. 
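The search- and code-integrated reasoning described here boils down to an inference loop that interleaves generation with tool execution. A toy sketch, where the tag format, the tool names, and the `generate` callable are all hypothetical stand-ins rather than any lab&#8217;s actual protocol:

```python
import re

def rollout_with_tools(generate, tools, prompt, max_calls=8):
    # Toy agent loop: the model emits <tool:name>args</tool> markers;
    # we run the named tool, append its result, and let the model continue.
    transcript = prompt
    for _ in range(max_calls):
        chunk = generate(transcript)
        transcript += chunk
        call = re.search(r"<tool:(\w+)>(.*?)</tool>", chunk, re.S)
        if call is None:  # no tool request: the model has finished its answer
            return transcript
        name, args = call.groups()
        result = tools[name](args)  # e.g. a search API or a code sandbox
        transcript += f"\n<result>{result}</result>\n"
    return transcript
```

Doing RL through a loop like this means assigning credit across every generate call in the trajectory, which is part of why doing it well is hard.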
For search, you probably have to have, like, a partner to do search, or just, like, illegally scrape Google, if you&#8217;re gonna serve this model to a customer, and it&#8217;s gonna... what? Go to Google, like, what?</p><p><strong>00:27:16 Lucas Atkins:</strong> Yeah. Yeah, no, I mean... Beyond really long-form, like, deep research, or even GPT-OSS style or GPT-5 style, where it&#8217;s doing a hundred tool calls before it gives you a response? Not there yet, but that is kind of... Once we get past the final kind of RL of Trinity Large and we look at where we go next, like, that is the next major hurdle, for sure, and it&#8217;s intimidating.</p><p><strong>00:27:56 Nathan Lambert:</strong> How big is your team? Like, how many people are spending the majority of their time on the model? And then I think we can start to wrap up technical talk and zoom out a bit to ecosystem and company strategy.</p><p><strong>00:28:09 Lucas Atkins:</strong> There&#8217;s thirteen at Arcee that are, like, every single day working on it. Yeah.</p><p><strong>00:28:16 Nathan Lambert:</strong> And I guess that&#8217;s a good number, because these people are talking about data, but there&#8217;s also, like, the whole data thing that&#8217;s coming from somewhere else. But also, somebody else that wanted to pre-train a model, like, they could just download the best fully open data set. And I don&#8217;t think it&#8217;s gonna be quite as good, particularly in the fact that, like, if you look at OLMo&#8217;s models, we don&#8217;t have a lot of tokens, so we need to acquire more tokens in the open still. But to, like, get to a number of thirteen, where some are spending a bit of time on data but there&#8217;s the whole data abstraction, is actually kind of nice for somebody that&#8217;s like... 
To do a serious modeling effort, you need to have this many people, I think.</p><p><strong>00:28:50 Lucas Atkins:</strong> It was-</p><p><strong>00:28:51 Nathan Lambert:</strong> It&#8217;s reasonable to me.</p><p><strong>00:28:52 Lucas Atkins:</strong> It was a good number. I mean, I would say it was helpful to be able to, you know... This was, like, how do we alleviate as many concerns as possible? Or how do we check off as many boxes, right? If we&#8217;re trying to do this in the shortest possible amount of time, we need to focus on what we&#8217;re good at, which is we&#8217;re pretty good at post-training, and how do we get to the point where we&#8217;re able to do that? Well, we have to have a pretty strong base model. How do we get a strong base model? We have to figure out how to do it, you know, efficiently across many, many GPUs, and then data&#8217;s extremely important, so getting a partner that could help us with that, and we could offload some of that. There ended up being, obviously, as you alluded to earlier, a lot of working with Datology and others to make sure that the data accomplished what we needed it to. I think that is gonna be an interesting... Now that we have Large and we&#8217;re looking at kind of going further, it&#8217;s like, okay, the pre-training data really has to be in service of what you wanna do in the post-training work.</p><p><strong>00:30:10 Nathan Lambert:</strong> How did you identify this?</p><p><strong>00:30:11 Lucas Atkins:</strong> Like-</p><p><strong>00:30:11 Nathan Lambert:</strong> Like... 
Did you identify this through Mini and Nano, or, like, how&#8217;d you come to think that this was so important?</p><p><strong>00:30:19 Lucas Atkins:</strong> Data in general, or just-</p><p><strong>00:30:20 Nathan Lambert:</strong> Or, like, this in the form of post-training-</p><p><strong>00:30:21 Lucas Atkins:</strong> ... of optimizing it for the post-training? Really, observing other players, I think. You know, the true base model kinda stopped really being a thing around Qwen 2, but definitely around Qwen 2.5, where you started to see how much post-training data was making its way into the base models themselves. And then you start to see the models that have done that, how malleable they are with RL, Qwen 2.5 and Qwen 3 being good examples. And you start to see, like, oh yeah, they are doing as much in probably the last thirty percent of training to make it so that when they go to do RL or post-training, they&#8217;re gonna have a really good time. They&#8217;re just way easier, way more malleable, way more performant than what you had in Llama 2 or Mistral 7B. So I knew that intuitively, kind of going into this, but it wasn&#8217;t until after Mini and Nano... well, definitely 4.5B, where we were like, &#8220;Yeah, we definitely need to juice our mid-training quite a bit.&#8221;</p><p><strong>00:31:31 Nathan Lambert:</strong> Yeah, I agree. Okay, this was fun. We&#8217;ll probably revisit themes from this. I can definitely go over time and keep chatting because I&#8217;m enjoying this. And for context, Mark and I had coffee at some point when I was at some conference in SF, and I was like: damn straight, this is a fun bet that you&#8217;re making. So I&#8217;m trying to recapture as much of this as I can. 
For context, it was, like, in July, which is similar to when you decided to start this model, when, like, Qwen Coder came out, Kimi came out, GLM 4.5 came out, and I was just looking, and Llama had kind of become a meme of going away. And that&#8217;s why I launched the ATOM Project, where I was like: come on, we need to have some people doing this. And I think that it&#8217;s hard in the US because there&#8217;s so much money to be made on AI. Like, the big tech companies are like: &#8220;We see it, and we&#8217;re gonna take it, so I don&#8217;t need to bother with caring about open models &#8216;cause we don&#8217;t need it.&#8221; But from an ecosystem perspective and a long-term tech perspective, I don&#8217;t think that works very well for the country. So it&#8217;s kind of this weird middle ground of, like, how do you convince people to actually build open models? Like, I have calls with people in government asking me what I would actually do. So it&#8217;s very hard to think about this. And then it&#8217;s just, like, to hear that you guys are making this bet is very fun to me, but it&#8217;s also based on actual learning from trying to do this. So you&#8217;ve been trying to train open models. I think Mark and I have both been at Hugging Face in our past, and you were trying to sell people on using open models, and there is a market for this, but it wasn&#8217;t enough to not have the base models. So I think talking about your experience in selling on-prem open models and why you needed to train your own end-to-end, and why you needed to train bigger, is great, because I hope there are more stories like this, and it kind of fills a void and inspires people to work in it. So, however you want to take this prompt.</p><p><strong>00:33:24 Mark McQuade:</strong> Yeah, I can jump in. 
I mean, yeah, when I started Arcee in 2023, right, all we did was post-training. And we worked with a lot of large organizations and did model customization, you know, for their use case on their data. And we were using Llama-based models, Mistral-based models, and then some Qwen. I don&#8217;t even know if we actually did much Qwen at that time, right, Lucas, but-</p><p><strong>00:33:54 Lucas Atkins:</strong> No, we did. Yeah, we did. Later on, but-</p><p><strong>00:33:56 Mark McQuade:</strong> Later on, right?</p><p><strong>00:33:57 Lucas Atkins:</strong> We did, and then we ended up not, because after a lot of Chinese models started to come out, the companies didn&#8217;t wanna use Chinese models, so then we kind of went... Yeah, it was kind of just tricky.</p><p><strong>00:34:08 Mark McQuade:</strong> Yeah, and people don&#8217;t realize that that&#8217;s real.</p><p><strong>00:34:10 Nathan Lambert:</strong> People don&#8217;t realize that that actually happened.</p><p><strong>00:34:13 Mark McQuade:</strong> Yeah, no, that&#8217;s a real thing. That&#8217;s why we started going down to pre-training: because, well, you know, Meta did their thing and kind of got out of it, right? So the main US player got out of it, and we were working with a lot of US-based enterprises that were not comfortable using Chinese-based architectures. And if you wanted to use the best open models of the day, it started to really trend towards the Chinese labs. To the point where we are now, where, like, ninety-plus percent of the top open models are coming out of China.</p><p><strong>00:34:47 Nathan Lambert:</strong> Yeah, like, Cursor&#8217;s building on it and stuff. Like, people are building on these things.</p><p><strong>00:34:52 Mark McQuade:</strong> Yeah. 
So we said, &#8220;Okay, let&#8217;s...&#8221; We were so reliant on the Metas of the world and the Mistrals of the world, and Mistral largely stopped open sourcing fully. So we said: you know what? We&#8217;ll just go down the stack, and we feel we&#8217;re capable enough to train our own models from scratch, and then we control the stack. We control the core, as opposed to relying on others to release great models. And during this time, it just happened to be that there wasn&#8217;t a tremendous number of US companies doing it. So, from our perspective, it was kind of a win-win, in that we were able to own more of the stack by going down to pre-training and creating our own models, and we were entering a space where there wasn&#8217;t a tremendous amount of competition, to be honest. And, you know, Lucas and I said this yesterday: as a startup, you don&#8217;t want to directly compete with, you know, X or OpenAI, or Anthropic, or Google, because they have more money than God, and they can do whatever they want. But when you&#8217;re doing open weights, it&#8217;s a different kind of competition; they don&#8217;t sit in there, right? You&#8217;re kind of going on your own path, where there isn&#8217;t a tremendous number of players, and you can find your way and build your niche and kind of go from there and become something big. So it kind of all happened to coincide for us back in July, and we went all in.</p><p><strong>00:36:23 Nathan Lambert:</strong> Yeah, the all-in thing is real, because this is expensive. I think I could dig up in my research the cost of daily, um, twenty-four T8 B300. 
I think I&#8217;ve seen this type of cost at Ai2, where we have long rentals, and we&#8217;re like: I know exactly how much this costs, and it&#8217;s not cheap. A way to transition this is, like, do you see the demand? You were selling open models; does this kind of become continuous, where people are like, &#8220;You helped us deploy this model, but it&#8217;s not good enough,&#8221; and you&#8217;re like, &#8220;Well, we have this, and we can help you do it&#8221;? Or is it a we-will-build-it-and-they-will-come type of situation? Like, how much continuity is there in this?</p><p><strong>00:37:17 Mark McQuade:</strong> Yeah, I think it&#8217;s largely-</p><p><strong>00:37:19 Nathan Lambert:</strong> I-</p><p><strong>00:37:19 Mark McQuade:</strong> From my perspective, I think it&#8217;s largely if you build it, they will come. Because we stopped focusing on that whole revenue generation side of the house when we started to go all in on being this frontier lab on the open source side. There&#8217;s a couple pieces to that that I think we should all be very proud of inside of Arcee. We not only went all in by committing a significant amount of capital; we committed sixty-five, seventy percent of our capital to these models, which is a large amount for a startup. So that&#8217;s not, like, a dip-your-toe-in; that&#8217;s, like, we&#8217;re all the way in.</p><p><strong>00:37:55 Nathan Lambert:</strong> Yep.</p><p><strong>00:37:55 Mark McQuade:</strong> But we did that at the same time as abandoning essentially the whole revenue angle to go all in on it, because we couldn&#8217;t focus on both. So we said, &#8220;We know how to make revenue on open models. 
We&#8217;ve been doing it for two years. Now let&#8217;s take a step back, because the way we had that business set up wasn&#8217;t repeatable or sustainable. Let&#8217;s take a step back, let&#8217;s build these models from scratch, let&#8217;s come up with the Trinity family, then let&#8217;s go back to generating the revenue side of the house and the monetization piece,&#8221; which I think we are in a good position to capitalize on even more now. But we kind of walked away from it to do what we&#8217;re doing here.</p><p><strong>00:38:36 Nathan Lambert:</strong> Yeah, I love this.</p><p><strong>00:38:36 Lucas Atkins:</strong> Yeah, I mean, when there&#8217;s only, like, thirteen researchers... Well, we&#8217;re obviously doing our own products and our own models, but when you&#8217;re working with customers, inevitably those are the same people that need to help train models for customers, and we got to a point where we were really beginning to, like, do Mini and Nano. We were getting down to, like, the start date of the cluster, where having myself or Mark, or even Varun and others, pulled into customer conversations or contracts... Like, we would not be where we are if we had continued, you know, working with ten customers at once. So-</p><p><strong>00:39:19 Nathan Lambert:</strong> But-</p><p><strong>00:39:19 Lucas Atkins:</strong> ... we scaled that down pretty drastically. I do think that when... You know, Mark and I put a lot of thought into, &#8220;Okay, well, we&#8217;re gonna spend all this money to train these models, like, how do we not...&#8221; I think one of the things that makes the idea of going all in on training open weight models hard is that you&#8217;ve seen other people try it. 
And, like, M-</p><p><strong>00:39:42 Nathan Lambert:</strong> Like, do you think Meta or Mistral went all in?</p><p><strong>00:39:46 Lucas Atkins:</strong> I think, well-</p><p><strong>00:39:48 Nathan Lambert:</strong> Meta obviously did.</p><p><strong>00:39:48 Lucas Atkins:</strong> I think they both... Yeah. When I say all in, Mistral was one of the core ones I&#8217;m thinking of, where they were a venture-backed company that had a fiduciary responsibility to bring in money, but were also trying to release open weight models for, you know, the West, and for their communities, and for the world. And they tried doing closed versions and then monetizing off of that. They also, more recently, luckily for all of us, have gotten back to their kind of Apache 2.0 roots, and-</p><p><strong>00:40:30 Nathan Lambert:</strong> Oh, my God.</p><p><strong>00:40:30 Lucas Atkins:</strong> And-</p><p><strong>00:40:30 Nathan Lambert:</strong> Have you seen the download numbers on Mistral 3 Large?</p><p><strong>00:40:33 Lucas Atkins:</strong> I haven&#8217;t. No, what is it?</p><p><strong>00:40:35 Nathan Lambert:</strong> Oh, no bueno, sir.</p><p><strong>00:40:38 Lucas Atkins:</strong> Hey.</p><p><strong>00:40:39 Nathan Lambert:</strong> Carrying on. Sorry.</p><p><strong>00:40:41 Lucas Atkins:</strong> But, I mean, yeah, you know-</p><p><strong>00:40:42 Nathan Lambert:</strong> The Large Instruct model has low downloads in the last month. I honestly don&#8217;t know what&#8217;s going on. Maybe there&#8217;s some, like, quantized version out there. I was confused.</p><p><strong>00:40:50 Lucas Atkins:</strong> Maybe. Well, I mean, yeah. But I think that we-</p><p><strong>00:40:52 Nathan Lambert:</strong> It&#8217;s hard to get adoption. The competition is insane.</p><p><strong>00:40:55 Lucas Atkins:</strong> Hmm. 
Well, that&#8217;s... yeah, I mean, that could be a whole conversation also: like, how do you actually get people to use it?</p><p><strong>00:41:00 Nathan Lambert:</strong> I was gonna ask you, like, how do you get people to really sell into this? You said you&#8217;re good at it.</p><p><strong>00:41:06 Lucas Atkins:</strong> Yeah, I think that the-</p><p><strong>00:41:08 Nathan Lambert:</strong> Continue your point; we can come back to it.</p><p><strong>00:41:11 Lucas Atkins:</strong> No, no, but they all kind of tie into it. We knew that the market was there for custom models. It was two years ago, frankly, and it&#8217;s even more so now, because RL has drastically increased the areas where you can hill climb and become really powerful with a tiny model. But also, people are beginning to see how powerful training in a product is. Like, you see Claude Code, you see Codex... I think Deep Research was kind of one of the first ones that really opened my eyes to what was possible, when you&#8217;re training in the same environment that you&#8217;re serving your users. So we knew that people wanted it. We&#8217;d had good success with customers in the past using other people&#8217;s open models. So it was less of a question of, like, could we monetize it, or will we? It was just a matter of: could we get a model such that, given a wide suite of basically being able to pick any model in the world, our researchers and our teams would reach towards our own? And, luckily, I think we&#8217;re there.</p><p><strong>00:42:31 Nathan Lambert:</strong> Uh-</p><p><strong>00:42:31 Lucas Atkins:</strong> On the topic of, like, how do you get people to use it? 
How do you get adoption? You know, I&#8217;ve never wanted Trinity&#8217;s, or our, biggest advertising thing to be, like, US. You know-</p><p><strong>00:42:45 Nathan Lambert:</strong> Yeah, I know.</p><p><strong>00:42:45 Lucas Atkins:</strong> ... like, if your entire-</p><p><strong>00:42:47 Nathan Lambert:</strong> I know, man, it hurts me.</p><p><strong>00:42:48 Lucas Atkins:</strong> Yeah, if your-</p><p><strong>00:42:48 Nathan Lambert:</strong> I spent months reckoning with this.</p><p><strong>00:42:50 Lucas Atkins:</strong> Yeah. If your entire value prop is that you&#8217;re an American company, great, but ultimately people are gonna use the best. And so I think for the people that need a US-based model, because their compliance or legal teams won&#8217;t let them use something out of China, it&#8217;s gonna be a fantastic option. But I think, you know, the next phase of what we&#8217;re doing as a company is: all right, now we&#8217;ve proved to ourselves, and maybe the wider industry, that we deserve to be in the conversation, and we can train models of this scale. Then it&#8217;s like, okay, how do we train the best one? &#8216;Cause really, people&#8217;s loyalties are very fickle, and you go to what&#8217;s the best.</p><p><strong>00:43:41 Nathan Lambert:</strong> I guess it&#8217;s, like, how much do you think you&#8217;ve learned about being able to tune a model narrowly by going and building the whole stack? Something we talk about is the ability to specialize models, and I&#8217;m kind of of the opinion that you just make a better general model right now, &#8216;cause the pace of progress is so high. But the question is, like, can we tune an OLMo that&#8217;s very good at science or something? And I- ... 
would guess that by training the entire model, you&#8217;re going to be able to actually do a better job at what you were doing, but I don&#8217;t know how to articulate why or what that looks like.</p><p><strong>00:44:18 Lucas Atkins:</strong> The simplest reason why the answer to that question is yes is because we know what went into the model. Like, we know what it actually saw at the later stages of training, during the decay. And so that helps influence, A, what kind of data and what topics and what format we are giving these models in post-training. But it also allows you to know, like, okay, where do I absolutely wanna crank things: how much of this, say, 230-billion-token dataset do we want to be math or coding? And a lot of that&#8217;s influenced by what you&#8217;re able to put in-</p><p><strong>00:45:06 Nathan Lambert:</strong> How much of your post-training-</p><p><strong>00:45:07 Lucas Atkins:</strong> ... post-training.</p><p><strong>00:45:07 Nathan Lambert:</strong> ... do you expect to redo? Like, how much can you say about when you&#8217;re serving something on-prem? You&#8217;re not gonna redo the pre-training. You might, for a very big customer, redo mid-training or do continued pre-training, in which case you do need the pre-training data to keep it stable. Which is a use case where I would love to see a paper that&#8217;s like, &#8220;Because of OLMo being open, we continued to pre-train on biology, and we mixed half of their exact mid-training dataset in with our dataset, and it worked,&#8221; yada yada. Like, you could obviously do that, but how much do you think is gonna be, like, the standard, where you fine-tune the last instruct model, or are you gonna have to retouch the post-training for a customer? 
Because I really feel like-</p><p><strong>00:45:48 Lucas Atkins:</strong> Um-</p><p><strong>00:45:48 Nathan Lambert:</strong> ... it&#8217;s just at the end.</p><p><strong>00:45:50 Lucas Atkins:</strong> I think-</p><p><strong>00:45:50 Nathan Lambert:</strong> But it would be fun if you had to change it.</p><p><strong>00:45:52 Lucas Atkins:</strong> For the most part, I think a lot of tasks will be fine just starting from the released, official post-trained version. Now, maybe &#8220;simpler tasks&#8221; is the wrong way to frame it, but if it&#8217;s like, &#8220;Oh, hey, we&#8217;re doing a deep search agent; we want it to do 30 calls before...&#8221; That would be a good case for just starting with the finished model that we released that&#8217;s already post-trained. Now, if we&#8217;re going into something along the lines of a very low-resource programming language, or something that it didn&#8217;t see a lot of in pre-training, or we&#8217;re wanting to train this thing to be really good at Humanity&#8217;s Last Exam, but with tools... Actually, I have a much better answer to this question as I was thinking through it, but most of that holds the same. I think the world where we&#8217;re gonna be doing a lot of extra instruct and SFT and post-training is gonna be when we&#8217;re trying to distill capabilities from Large into Mini or Nano. So say, like, oh, this Large is really great at invoice processing, but it&#8217;s also 400B, and the company doesn&#8217;t wanna be hosting that on-prem, you know-</p><p><strong>00:47:24 Nathan Lambert:</strong> Ah.</p><p><strong>00:47:24 Lucas Atkins:</strong> ... 
let&#8217;s go generate a new one.</p><p><strong>00:47:25 Nathan Lambert:</strong> Do you have costs off the top of your head for, like, what the hosting costs are for each of the models? Are people all gonna host these models in the same way, or is there actually-</p><p><strong>00:47:32 Lucas Atkins:</strong> Uh-</p><p><strong>00:47:32 Nathan Lambert:</strong> ... a wide variance? And if you have, like, the same three models, do almost all of your customers end up hosting them the same way, or do you end up doing a lot of configuring the model to fit the right hosting for them? Like, is that part of-</p><p><strong>00:47:44 Lucas Atkins:</strong> It depends.</p><p><strong>00:47:44 Nathan Lambert:</strong> ... the business model?</p><p><strong>00:47:45 Lucas Atkins:</strong> It kind of... We tried to move a little bit further away from that, because you get into the risk of being, like, a consultancy, and that becomes tricky, where there&#8217;s not a very clear separation of concerns. But it would change depending on: were they using AWS? Did they have a commit with Azure? If not, okay, then we can go to someone like Prime Intellect or Parasail and get maybe a cheaper rack of eight. It just really depended. There were also quite a few people serving them just using, like, llama.cpp. So, like, on CPU-</p><p><strong>00:48:25 Nathan Lambert:</strong> Is the 400B designed to fit onto one rack of eight 80-gigabyte GPUs in FP8? Is that how you designed it? &#8216;Cause Llama, whatever, Llama 405B was the same: one rack in FP8 works pretty well.</p><p><strong>00:48:41 Lucas Atkins:</strong> It&#8217;ll do... we... 
well, you&#8217;ll be able to get really good throughput, a little bit lower concurrency on a, a rack of eight H100s at FP8, and then for, like, our, you know, what we&#8217;re serving, we&#8217;re serving them on, uh, a series of H200s, but we&#8217;re not doing, like, multi-node inference. Uh, but that&#8217;s just to add more, you know, replicas and- ... other kinds of things.</p><p><strong>00:49:03 Nathan Lambert:</strong> Hopefully, eventually. I think that the-... Do you have anything else to say about selling open models? I think that generally, like, how do you think about the market for AI? &#8216;Cause I see the market as being so big, but the- with specifically with open models, it&#8217;s so hard to measure. I think I&#8217;ve started talking to some of the Chinese labs at all- as well, and I like to ask them, like, this is very US-centric and like Fortune 500 or whatever, and it&#8217;s just like, who the heck uses these models? I think- I guess another question is, like, what license or do you know the licenses you&#8217;re gonna use for the biggest models? And I think they&#8217;re, like, you&#8217;re, you&#8217;re playing with fire &#8216;cause people can use it for free, obviously, but potentially- ... you&#8217;ll get to hear like, &#8220;Oh, shit, somebody actually used our model for this.&#8221; And I think any successful business, you&#8217;re gonna want... You, you, you know that this model is not gonna be very relevant in a year with the pace of progress. So like- ... how do you think about your license decisions?</p><p><strong>00:49:55 Lucas Atkins:</strong> Uh, we- you know, with the 4.5B, we tried to do like a, like a, a reve- one of those revenue-gated licensing. So it&#8217;s like, oh, it&#8217;s completely free for you to use for commercial and whatnot, but if you or your company made over, I think it was like $1.7 million last year, then you need to come to us and get a license. 
And what we ultimately found was like, it, it didn&#8217;t... Maybe for some people who are just only trying to train the model, release it on Hugging Face, and then just call it a day, maybe that is a huge requirement. But when so much of our, our, our company is built around, you know, training custom versions of the models, and, and not even just ours, but in general, even before we did pre-training. Like, at the end of the day, i- as long as we were using it, a- and we knew that we were in full control of, of whether- if we really succeed, it&#8217;s because we trained the models, we did them well, and we executed on it well. If we fail, it&#8217;s because we, uh, didn&#8217;t execute, instead of, oh, some company just stopped releasing good open models. Um, so we eventually switched to just Apache 2.0, and Trinity Large is also gonna be Apache 2.0. Um, you know, I&#8217;m- I think it is-</p><p><strong>00:51:23 Nathan Lambert:</strong> I think this is the right approach. I have a big investor-</p><p><strong>00:51:25 Lucas Atkins:</strong> Yeah, I think it-</p><p><strong>00:51:25 Nathan Lambert:</strong> Without, without naming other companies, it&#8217;s easy- like, raising a lot of money, whe- or being Meta and releasing open models, and do it- and you could release it with non-commercial, and you could get all these, like... You could talk to, I don&#8217;t know, fucking Adobe, whoever. Oh, Adobe&#8217;s too big. They&#8217;ll have good AI. Some... I don&#8217;t know, a bank. Bank of America. You could run Llama on Bank of America and make good money on this. But I just feel like the cultural home of open source AI, and I don&#8217;t think- it&#8217;s impossible to know who wins it, and I don&#8217;t think that you&#8217;re in the prime position, and I don&#8217;t think that it&#8217;s easy to win, but you&#8217;re doing a thing that aligns with it. 
It&#8217;s the person that just, like, commits to building the models and learning how the ecosystem works, and to rebuild the models based on the feedback th- that you get from people, and to just kind of commit to an evolving process. And if the whole thing works out, there will be a lot of value, and the person who understands it best should be able to learn how to extract said value. And I think that I&#8217;m personally, like, sometimes frustrated with Hugging Face, &#8216;cause I feel like they have sat on that s- a sort of position like this, and they- ... haven&#8217;t figured it out. Not that it is easy to figure it out, but I think that has to be the ideal of open source AI, of like, if it&#8217;s really gonna work, that&#8217;s, that&#8217;s what I hope it looks like. And it&#8217;s like, I, I don&#8217;t know, maybe you guys could do some of that. Like, I have a question of like, could you figure out how to make models that are more fine-tunable- ... after all this post-training? Because you need to sell it to a- you need- ... you, you know the customer&#8217;s not gonna want it off the shelf. And I don&#8217;t know how to train to post-training to make sure that you don&#8217;t, you don&#8217;t cook it. Maybe you just learn that you need to warm up the model in a l- in the right way, and you just learn the technique of training downstream. But when you talk to people doing research, the different base models have such different characteristics. I think one of them is character training. I did this paper, and the guy was like: &#8220;Qwen and OLMo love their character,&#8221; and I&#8217;m like, &#8220;I have no idea why.&#8221; And but it&#8217;s like Llama and Gemma, you can change them so much.
And I&#8217;m like, &#8220;Dog, like, please figure out why this is the case.&#8221; And for one thing, it&#8217;s really cool, but also, like, in your case, that would unlock a lot of value to be like, we know exactly what the model&#8217;s gonna do, and we know exactly how to change it. So.</p><p><strong>00:53:35 Lucas Atkins:</strong> Yeah-</p><p><strong>00:53:36 Nathan Lambert:</strong> Uh</p><p><strong>00:53:36 Lucas Atkins:</strong> ... it, it, that&#8217;s- no, you&#8217;re, you&#8217;re, you&#8217;re right on the money. I think that even, uh, going into the post-training at large, we, uh, one of our researchers came out with, like, a pretty cool, um, experiment and ablation run that they did on drastically reducing catastrophic forgetting. And I almo- I mean, this was, like, three days before we were gonna start doing SFT, and then we ultimately just... I, I ended up pausing on it because it was just throwing something in that wasn&#8217;t tested. But, um, yeah, I think-</p><p><strong>00:54:08 Nathan Lambert:</strong> A good research lead. You did the right thing.</p><p><strong>00:54:10 Lucas Atkins:</strong> Yeah, I think, I think one of the most important things long term, you know, as we look at kind of what our research priorities are for this year is, is there&#8217;s obviously just how to scale RL and, and make these- the end result of the model as good in as many situations as possible. Um, but I think the other half of that is, you know, how do we make the, the, the speed and efficiency and, and performance of customizing them as, as fast as possible, and as easy as possible.</p><p><strong>00:54:42 Nathan Lambert:</strong> Yeah. Do you learn in making open models from your experience just kind of running these open software things in MergeKit and DistillKit? 
I know there was a whole license journey on one of those as well.</p><p><strong>00:54:52 Lucas Atkins:</strong> Yeah, DistillKit.</p><p><strong>00:54:52 Nathan Lambert:</strong> Do you feel like they&#8217;re kind of isolated?</p><p><strong>00:54:54 Lucas Atkins:</strong> Or MergeKit. Um, yeah, I mean, I think so. I think that, that, um, you kind of have to play the tape out. With MergeKit-... it was by far our most popular piece of software we&#8217;d ever released, but it was so popular because it took something that isn&#8217;t fundamentally very complicated, but we ma- but it&#8217;s time-consuming, and standardization is great for things like that, and we made it, uh, you know, streamlined and easy to do and fast, and you could experiment and ablate really quickly for, you know. And, and so I, I think that when we switched that to, like, a, you know, a, a similar, uh, revenue-based licensing, like, it, it didn&#8217;t end up having the value prop that was important because are you gonna pay Arcee, you know, thousands of dollars, or are you just gonna have one of your researchers-</p><p><strong>00:55:52 Nathan Lambert:</strong> You&#8217;re gonna have clone code in a week, right?</p><p><strong>00:55:52 Lucas Atkins:</strong> recreate it in a week, right? Yeah, so it&#8217;s-</p><p><strong>00:55:55 Nathan Lambert:</strong> In a day.</p><p><strong>00:55:55 Lucas Atkins:</strong> It&#8217;s, it&#8217;s kind of... It, it&#8217;s remi- it&#8217;s remembering like, okay, what is- what problem is this solving, and is this even a prob... Like, is the solution to this monetizable? Um, and so MergeKit, we brought it back to the original license, but I think with even viewing the models in the same way, it&#8217;s like it&#8217;s... Open source is an unbelievable marketing tactic.
Like, no one would care about Arcee if we weren&#8217;t open sourcing stuff, &#8216;cause as soon as you do something closed source, if you&#8217;re not the best or the cheapest for your price point, I mean, your performance point, no one&#8217;s gonna use it. Because-</p><p><strong>00:56:30 Nathan Lambert:</strong> Um, another question on this. Um, do you think that open models are kind of at a disadvantage when progress is so high? Because it&#8217;s potentially easier to swap APIs than open model configurations, especially if, like, model weights are changing sizes or something like this. Where it&#8217;s like, &#8220;Oh, I can just upgrade to the new Opus, and I do this.&#8221; Like, does that, like, uh, disincentivize people from using it? Or do you think most of the people are like: &#8220;I can only use open models, therefore, I&#8217;m gonna use open models?&#8221;</p><p><strong>00:56:56 Lucas Atkins:</strong> Uh, I think for the people who are using, like, s- either self-hosted or, you know, um, uh, bespoke, uh, you know, engines to, to run it, where they have complete... You know, in a VPC or they have complete control over, like, data in and out, egress, ingress. I don&#8217;t think that&#8217;s really gonna be so much of a problem because they&#8217;re obviously doing it for a reason. Um, like, they&#8217;re either for privacy or security or, or HIPAA or SOC 2. For whatever reason they&#8217;re doing it, um, I, I don&#8217;t think that that&#8217;ll be, um, so much of a blocker, but I definitely do think that, um, you know, by far, e- even, even with some of the, the larger open... You know, like inference players, like Together and Fireworks, that, that host a lot of open models.
Like, being feature- being on feature parity with a lot of these, these larger labs&#8217; APIs is gonna be extremely important, um, o- of being able to serve, you know, um, with features that they&#8217;re used to, like prompt caching, that kind of stuff.</p><p><strong>00:58:03 Nathan Lambert:</strong> Yeah, are- like, I, I think I saw that you guys are setting up an API as well. Is that kind of what the vision there is, is being able to o- offer parity at least, or, like, make it easy for people to consider it?</p><p><strong>00:58:13 Lucas Atkins:</strong> I think so. I, I- we&#8217;re- we very... Yeah, we are doing our own API. We are hosting it. Um, we haven&#8217;t- we, we push a lot of that through Open Router just because it&#8217;s such a great place to get, like, discovered. Um, as... If we see, like, tremendous growth there, that would obviously be where we&#8217;ll, we&#8217;ll invest very heavily. Um, whereas the right move might be to let other people host it, and we invest super hard on the infra for, like, make- taking advantage of the models, um, and, and customizing them. There&#8217;s, there&#8217;s, there&#8217;s a few avenues we have ahead of us then, and we have, you know, projects going kind of toward to poke at each one. Um, and we&#8217;re just kinda getting as much data as we can before we... I mean, we&#8217;re gonna have to go all in on another direction soon. Not, not like pivoting away from pre-training, but now that we&#8217;ve done that, now w- what&#8217;s the next big bet we&#8217;re gonna make, and how do we go fully into that? So we&#8217;re trying to figure out what that is.</p><p><strong>00:59:12 Nathan Lambert:</strong> Yeah. My two last kind of, like, real questions are, like, one is... I guess I can start with, like, where do you see the open model ecosystem? Do you think- where would you see it changing substantially in the next six or twelve months? I, like... Or, or do you? 
Or you just kinda think we&#8217;re marching along for a while?</p><p><strong>00:59:31 Lucas Atkins:</strong> No, I think we&#8217;ll, I think we&#8217;ll, we&#8217;ll be... I, I, I don&#8217;t think it&#8217;s an unrealistic prediction to make that by the end of 2026, like, the best model in the world is, is some degree of open. Uh, I think that&#8217;s very, very possible, especially with, like, what I&#8217;ve seen GLM and, and MiniMax do recently. Um, they have started to find that secret sauce that takes you out of just being good on benchmarks and, like, genuinely useful in people&#8217;s day-to-day workflows. And, um, I wouldn&#8217;t- like, if, if I, you know, came back, and I... Someone came from the future and told me that the best model in the world was, uh, an open-weight model, I wouldn&#8217;t be surprised. I actually think we&#8217;re on a, a, a super good trajectory, and, and, and fostering and, and promoting that kind of work and adoption here in the United States is gonna be extremely important.</p><p><strong>01:00:24 Nathan Lambert:</strong> And where do you see the company going? &#8216;Cause like, like, I have my guess. Like, you kind of hopefully-</p><p><strong>01:00:31 Mark McQuade:</strong> What&#8217;s, what&#8217;s your guess? I wanna hear your guess.</p><p><strong>01:00:31 Nathan Lambert:</strong> Um, you can hopefully do a mix and kind of oscillate into trading when you get... Like, you need to start having the feedback of the real world. I think that&#8217;s obvious. Like, it&#8217;s o- like, it&#8217;s... Well, obviously, you need to make money to survive as a company, but then you need to start using that as the feedback to guide training. And then it&#8217;s like, you need to figure out how to balance and do some of them at each time, and you can plan your cluster at different times, and then you kind of... 
Hopefully, they become a, a loop across each other, and they kind of make it so obvious of why you each need them, &#8216;cause it, it seems somewhat natural.</p><p><strong>01:01:03 Mark McQuade:</strong> Yeah, I mean, exactly. You know, you kinda hit, hit it right on the head. Um, you know, getting feedback and then kinda steering the ship from there, um, is, is probably-</p><p><strong>01:01:15 Mark McQuade:</strong> ... exactly what we&#8217;ll do, but we have a good idea already. I mean, first and foremost, you know, we talked about it earlier, w- we&#8217;ve spent a tremendous amount of money. So, uh, we need to go raise some money after we - after we get, you know... We need people to back the, the, the mission and the vision of US open source and, and, you know, so, um, because, uh, you know, we, i- i- Lucas had mentioned about, like, MergeKit and how we flip-flopped the license and, you know. I mean, we&#8217;re a smaller-sized start-up. We have-- we&#8217;re-- we gotta think of kinda unique ways to try and generate revenue because we don&#8217;t have the money of the large labs. So, uh-</p><p><strong>01:01:52 Nathan Lambert:</strong> Well, I think it&#8217;s a benefit to the employee. I think a lot of these labs have over-raised.</p><p><strong>01:01:56 Lucas Atkins:</strong> Yeah, I like, uh- uh, I-</p><p><strong>01:01:57 Nathan Lambert:</strong> OpenAI, Anthropic, and all of them are fine. Like, with the OpenAI, Anthropic, Cursor scale, like, let it rip. They should, they should really rip the raising. But all the other companies that are stuck at the, like, the one to two billion range without, like, obvious traction, like, the risk goes to the... I mean, you could-- a lot of them do secondary, so a lot of the founders get out. But it&#8217;s like, the risk is the employees get nothing.</p><p><strong>01:02:21 Lucas Atkins:</strong> Yeah.
Yeah.</p><p><strong>01:02:22 Nathan Lambert:</strong> There is a lot of money, but that&#8217;s also why I like the approach, &#8216;cause it&#8217;s like, &#8220;Oh, you&#8217;re doing the actual start-up thing.&#8221;</p><p><strong>01:02:28 Lucas Atkins:</strong> Yeah, yeah. Yeah, I mean, I think... W- what I was gonna add to what Mark... is just like, what- whatever we do from, uh, uh, uh, scaling and, and speeding things up and growing, um, my goal is to keep our research and engineering teams pretty small. I think, I think that one of the reasons we&#8217;ve been able to, to move as quickly as we have is it&#8217;s been, like, a small group of, like, highly intelligent, smart, and opinionated people sitting in a room, debating in good faith on decisions. And I think that that&#8217;s, uh, uh, under the constraints of, &#8220;Hey, we don&#8217;t have five hundred million dollars to go and, you know, to rip on, on, you know, X, Y, and Z.&#8221; So and I think that&#8217;s kind of where creativity comes from, and I think that fostering a culture like that over time is how you can kind of make it so that excellence is less of like a, um, an accident, and it&#8217;s actually, like, a by-product of the way that you work. So, so we&#8217;re gonna stay small, we&#8217;re gonna stay lean, but, um, I, I do think that, like, the, the major, um, kind of challenge for us over the next probably six months, beyond any other models we might have, kind of, uh, think or we&#8217;re thinking about, is, is getting up to, like, post-training parity with the likes of DeepSeek, and GLM, Qwen, and others.</p><p><strong>01:03:47 Nathan Lambert:</strong> Yeah. I, I hear lots of horror stories about this, where it&#8217;s usually and-- it&#8217;s-- you end up having people that are going after different important abilities, but, uh, like, doing each of the abilities alone is pretty easy to hill climb, but then you just end up with such a mess. It&#8217;s like you&#8217;re- ... 
building a custom puzzle, and you&#8217;re building all these custom pieces, and they&#8217;re magnificent, and then you&#8217;d have to, like, pick up these pieces and assemble this unknown thing at the end. And it&#8217;s like-</p><p><strong>01:04:12 Lucas Atkins:</strong> Like they didn&#8217;t have the same designer, right? Yeah.</p><p><strong>01:04:15 Nathan Lambert:</strong> At AI2, we&#8217;re barely scratching the surface of this. Like, you talk to the people at the frontier labs, and it&#8217;s like, holy cow, like, post-training is really the Wild West. But a lot of it works. I think, like, we find-- like, even like model merging gives a ton of performance across the whole- ... training pipeline. It&#8217;s like- ... you merge at pre-- you merge after each pre-training stage, you merge in post-training. It&#8217;s like-</p><p><strong>01:04:35 Lucas Atkins:</strong> Roon can tell you.</p><p><strong>01:04:36 Nathan Lambert:</strong> But merging post-training becomes a lot more complicated because you- ... can have all these domains and things, uh.</p><p><strong>01:04:41 Lucas Atkins:</strong> Well, in, in merging, you know, it, it actually, it used to be very YOLO, um, the way we used to do it, and, and Charles, who, who created MergeKit, I call him, like, chief alchemist, and, like, you&#8217;d kinda just send him ten promising checkpoints, and he&#8217;d come back a day later with, like, some insane, you know, model that was really good at all of them. And, and you can&#8217;t do that as much in post-training anymore because of, uh, of just the, the formatting and the way that RL is done. Like, you do have to be a little bit more surgical about it, but yeah, everyone can tell you, like, any time we start to see anything worrisome at all in training or, or, or even something going really good, you know, &#8220;Lucas, what do we do?&#8221; I&#8217;m like: Merge it.
I&#8217;m like, just-</p><p><strong>01:05:21 Nathan Lambert:</strong> Merge.</p><p><strong>01:05:21 Lucas Atkins:</strong> ... I&#8217;m like: &#8220;Just take it, just merge it. Let&#8217;s see.&#8221; And more often than not, it fixes it, so...</p><p><strong>01:05:27 Nathan Lambert:</strong> Um, do you merge during RL? Like, you could just, like, merge the last few checkpoints and resume or something?</p><p><strong>01:05:32 Lucas Atkins:</strong> We&#8217;ve ex-- we&#8217;ve, we&#8217;ve dabbled in that, not, not for what we&#8217;ve done. You know, again, a, a lot of the, the mini, nano, and large story for Trinity is, like, getting to a level of... what was my level of complexity I was comfortable with us undertaking, and then, uh, not introducing anything more. So, um, not yet. But we, I mean, we, we, uh, regularly merged. We didn&#8217;t do it for Large, but we used to merge a lot, um, during just, like, your standard, uh, um... When we&#8217;d do, like, RLHF, we used to do a bunch of merging. We&#8217;d do it, like, every five checkpoints. We would-</p><p><strong>01:06:11 Nathan Lambert:</strong> Online RLHF or DPO?</p><p><strong>01:06:13 Lucas Atkins:</strong> It was DPO.</p><p><strong>01:06:15 Nathan Lambert:</strong> Yeah. It&#8217;s so much easier to get started. One of my goals is to have somebody figure out how to do actual online RLHF, pure LM feedback, obviously, for scaling. But it&#8217;s just like- ... it&#8217;s, it&#8217;s unsavory to it&#8217;s just, like, doesn&#8217;t look like DPO-</p><p><strong>01:06:28 Lucas Atkins:</strong> Yeah, I mean, if, if, you know, if GRPO and kind of op-- in, in the, the present day RL regime, like, if that hadn&#8217;t materialized when it did, I think that would&#8217;ve been a big topic in 2025. But I do think that, you know, GRPO and just the overall, um, DeepSeek and o1 style reasoning and thinking and RL kind of...
Any, a- any person who is thinking of doing that for, like, performance reasons, realize that there was something that had fifty thousand papers released every day on how to do it. Um- ... that was kind of probably right where you&#8217;d get the same amount of performance.</p><p><strong>01:07:07 Nathan Lambert:</strong> Um, do you force dogfooding? Do you make yourself-- do you guys use your own models to understand them? Like, do you, like, make that a thing?</p><p><strong>01:07:14 Lucas Atkins:</strong> Uh, Mini was the first one we could actually start doing that with, um, a- at least for, uh, more general day-to-day tasks. So a lot of our, like, internal Slack, we have stuff that, like, monitors Twitter and LinkedIn for feedback on Trinity and, and, and that kind of stuff. That all runs on Trinity Mini now. Um, and then, uh-... you know, we, we put a good amount of work into, into large being, um, you know, good in, in a bunch of your, like, OpenCode and, and Cline, uh, and, and Kilo Code. So, um-</p><p><strong>01:07:45 Nathan Lambert:</strong> Uh, what does that, what does that work look like?</p><p><strong>01:07:49 Lucas Atkins:</strong> Uh, working with those guys to get data. And then, um-</p><p><strong>01:07:53 Nathan Lambert:</strong> That&#8217;s, I mean- Good for me to know.</p><p><strong>01:07:55 Lucas Atkins:</strong> I mean-</p><p><strong>01:07:55 Nathan Lambert:</strong> I should do that, I guess.</p><p><strong>01:07:58 Lucas Atkins:</strong> Yeah. Yeah, working with, uh... Or, or I mean, it- the way it started was us, like, using open models and then, like, passing those through as the base URL, and then, like, getting the logs from that. Um, and then realizing that, like, that translated pretty well. Um, and then over time, obviously turning this-</p><p><strong>01:08:16 Nathan Lambert:</strong> Um, can you expand on this?
So I was gonna ask you-</p><p><strong>01:08:19 Lucas Atkins:</strong> So-</p><p><strong>01:08:19 Nathan Lambert:</strong> -if you&#8217;re, like, using these open models regularly, &#8216;cause I, I&#8217;m just, like, Claude Code psychosis, man. I&#8217;m like, &#8220;Can&#8217;t take that away from me.&#8221;</p><p><strong>01:08:26 Lucas Atkins:</strong> Yeah, I, I use, I use four... I&#8217;ve used 4.7 a lot. I think 4.7 from GLM was one of the first ones that could replace a lot of my day-to-day. Uh, I&#8217;ll still reach for Claude Code or even 5.2 Pro if it&#8217;s, if it&#8217;s, like, something that&#8217;s, like, really... I- if I do not know how to measure what success looks like for something, I&#8217;ll usually use those. Um, but, uh, yeah, I mean, it, it- even using DeepSeek before, um, kind of their May update was hit or miss. But, um, yeah, w- the reason I decided to, like, start talking to these people and working on, like, how can we get data and, and start making our models good in these systems was I would use them. I had a, um, you know, something that would grab the logs, like, it, you know, inter- as a proxy, so it&#8217;d like grab the logs and then format them in the messages format. And then I saw that and went, &#8220;Yeah, that&#8217;s... You can make a pretty good filter for just, like, standard stuff that you don&#8217;t want, and kind of hit a scale.&#8221;</p><p><strong>01:09:30 Nathan Lambert:</strong> Yeah, it makes sense. So, so you&#8217;re like, uh, open code will let you look at the data, and then you&#8217;re probably gonna get a sense for... Like, I don&#8217;t even actually know how the, on the back end, the code agents in open code format data, which I think is actually something I should just go look at, &#8216;cause then you can design around.</p><p><strong>01:09:44 Lucas Atkins:</strong> Uh, they&#8217;re all different. Yeah. 
Yeah, but you just have to- you just- basically, it all starts from like, what do you want your format to be? And then how can you take what, what those look like to, you know, to... How do you force it into that? The hard thing, though, is, is with newer models like MiniMax and 4.7, the way they do interleaved thinking is, is like... You know, I&#8217;m a big believer in post-training. Like, if you&#8217;re gonna do interleaved thinking, like, every sample in your data set should be that. Um, it, you know, it should follow that same format and that same behavior. So, um, that gets tricky if you&#8217;re trying to, like, take a bunch of Nemo tr... Or, or, or, well, like, uh, DeepSeek data and Qwen data, and then, oh, we&#8217;re also trying to mix in MiniMax, and at that point, you&#8217;re- it, it gets really difficult &#8216;cause they all handle thinking slightly differently.</p><p><strong>01:10:34 Nathan Lambert:</strong> Yeah, I can buy this. Um, okay, this was fun. Any last predictions or things you want people to know about the model? I will say that, um, when you debuted the Trinity models, you had a great blog post that was very to the point, that covered a lot of this. So I&#8217;ll definitely link to the, um, what is it? The Trinity manifesto. I enjoyed reading it. So I&#8217;ll link to that in the show notes, and, oh, hopefully you have a new one for me to read when you&#8217;re done with the model.</p><p><strong>01:10:58 Lucas Atkins:</strong> Yeah, we&#8217;ll do- we will have a tech report. We&#8217;ll have a tech report for you, too. So we, we never, we never did a tech report for 4.5B Mini or Nano because we were so focused on just getting to large, but we also thought it&#8217;d be very interesting to write it under the, the... How do you go from 4.5B to a 400B MoE in six months, and, like, what did we learn-</p><p><strong>01:11:19 Nathan Lambert:</strong> That&#8217;s right</p><p><strong>01:11:19 Lucas Atkins:</strong> ... 
when you&#8217;re viewing it as a whole, so.</p><p><strong>01:11:21 Nathan Lambert:</strong> That&#8217;s about the timeframe that, um, Ant Ling took, too, as well. Ant Ling, uh, the anchor, we talked about, they&#8217;re like... It took us about six months to do, um, Ring-1T and their 1T models, which, like, it sounds like a lot more, but I think that&#8217;s about the same. It, it depends on compute and configs and stuff to go from, like- ... basic modeling to big MoE, which is pretty interesting to see a lot of people speedrun this sort of thing.</p><p><strong>01:11:46 Lucas Atkins:</strong> Yeah, it&#8217;s, it&#8217;s a really, uh... It is a logistical nightmare, but, like, I think everyone on the team has had a tremendous amount of fun over the last, uh, six months. So now the fun begins.</p><p><strong>01:11:58 Nathan Lambert:</strong> Yeah. Congrats on the milestone. Congrats on the model existing. That has gotta be an almighty relief, and I&#8217;ll look forward- ... to see what you all are up to soon. I&#8217;ll stop by at some point next time I&#8217;m in the Bay.</p><p><strong>01:12:10 Lucas Atkins:</strong> Yeah. Yeah, come by. Yeah, come by.</p><p><strong>01:12:12 Nathan Lambert:</strong> Thanks for-</p><p><strong>01:12:12 Lucas Atkins:</strong> Thanks for having us.</p><p><strong>01:12:14 Nathan Lambert:</strong> Yeah. Thanks, guys.</p>]]></content:encoded></item></channel></rss>