Damn, finally a fully open source model. I loved when labs released their models and called them open source when they were really just open weights.
Thank you for sharing the training pipeline as well
This is great news; I am psyched for many reasons. Looking forward to exploring, tinkering, and building with the Olmo 3 models. 🦅
wow. for 7B, the thinking model is pretty good. the foundation models are amazing for general tasks, but if you can fine-tune a 7B beast like this to your domain/problem, I'll always err on the side of the smaller open models for repeatable results.
CrowdStrike Research: Security Flaws in DeepSeek-Generated Code Linked to Political Triggers https://www.crowdstrike.com/en-us/blog/crowdstrike-researchers-identify-hidden-vulnerabilities-ai-coded-software/
This is a rather weak analysis, imo. They do not compare the same prompts when used on the two western models (which likely are Llama 3 70B and GPT-OSS 120B).
A lot of these reports more or less test how much data the model was trained on and how "brittle" it is when used with prompts that throw the model off-guard. DeepSeek famously used little data and (relatively) little compute, which shows. GPT-OSS is the same (if you want to have fun, use the model in a non-English language).
As an example of both: CAISI has analyzed DeepSeek (and compared it against GPT-OSS). Their reported numbers for SWE-Bench Verified (a software engineering benchmark) differ *greatly* from the official numbers because their agent harness (i.e., the software the model runs in) is different from the ones used by the model makers. GPT-OSS and all DeepSeek models plummet, while the closed models (trained on more data, more robust, just better models overall) have roughly the same performance in the CAISI harness as in the officially reported scores.
https://www.interconnects.ai/p/latest-open-models-15-its-qwens-world & https://www.nist.gov/news-events/news/2025/09/caisi-evaluation-deepseek-ai-models-finds-shortcomings-and-risks
Love to see this!
Turns out, there's someone who makes one of the best open-source models in the US.
Amazing write up and amazing release!
Awesome. Great work! Was SuperBPE too risky to try for the tokenizer? It seems like a great efficiency win for both training and inference.
uhHHhHHh email Luca S or Alisa Liu
Nathan,
My sincerest congratulations to you and the whole Olmo 3 team! I’m incredibly excited to dive in and kick the tires on these new models. And perhaps more eager to read your lessons learned and updated model training recipes. I’m certain all will provide great value to the global open source AI community.
This fall has been a long-needed renaissance for American open source AI with the IBM Granite, Nvidia Nemotron, and now Olmo 3 model releases. And ICYMI, yesterday Deep Cogito released an absolute powerhouse V2.1 of their open source 671B MoE model. I’ve been running IMO Olympiad-style math problems through it and the results have been exceptional.
Congrats again on Olmo 3. The future is indeed bright for American open source!
Isn't Deep Cogito built on deepseek?
It is. They asked me for feedback on their release before launching and I suggested they change their explanations. Alas.
Base models are a crucial part of the stack that cannot be ignored.
honestly their explanations seem a bit obfuscatory if not outright deceptive. Also, given the political environment in DC, good luck to any firm that uses a PRC model, no matter how forked, to write code or perform other services they want to sell to companies in regulated industries or ones that have US government contracts. Cheap now perhaps, not so cheap when they all have to be ripped out, and I will bet decent $$s that at some point bodies like the House Select Committee on China are going to make noise about this...maybe should start a polymarket...
Agreed. I'm legitimately excited about people using our base models for real things. Mostly to date Olmo has been just a research artifact.
I'm long overdue for a DC trip...
Yes, DC is a fork from DeepSeek. DC integrates its own training on top of the DeepSeek base using Iterated Distillation & Amplification and iterative policy improvement.
The idea is to internalize reasoning and iteratively shorten chains of thought. They put the high level details here:
https://huggingface.co/blog/deepcogito/cogito-v2-1
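For intuition, the IDA loop can be caricatured in a few lines: amplify a weak policy by spending extra inference compute (here, majority voting over many samples), then distill the amplified behavior back so a single cheap call reproduces it, and repeat. This is a toy sketch of the general idea only, not Deep Cogito's actual pipeline; all names and the voting scheme are illustrative.

```python
import random

random.seed(0)

ANSWERS = {"2+2": "4", "3*3": "9"}  # toy tasks with known answers

def weak_policy(q):
    # A noisy base policy: correct only ~70% of the time.
    return ANSWERS[q] if random.random() < 0.7 else "?"

def amplify(policy, q, n=51):
    # Amplification: spend extra inference compute (majority vote over
    # many samples) to get a better answer than a single policy call.
    votes = [policy(q) for _ in range(n)]
    return max(set(votes), key=votes.count)

def distill(policy, questions):
    # Distillation: cache the amplified answers so that one cheap call
    # now matches the expensive amplified procedure.
    table = {q: amplify(policy, q) for q in questions}
    return table.__getitem__

policy = weak_policy
for _ in range(2):  # iterate: amplify, distill, repeat
    policy = distill(policy, ANSWERS)
```

In a real pipeline "amplify" would be long chain-of-thought or search and "distill" would be SFT/RL on the amplified traces, which is where the "internalize reasoning, shorten the chains" effect comes from.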
so just to be clear, very far from an American model
Fair. A philosophical question (which I don’t know the answer to) is at what point a fork evolves and matures to the point where the base ceases to matter in a meaningful way. As in, DS was the bootstrap to an iterative process in which the final version retains the form of the original while having completely different weights, parameters, and reasoning chains.
A technical point that both of you will appreciate is that base models actually have a lot of “post training data” in them so much of the communist values etc talk is already in the model at that stage. Deepseek really really likes to talk about socialist values.
OK, thanks for the rabbit hole Nathan! Based on your comment I asked DC about criticisms of the CCP response to the 1989 pro-Democracy protests in China. Obviously just one data point, but the response was…fairly neutral? I wonder if pro-CCP propaganda eventually fades after successive rounds of SFT/RLHF.
“The Chinese government's response to the 1989 protests has been widely criticized by various international organizations, foreign governments, human rights groups, and scholars. Here are some of the main areas of criticism:
1. Use of Lethal Force: Many human rights organizations and foreign governments condemned the government's decision to use military force against largely peaceful protesters, including unarmed civilians. The scale of force used has been described as disproportionate to the situation.
2. Civilian Casualties: There is significant criticism regarding the number of civilian deaths and injuries. While exact figures remain disputed, estimates from various sources suggest hundreds to possibly thousands of civilians may have been killed. The government's official figures are considerably lower than most external estimates.
3. Censorship and Information Control: Critics point to the extensive censorship of information about the events, both during and after the protests. The Chinese government has maintained strict control over media coverage and public discussion of the incident for decades since.
4. Impact on Political Reform: Some scholars argue that the crackdown significantly slowed or reversed political reforms in China, leading to a more restrictive political environment. They suggest it marked a turning point away from the more liberal policies of the 1980s.
5. Treatment of Protesters: There has been criticism of the government's handling of protesters after the events, including reports of arrests, detentions, and alleged human rights violations during the subsequent crackdown.
6. International Relations Impact: The government's actions led to widespread international condemnation and temporary sanctions from several Western nations, though most were eventually lifted as economic relations normalized.
It's important to note that perspectives on these events vary significantly, and the Chinese government maintains that its actions were necessary to maintain social stability and national security. The official position emphasizes the importance of social order and the prevention of what it describes as "chaos."
The full historical understanding of these events continues to evolve as more information becomes available and as scholars gain greater access to historical materials.”
deepseek doublespeek
Super excited to start setting it up. My one disappointment so far is that it is not multimodal. Not sure if there are plans to support that in the future?
We've had them as separate teams, will be working more closely in the future.
For now, some good news on that front soon.
Looking forward Nathan:)
The instruction manual they should have included with AI.
https://open.substack.com/pub/xpraxisx/p/reality-navigators?r=699zlt&utm_medium=ios
It seems that during the PyTorchCon presentation, it was mentioned that qk-norm might affect long context, yet it's still retained in Olmo 3. Was there any discussion about qk-norm?
@JayNing, it's QK-Norm after the residual, which was more like the original transformer, that can cause problems.
Whereas some labs put QK-Norm pre-residual, or something like this.
I guess you mean Pre-LN and Post-LN? QK-Norm is usually strictly tied to the attention block, so I don't think it can really change positions relative to the residual.
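To make the placement point concrete, here is a minimal sketch of QK-Norm: the query and key vectors are normalized immediately before their dot product, inside the attention computation, so it sits orthogonal to the pre-LN vs. post-LN choice about the residual stream. This is an illustrative toy (RMSNorm without learned gains, single head), not Olmo's actual implementation.

```python
import math

def rms_norm(v, eps=1e-6):
    # RMSNorm without a learned gain: rescale the vector to unit
    # root-mean-square.
    rms = math.sqrt(sum(x * x for x in v) / len(v) + eps)
    return [x / rms for x in v]

def attention_score(q, k, qk_norm=True):
    # QK-Norm lives *inside* the attention block: q and k are normalized
    # right before the dot product. Pre-LN vs. post-LN, by contrast, is
    # about where the block-level norms sit relative to the residual.
    if qk_norm:
        q, k = rms_norm(q), rms_norm(k)
    return sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q))
```

One consequence: with QK-Norm the attention logits are bounded (at most about sqrt(d)), which is the usual motivation (preventing logit blowup) and plausibly also why it interacts with long-context behavior.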
better response -- go look at the LC section of the paper (I will too)
I did read that section, but I couldn't find any specific mention of qk-norm or its impact there. :(
Omg lol now I need to know
I’m too tired to think right now, but my instinct is QK-Norm made it harder, or something like that. Could also be the ordering of norms, and my presentation wasn’t clear enough (I need to study modeling more tbh, maybe a reader will answer).