wow. for 7B, the thinking model is pretty good. the foundation models are amazing for general tasks, but if you can fine-tune a 7B beast like this to your domain/problem, I'll always err on the side of the smaller open models for repeatable results.
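(For anyone wondering what "fine-tune a 7B beast to your domain" looks like in practice, here is a minimal LoRA sketch using Hugging Face transformers, datasets, and peft. The model ID, data file, and target module names are placeholders, not real Olmo 3 artifact names; treat it as a shape-of-the-recipe illustration, not anyone's actual training setup.)

```python
# Minimal LoRA fine-tuning sketch for a ~7B causal LM. All names below
# are placeholders (assumptions): swap in a real Olmo 3 7B checkpoint ID,
# your own JSONL data, and the projection names for that architecture.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL_ID = "your-org/your-7b-base"  # placeholder model ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # some tokenizers ship without one
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

# Low-rank adapters on the attention projections: cheap to train, easy to revert.
# target_modules names vary by architecture; "q_proj"/"v_proj" is an assumption.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))

# domain_data.jsonl is a placeholder: one {"text": ...} record per line.
dataset = load_dataset("json", data_files="domain_data.jsonl")["train"]
dataset = dataset.map(lambda b: tokenizer(b["text"], truncation=True, max_length=2048),
                      remove_columns=["text"])

Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain-lora", per_device_train_batch_size=1,
                           gradient_accumulation_steps=16, num_train_epochs=1,
                           learning_rate=1e-4, bf16=True, logging_steps=10),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # pads + labels
).train()
```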
Turns out, there’s someone who makes one of the best open-source models in the US.
Amazing write-up and amazing release!
Awesome. Great work! Was SuperBPE too risky to try for the tokenizer? It seems like a great efficiency win for both training and inference.
uhHHhHHh email Luca S or Alisa Liu
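(Context for readers: SuperBPE extends BPE with "superword" tokens that cross whitespace boundaries, so the same text encodes to fewer tokens, which is where the claimed training/inference efficiency comes from. A quick way to check the win on your own data is to compare token counts; both tokenizer IDs below are placeholders, not real Hub names.)

```python
# Compare token counts for the same text under two tokenizers.
# Both IDs below are placeholders (assumptions): substitute a standard
# BPE tokenizer and a SuperBPE tokenizer from the Hugging Face Hub.
from transformers import AutoTokenizer

text = "The quick brown fox jumps over the lazy dog. " * 100

baseline = AutoTokenizer.from_pretrained("some-org/standard-bpe")        # placeholder
superbpe = AutoTokenizer.from_pretrained("some-org/superbpe-tokenizer")  # placeholder

n_base = len(baseline(text)["input_ids"])
n_super = len(superbpe(text)["input_ids"])

# Fewer tokens per document = fewer transformer steps per unit of text,
# which is where the training and inference savings come from.
print(f"baseline: {n_base} tokens | SuperBPE: {n_super} tokens "
      f"({100 * (1 - n_super / n_base):.1f}% fewer)")
```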
It seems that during the PyTorchCon presentation, it was mentioned that qk-norm might affect long context, yet it's still retained in Olmo 3. Was there any discussion about qk-norm?
better response: go look at the long-context (LC) section of the paper (I will too)
I did read that section, but I couldn't find any specific mention of qk-norm or its impact there. :(
Omg lol now I need to know
I’m too tired to think right now, but my instinct is qk-norm made long context harder, or something like that. Could also be the ordering of norms and my presentation wasn’t clear enough (I need to study modeling more tbh, maybe a reader will answer)
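(For readers following along: qk-norm normalizes the query and key vectors per head before the attention dot product, which bounds the attention logits and tends to stabilize training; the open question above is how that interacts with long context. A minimal PyTorch sketch of the mechanism, not Olmo 3's actual attention code:)

```python
# Minimal qk-norm sketch: RMSNorm applied per head to queries and keys
# before the attention dot product. Illustration only, not Olmo 3's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

def attention_with_qk_norm(q, k, v, q_norm, k_norm):
    # q, k, v: (batch, heads, seq, head_dim)
    q, k = q_norm(q), k_norm(k)  # the qk-norm step: keeps q @ k^T logits bounded
    return F.scaled_dot_product_attention(q, k, v)

B, H, T, D = 2, 8, 16, 64
q, k, v = (torch.randn(B, H, T, D) for _ in range(3))
print(attention_with_qk_norm(q, k, v, RMSNorm(D), RMSNorm(D)).shape)
# torch.Size([2, 8, 16, 64])
```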
Nathan,
My sincerest congratulations to you and the whole Olmo 3 team! I’m incredibly excited to dive in and kick the tires on these new models. And perhaps more eager to read your lessons learned and updated model training recipes. I’m certain all will provide great value to the global open source AI community.
This fall has been a long-needed renaissance for American open source AI, with the IBM Granite, Nvidia Nemotron, and now Olmo 3 model releases. And ICYMI, yesterday Deep Cogito released an absolute powerhouse v2.1 of their open-source 671B MoE model. I’ve been running IMO-style Olympiad math problems through it, and the results have been exceptional.
Congrats again on Olmo 3. The future is indeed bright for American open source!
Isn't Deep Cogito built on DeepSeek?
It is. They asked me for feedback on their release before launching and I suggested they change their explanations. Alas.
Base models are a crucial part of the stack that cannot be ignored.
honestly their explanations seem a bit obfuscatory, if not outright deceptive. Also, given the political environment in DC, good luck to any firm that uses a PRC model, no matter how forked, to write code or perform other services it wants to sell to companies in regulated industries or ones that have US government contracts. Cheap now, perhaps; not so cheap when they all have to be ripped out. And I will bet decent $$s that at some point bodies like the House Select Committee on China are going to make noise about this... maybe I should start a Polymarket...
Agreed. I'm legitimately excited about people using our base models for real things. To date, Olmo has mostly been just a research artifact.
I'm long overdue for a DC trip...
Yes, Deep Cogito (DC) is a fork of DeepSeek. DC layers its own training on top of the DeepSeek base using Iterated Distillation & Amplification (IDA) and iterative policy improvement.
The idea is to internalize reasoning and iteratively shorten chains of thought. They put the high-level details here:
https://huggingface.co/blog/deepcogito/cogito-v2-1
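(Per that post, the loop alternates an "amplification" pass, where the model spends extra inference-time compute to produce a better answer than it would by default, with a "distillation" pass that trains the policy to reach that answer more directly. A schematic sketch of the loop's shape, with the model calls stubbed out; this is an illustration of the idea, not Deep Cogito's code:)

```python
# Schematic sketch of Iterated Distillation & Amplification (IDA).
# The model calls are stubs; this shows the shape of the loop only,
# not Deep Cogito's implementation.
from dataclasses import dataclass

@dataclass
class Policy:
    version: int = 0

def amplify(policy: Policy, prompt: str) -> str:
    """Amplification: spend extra inference-time compute (longer chain of
    thought, search, self-critique) to beat the policy's default answer."""
    return f"[long reasoning trace from policy v{policy.version}] final answer"

def distill(policy: Policy, examples: list[tuple[str, str]]) -> Policy:
    """Distillation: train the policy to produce the amplified answers
    directly with a shorter trace, internalizing the reasoning.
    Stubbed here as a version bump."""
    return Policy(version=policy.version + 1)

policy = Policy()
prompts = ["prove that sqrt(2) is irrational", "plan a three-step proof"]

for _ in range(3):  # each round: amplify, then distill the results back in
    examples = [(p, amplify(policy, p)) for p in prompts]
    policy = distill(policy, examples)

print(policy)  # Policy(version=3)
```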
so just to be clear, very far from an American model
Fair. A philosophical question (which I don’t know the answer to) is at what point a fork evolves and matures to the point where the base ceases to matter in a meaningful way. As in, DeepSeek was the bootstrap to an iterative process in which the final version retains the form of the original while having completely different weights, parameters, and reasoning chains.
A technical point that both of you will appreciate is that base models actually have a lot of “post-training data” in them, so much of the communist values etc. talk is already in the model at that stage. DeepSeek really, really likes to talk about socialist values.
OK, thanks for the rabbit hole Nathan! Based on your comment I asked DC about criticisms of the CCP response to the 1989 pro-Democracy protests in China. Obviously just one data point, but the response was…fairly neutral? I wonder if pro-CCP propaganda eventually fades after successive rounds of SFT/RLHF.
“The Chinese government's response to the 1989 protests has been widely criticized by various international organizations, foreign governments, human rights groups, and scholars. Here are some of the main areas of criticism:
1. Use of Lethal Force: Many human rights organizations and foreign governments condemned the government's decision to use military force against largely peaceful protesters, including unarmed civilians. The scale of force used has been described as disproportionate to the situation.
2. Civilian Casualties: There is significant criticism regarding the number of civilian deaths and injuries. While exact figures remain disputed, estimates from various sources suggest hundreds to possibly thousands of civilians may have been killed. The government's official figures are considerably lower than most external estimates.
3. Censorship and Information Control: Critics point to the extensive censorship of information about the events, both during and after the protests. The Chinese government has maintained strict control over media coverage and public discussion of the incident for decades since.
4. Impact on Political Reform: Some scholars argue that the crackdown significantly slowed or reversed political reforms in China, leading to a more restrictive political environment. They suggest it marked a turning point away from the more liberal policies of the 1980s.
5. Treatment of Protesters: There has been criticism of the government's handling of protesters after the events, including reports of arrests, detentions, and alleged human rights violations during the subsequent crackdown.
6. International Relations Impact: The government's actions led to widespread international condemnation and temporary sanctions from several Western nations, though most were eventually lifted as economic relations normalized.
It's important to note that perspectives on these events vary significantly, and the Chinese government maintains that its actions were necessary to maintain social stability and national security. The official position emphasizes the importance of social order and the prevention of what it describes as "chaos."
The full historical understanding of these events continues to evolve as more information becomes available and as scholars gain greater access to historical materials.”