The latest open artifacts (#9): RLHF book draft, where the open reasoning race is going, and unsung heroes of open LM work
Artifacts Log 9.
A few weeks out from Llama 4’s bumpy release, the open community seems more focused on rumors of Qwen 3 coming soon than on racing to build on these models, mostly due to their size, if not the confused launch. Next week is LlamaCon, so we should expect more models soon.
This lull is a good time to ask: where are all these open reasoning models going? It’s no longer that exciting to have yet another model hillclimbing on AIME or yet another DeepSeek R1 distill. Some models are breaking into coding and agentic tasks, but it feels like the systems built to use these models will be more impactful than the individual datasets. This is only reinforced by how different and expectation-shifting the release of o3, with native tool use, felt.
We’ll continue to cover these developments and include more project links, not just HuggingFace links, if that’s how open artifacts are being used. This aligns with our coverage of how open models are crossing important capability thresholds.
Reading this just reiterates how the few biggest and best open releases are still the most impactful. It’s hard to break through and become an industry standard à la R1, Llama 3, or Qwen 2.5.
Also, the first draft of the RLHF Book I (Nathan) have been working on is done (web version, arXiv version). It covers a lot more than just RLHF, such as AI feedback, Constitutional AI, an overview of distillation, and related open questions. It’s designed to be enjoyable and useful rather than complete. Sometime later this year you can expect an email announcing that pre-orders are open for physical copies, so stay tuned.
In this issue, we’ve been impressed by Nvidia’s recent contributions to the open ecosystem, have seen lots of RAG/embedding work that is essential to enterprise applications of open models, and are unsurprised by the number of Chinese companies releasing amazing models (it’s the norm now).
Our Picks
Llama-3_1-Nemotron-Ultra-253B-v1 by nvidia: A reasoning model building upon a pruned version of Llama 3.1. After pruning, they do multiple rounds of post-training: SFT first (using data from Llama, Qwen, QwQ, and R1), followed by multiple rounds of RL, both for capabilities and for alignment. Aside from the 253B variant, they also release a 49B [from Llama 70B] and an 8B model.
Kimi-VL-A3B-Instruct by moonshotai: A vision MoE from Moonshot AI / Kimi, released under the MIT license. The performance of the model looks really solid, on par with Qwen2.5-VL-7B while using half the number of activated parameters (but twice the total parameters). They also release a thinking version, boosting the scores even further. In their technical report, they state that they will train bigger models on more data. Given that Moonshot AI is one of the superstars in the Chinese LLM space, this team and model series is one to keep an eye on. It also continues the trend we have observed: the big Chinese labs and companies continue to release very capable models under permissive OSS licenses, usually MIT or Apache 2.0.
Nemotron-H-56B-Base-8K by nvidia: A hybrid transformer-mamba model trained on a whopping 20T tokens. The report goes into greater detail. The models perform on the same level as an attention-only variant, while being more than twice as fast. The long-context benchmarks also look competitive, making this architecture a serious contender to replace models which use sliding window attention or similar variants. The released models, however, support only 8K context.
GLM-Z1-Rumination-32B-0414 by THUDM: A model trained for (deep) research. It comes from the team behind GLM and CogView, which has renamed itself Z AI, and can be tried on their website. This particular model is trained to search and click through websites with multiple function calls during its reasoning phase.
mrcr by openai: OpenAI had an eventful week with the release of the GPT-4.1 series (see our coverage here) and o3 / o4-mini (which we covered as well). This, however, isn't everything they dropped. They also released an open source competitor to claude-code, named codex (not to be confused with the coding model of the same name) and two long-context benchmarks: MRCR, an open replication of the Google benchmark, and GraphWalks.
Links
Our friends at General Reasoning posted a blog post on what it will take to scale RL compute — discussing the fundamental research questions of today, such as:
What is the role of priors like base models, RL hyperparameters, and cold-start data?
What is the role of parallel versus sequential generation? (See the toy sketch after this list.)
How do we scale RL compute in domains where solutions are harder to verify?
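To make the parallel-versus-sequential question concrete, here is a toy sketch contrasting best-of-n sampling with iterative revision. `generate` and `score` are placeholders standing in for an LLM sampler and a verifier, not any real API.

```python
import random

def generate(prompt: str, seed: int) -> str:
    # Placeholder for an LLM sampler; returns a fake "completion".
    rng = random.Random(seed)
    return f"{prompt} -> draft#{rng.randint(0, 999)}"

def score(answer: str) -> float:
    # Placeholder for a verifier or reward model.
    return random.Random(answer).random()

def parallel_best_of_n(prompt: str, n: int = 8) -> str:
    # Parallel: spend compute on n independent samples, keep the best.
    candidates = [generate(prompt, seed=i) for i in range(n)]
    return max(candidates, key=score)

def sequential_revision(prompt: str, steps: int = 8) -> str:
    # Sequential: spend the same compute revising one chain step by step.
    answer = generate(prompt, seed=0)
    for i in range(1, steps):
        answer = generate(f"{prompt} | revise: {answer}", seed=i)
    return answer
```

The open question is which allocation of compute wins as budgets grow, especially once verification is imperfect.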
A giant LLM pricing tool that lets you compare model prices across pretty much every provider.
Dan Shipper did a review with some creative usage examples of o3 (also linked in our o3 post).
This story on the Alan Turing Institute, How not to build an AI Institute, is a great counterpoint to all the success we hear about in AI these days.
One of our favorite thinkers in the AI policy space started a blog. She has a few initial posts on shrinking timelines, the changing meaning of alignment, and nonproliferation. As long as the posting schedule is consistent, this is on a fast track to the Interconnects Recommendations.
A good podcast on the cases that’ll define the AI copyright rules over the next few years. This is an important issue, but it’s normally ignored in between major court rulings.
An interesting post on The Alignment Forum discussed a bunch of negative results for SAEs used in mechanistic interpretability. TLDR:
To validate whether SAEs were a worthwhile technique, we explored whether they were useful on the downstream task of OOD generalisation when detecting harmful intent in user prompts
Negative result: SAEs underperformed linear probes.
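For context, a linear probe is just a linear classifier fit directly on frozen model activations. A minimal sketch with scikit-learn, where the random arrays are placeholders for hidden states extracted from the LM, one vector per prompt:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 4096))  # activations for labeled prompts
y_train = rng.integers(0, 2, size=1000)  # 1 = harmful intent, 0 = benign
X_test = rng.normal(size=(200, 4096))    # held-out (ideally OOD) prompts

# The "linear probe": logistic regression fit directly on the activations.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
predictions = probe.predict(X_test)
```

The surprising part of the post is that this simple baseline held up against SAE-based features on the OOD detection task.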
SpeechMap.ai measures the refusal rates of different LLMs on a wide range of hotly debated topics, such as government, religion, or ethnicity.
Anthropic has some tutorials for Claude Code.
Reasoning
Models
openhands-lm-32b-v0.1 by all-hands: An RL-tuned version of Qwen2.5 Coder for agentic tasks that edit codebases. The data was generated with the model itself: successfully applied edits from one iteration are used to train the next (a sketch of this loop follows this list).
Skywork-OR1-Math-7B by Skywork: A reasoning model trained in multiple stages with increasing context length, which improves performance compared to keeping the context length constant. The blog post goes into greater detail, including their entropy experiments on the actor model.
cogito-v1-preview-llama-70B by deepcogito: A series of models with optional reasoning capabilities which were trained by generating the training data with the same model, i.e., the 70B model is trained with data from the 70B model.
DeepCoder-14B-Preview by agentica-org: An RL-trained version of the R1-distilled Qwen2.5 14B. They use the insights from DAPO (which we covered here) for an improved version of GRPO; see the sketch after this list.
Kimina-Prover-Preview-Distill-7B by AI-MO: A model from Moonshot AI specialized in proving math problems by formalizing them in Lean. They also release the Autoformalizer, which converts natural language into Lean 4 code; this is helpful for systems like AlphaProof (a toy Lean example follows this list).
ZR1-1.5B by Zyphra: A small coding model, building upon the R1-distilled version of Qwen2.5 1.5B by training it with PRIME.
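To make a few of the methods above concrete, here are some hedged sketches. First, the openhands-lm self-improvement loop as we understand it: roll out the agent, keep the trajectories whose edits succeed, and fine-tune the next iteration on them. `run_agent` and `finetune` are placeholders, not the OpenHands API.

```python
def run_agent(model: str, task: str) -> dict:
    # Placeholder rollout of the coding agent; pretend even tasks succeed.
    return {"task": task, "patch": "...", "success": int(task.split("-")[1]) % 2 == 0}

def finetune(model: str, examples: list[dict]) -> str:
    # Placeholder SFT step on the successful trajectories.
    return f"{model}+sft({len(examples)})"

model = "qwen2.5-coder-32b"
tasks = [f"task-{i}" for i in range(100)]
for _ in range(3):  # iterate: each model bootstraps the data for the next
    rollouts = [run_agent(model, t) for t in tasks]
    successes = [r for r in rollouts if r["success"]]
    model = finetune(model, successes)
```

Next, the core of the GRPO variant behind DeepCoder: advantages computed from group statistics rather than a learned value baseline, plus DAPO’s dynamic-sampling idea of dropping uninformative groups. This is our own illustration, not the DeepCoder code.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    # GRPO's core idea: normalize each completion's reward by the statistics
    # of its own sampling group instead of a learned value baseline.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

# DAPO's dynamic sampling: groups where every sample gets the same reward
# carry zero advantage and thus no gradient, so they are filtered out.
group = np.array([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
if group.std() > 0:
    print(grpo_advantages(group))
```

Finally, a feel for what the Kimina Autoformalizer targets: rendering a natural-language claim as a Lean 4 theorem. This is our own toy example (assuming Mathlib-style tactics like obtain and omega are available), not model output.

```lean
-- Toy autoformalization target: the natural-language claim
-- "the sum of two even numbers is even" rendered as a Lean 4 theorem.
theorem even_add_even (a b : Nat) (ha : ∃ k, a = 2 * k) (hb : ∃ k, b = 2 * k) :
    ∃ k, a + b = 2 * k := by
  obtain ⟨m, hm⟩ := ha
  obtain ⟨n, hn⟩ := hb
  exact ⟨m + n, by omega⟩
```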
Datasets
reasoning-v1-20m by glaiveai: A reasoning dataset generated by R1-Distill Llama 70B.
Multi-subject-RLVR by virtuoussy: A translation of the Chinese ExamQA into English, with each instance converted into a free-form QA pair.
OpenCodeReasoning by nvidia: 735K Python coding solutions with traces generated by R1.
Llama-Nemotron-Post-Training-Dataset by nvidia: The SFT and RL data to train the Llama-Nemotron models.
DeepMath-103K by zwhe99: A challenging math dataset with verifiable final answers.
Links
A nice nanoGPT-style library for training LMs with RL, McGill-NLP/nano-aha-moment, crossed our timelines. We haven’t tested it out yet, but these sorts of efforts are hugely impactful for access to controlled experimentation on small models.