This is *FINALLY* answering the question on everyone's mind after reading this year's new model releases: why these hybrid attention layers? And apparently, more expressiveness is the answer?
I am more than excited to hear about the post-training challenges for MoE models: if we can run inference with sharding, couldn't we do the same for post-training as well?
Thank you for reading the article aloud rather than using text-to-speech. This has been more than refreshing.
my voiceovers aren't perfect but the humanity is nice, and they help me proofread 🫡
The tooling gap is the most underappreciated bottleneck here. It's ironic: hybrid architectures promise compute efficiency gains, but right now the OSS inference stack eats those gains alive with workarounds like `--enforce-eager` and the FP32 cache. The 2x pretraining efficiency is compelling, but the real unlock will come when vLLM and other frameworks have first-class GDN support with proper kernel optimization. Until then, the practical cost story for production deployments is murky at best. I'm curious whether the post-training challenges (weaker reasoning performance vs. dense) are fundamentally architectural or just a data/recipe mismatch that better teacher selection could fix.
"Why does expressive power help with efficiency? This is where things are more nuanced. We argue that more expressive models will have better scaling laws, following the quantization model of neural scaling."
A fairly interesting paper you've linked. It's hard to tease apart how these decoder models can be at once endlessly expressive and at once not expressive in the ways we'd like.