This is *FINALLY* answering the question everyone has been asking after reading this year's new model releases: why these hybrid attention layers? And apparently, more expressiveness is the answer?
I am more than excited to hear about the post-training challenges for MoE models: if we can run inference with sharding, couldn't we do the same with post-training as well?
Thank you for reading the article yourself rather than using text-to-speech. This has been more than refreshing.
My voice-overs aren't perfect, but the humanity is nice, and they help me proofread 🫡
Thank you for your time and attention concerning this recent discovery.
https://substack.com/@sublius/note/c-224332811?r=724p51&utm_medium=ios&utm_source=notes-share-action
"Why does expressive power help with efficiency? This is where things are more nuanced. We argue that more expressive models will have better scaling laws, following the quantization model of neural scaling."
A fairly interesting paper you've linked. It's hard to reconcile how these decoder models are at once endlessly expressive and yet not expressive in the ways we'd like.