Discussion about this post

Jerry Ma

This is *FINALLY* answering the question everyone has been asking after reading this year's new model releases: why these hybrid attention layers? And apparently, more expressiveness is the answer?

I am more than excited to hear about the post-training challenges for MoE models: if we can run inference with sharding, couldn't we do the same for post-training as well?

Thank you for reading the article aloud rather than using text-to-speech. This has been more than refreshing.

