I am curious regarding this "MoE already isn’t handled well by the TPU". What is the technical detail behind this? Is it because TPUs don't have high bandwidth interconnect or HBM?
Essentially it has to do with how information is passed between compute units during training. The TPU is a much simpler architecture than the GPU, specialized for the kinds of NN training that have been popular. GPUs have a software layer for turning their many, individually unimpressive cores into efficient NN training devices.
When it comes to MoE, how the chips are connected together matters. GPUs in a cluster are wired into a switched network where any device can talk to any other. TPUs, as part of their specific design, mostly just talk to their nearest neighbors. MoE is hard for this because routing tokens to experts needs roughly all-to-all communication: the expert layers can have so many parameters spread across chips that they don't all fall within the TPU's local communication range, dramatically reducing the speedup you would otherwise get. GPUs handle this in software, so it's not a downside, and modern GPU clusters handle the sparsity well, so MoE ends up being an effective training-time tool on NVIDIA hardware.
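To make the communication pattern concrete, here's a minimal JAX sketch of the dispatch step (my own toy illustration, not from the post), assuming one expert per device and tokens already bucketed by destination expert; the names and shapes are made up. The point is that every device exchanges a bucket with every other device in one step, which is the traffic pattern that favors an any-to-any fabric over nearest-neighbor links.

```python
import jax
import jax.numpy as jnp

# Hypothetical setup: one expert per device, tokens pre-bucketed by the
# expert that should process them. The all-to-all below is the MoE
# dispatch step: device d sends bucket i to device i and receives the
# buckets addressed to its own expert.
NUM_DEVICES = jax.local_device_count()
TOKENS_PER_BUCKET = 4
D_MODEL = 8

def dispatch_to_experts(buckets):
    # buckets: [num_devices, tokens_per_bucket, d_model] on each device.
    # Bucket i holds the tokens this device wants expert i to see.
    return jax.lax.all_to_all(
        buckets, axis_name="devices", split_axis=0, concat_axis=0, tiled=True
    )

# Global input: one row of buckets per device.
tokens = jnp.ones((NUM_DEVICES, NUM_DEVICES, TOKENS_PER_BUCKET, D_MODEL))
out = jax.pmap(dispatch_to_experts, axis_name="devices")(tokens)
print(out.shape)  # (NUM_DEVICES, NUM_DEVICES, TOKENS_PER_BUCKET, D_MODEL)
```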
I feel like I wrote this somewhere but couldn’t find it...
absolutely stonking great post recapping Mamba/StripedHyena et al with the right amount of detail for people to take away and hyping-but-not-overhyping. thanks for all the work!
Great explanation!