MS: The ‘emergent behavior’ raises the question of how to get there more efficiently. Orders of magnitude fewer calculations than the 10^22-10^24 shown in the graphs would certainly be a target.
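To get a feel for what 10^22-10^24 calculations means, here is a rough back-of-the-envelope sketch. It assumes the commonly used approximation that training costs about 6 floating-point operations per parameter per training token; that rule of thumb, the function names, and the 10-billion-parameter model size are all assumptions for illustration and are not taken from the graphs.

```python
def training_flops(n_params, n_tokens):
    """Rough training compute, ASSUMING ~6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens


def tokens_for_budget(flops, n_params):
    """Invert the same assumed rule: tokens affordable within a FLOP budget."""
    return flops / (6 * n_params)


n_params = 10**10  # hypothetical 10-billion-parameter model

for budget in (10**22, 10**23, 10**24):
    tokens = tokens_for_budget(budget, n_params)
    print(f"budget {budget:.0e} FLOPs -> about {tokens:.2e} training tokens")
```

Under this assumption, cutting the compute target by an order of magnitude at fixed model size means training on an order of magnitude fewer tokens, which is one way to read "learning from fewer examples."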

Biological systems seem to do this many calculations, but need only a few examples to learn. Perhaps the ML equivalent would be exquisitely tuning the WX layer coefficients and the activation function. Or this could be a red herring, in that any specific layer is a blunt function and it is the buildup of layers that fine-tunes the overall function. And then there is the counterexample of reducing coefficient widths to improve computational efficiency without reducing model performance, which implies that no single layer is very important in the tuning process.
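The coefficient-width point can be illustrated with a toy sketch: one WX unit with a tanh activation, evaluated with full-precision weights and again with the same weights rounded to 8-bit and 4-bit levels. Everything here is an assumption for illustration (random weights, a single 64-input unit, symmetric uniform rounding, the names `quantize` and `layer`); it is not how any particular model or library reduces widths.

```python
import math
import random


def quantize(weights, bits):
    """Symmetric uniform quantization onto 2**bits - 1 evenly spaced levels."""
    levels = 2 ** (bits - 1) - 1                    # e.g. 127 for 8 bits
    scale = max(abs(w) for w in weights) / levels   # step between levels
    return [round(w / scale) * scale for w in weights]


def layer(weights, x):
    """One 'WX' unit followed by a tanh activation."""
    return math.tanh(sum(w * xi for w, xi in zip(weights, x)))


random.seed(0)  # fixed seed so the toy data is reproducible
weights = [random.uniform(-1, 1) for _ in range(64)]
x = [random.uniform(-1, 1) for _ in range(64)]

full = layer(weights, x)
errors = {}
for bits in (8, 4):
    errors[bits] = abs(full - layer(quantize(weights, bits), x))
    print(f"{bits}-bit weights: output differs from full precision by {errors[bits]:.6f}")
```

Each weight moves by at most half a quantization step, so the unit's output shifts only slightly at 8 bits. That fits the "blunt function" reading above, though a single toy unit says nothing about how errors accumulate across a deep stack.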

Will more scaling of language models lead to further emergent abilities? Why does scaling unlock emergent abilities? Are there ways to unlock such abilities other than scaling?
