6 Comments
Dec 29, 2022 · Liked by Nathan Lambert

MS: the 'emergent behavior' raises the question of how to get there more efficiently. Certainly orders of magnitude fewer calculations (the graphs show 10^22-10^24) would be a target.

Biological systems seem to perform this many calculations, yet learn from only a few examples. Perhaps the ML equivalent is exquisitely tuning the WX layer coefficients and the activation function. Or this could be a red herring, in that any single layer is a blunt function and the buildup of layers fine-tunes the overall function. And then there is the counterexample of reducing coefficient bit widths to improve calculation throughput while not reducing model performance, implying that no single layer is very important in the tuning process.
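
A minimal sketch of that counterexample, assuming a plain dense layer and uniform weight quantization (the layer, data, and bit width here are illustrative stand-ins, not anything from the post):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dense layer: y = relu(W @ x + b)
W = rng.normal(0, 0.1, size=(64, 128))
b = rng.normal(0, 0.1, size=64)
x = rng.normal(size=128)

def relu(z):
    return np.maximum(z, 0.0)

def quantize(w, bits=8):
    """Round weights onto a uniform grid with the given bit width."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

y_full = relu(W @ x + b)
y_q8 = relu(quantize(W, bits=8) @ x + b)

# The output barely moves despite far coarser coefficients.
print(np.abs(y_full - y_q8).max())
```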

Will more scaling of language models lead to further emergent abilities? Why does scaling unlock emergent abilities? Are there ways to unlock such abilities other than scaling?

Author

People don't really know why, but the expectation is (a) yes, scaling will continue to unlock new abilities, and (b) no, no one has figured out hacks to bypass it.

The great mystery of deep learning right now.

Dec 29, 2022 · Liked by Nathan Lambert

Hypothesis: each layer splits the N-dimensional representation of the ML mapping function more precisely. Whatever mappings exist in the data are amplified with more and more data, because the 'average' values in the data mean the model layers (scaling matrix and activation function) end up better centered on the underlying decision boundaries in the data.
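
A minimal illustration of that splitting, assuming the standard layer form h = relu(Wx + b): each row of W defines a hyperplane, so the layer carves the input space into linear regions whose boundaries training shifts around (a toy 2-D example, all values made up):

```python
import numpy as np

rng = np.random.default_rng(1)

# One layer: each of the 4 ReLU units defines a hyperplane in 2-D.
W = rng.normal(size=(4, 2))
b = rng.normal(size=4)

# The sign pattern of (W @ x + b) says which side of every hyperplane
# a point x falls on, i.e., which linear region it lands in.
points = rng.normal(size=(1000, 2))
patterns = (points @ W.T + b) > 0

regions = {tuple(row) for row in patterns}
print(f"{len(regions)} linear regions induced by 4 ReLU units")
```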

Discussion: overtraining on a small dataset makes the function fit that data tightly, but the results are not generalizable. Hence dropout layers force imprecision into the model, allowing for more averaging in the model coefficients and somewhat faster convergence (as I seem to remember). For large datasets you are still doing that tight fitting, but at the same time averaging over the wider data to smooth out the tight-fitting tendency.
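
A minimal sketch of that forced imprecision, assuming the standard inverted-dropout convention (the keep probability and array sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

def dropout(h, p_drop=0.5, training=True):
    """Inverted dropout: zero each unit with prob p_drop and rescale
    the rest so the expected activation is unchanged."""
    if not training:
        return h
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)

h = rng.normal(size=8)
print(dropout(h))                   # a different noisy subnetwork each call
print(dropout(h, training=False))   # identity at test time
```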

I suppose deep learning of the MNIST character boundaries is a poor judge of what a 'digit' is; perhaps there is a better (Platonic) ideal digit representation that would make more sense for digit recognition. The large dataset is in some ways compensation for the poor representation, where we eventually map out conserved locations in the images. Hmm, there must be hotspots in the layers that identify with particular digits.
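
A hedged sketch of how one might look for those hotspots, assuming access to a trained network's hidden activations (the arrays below are random stand-ins for real activations and labels):

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-ins: hidden-layer activations for 1000 MNIST images (256 units)
# plus their digit labels; in practice these come from a trained model.
acts = rng.random(size=(1000, 256))
labels = rng.integers(0, 10, size=1000)

# Mean activation of each unit per digit class.
class_means = np.stack([acts[labels == d].mean(axis=0) for d in range(10)])

# A 'hotspot' unit fires much more for one digit than it does on average.
selectivity = class_means.max(axis=0) - class_means.mean(axis=0)
print("most digit-selective units:", np.argsort(selectivity)[-5:])
```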

But humans look at the numeral 3 and know that slight modifications of width, slope, rotation, and other distortions don't change it from being a 3.

Nathan's comment is more cogent.

Author
Dec 29, 2022 · edited Dec 29, 2022

I don't have much technical explanation to add. I follow, mostly. I'm just excited for when educators / communicators can make a curriculum around what you're trying to explain.

At Berkeley, we were brainstorming a course on "practical fundamentals of deep learning" that never came to be. The ideas on crucial intuitions were so fun.

Dec 29, 2022 · Liked by Nathan Lambert

Saw this about a local congressman yesterday: https://www.washingtonpost.com/dc-md-va/2022/12/28/beyer-student-artificial-intelligence-degree/
