[Initial posting on Nov 29 2020]

[Updated on Nov 30 2020] added a section about the scaling law w.r.t. the model size, per request from Felix Hill.

[Updated on Dec 1 2020] added a paragraph referring to Dauphin & Bengio’s “Big Neural Networks Waste Capacity“.

{Update on Feb 8 2021] see “Learning Curve Theory” by Marcus Hutter for a better exposition of the scaling law and where it might be coming from.

this is a short post on why i **thought** (or more like imagined) the scaling laws from <scaling laws for autoregressive generative modeling> by Heninghan et al. “[is] inevitable from using log loss (the reducible part of KL(p||q))” when “the log loss [was used] with a max entropy model“, which was my response to Tim Dettmers’s tweet on “why people are not talking more about the OpenAI scaling law papers“. thanks to João Guilherme for brining it this to my attention. it’s given me a chance to run some fun thought experiments over the weekend, although most of, if not all of, them failed as usual with any ideas and experiments i have. anyhow, i thought i’d leave here why i thought so particularly from the perspective of dataset size.

- The scaling law for Bernoulli w.r.t. the dataset size
- The scaling law for Bernoulli w.r.t. the model size
- The scaling law for Bernoulli w.r.t. the compute amount
- Final thoughts

## The scaling law for Bernoulli w.r.t. the dataset size

instead of considering a grand neural autoregressive model, i’ll simply consider estimating the mean of a Bernoulli variable after $N$ trials, and compare the log loss at this point against the log loss computed after $N+\Delta$ trials. let’s start by writing down the loss value after $N$ trials:

$$

-L(N) = p^* \log \frac{N_1}{N} + (1-p^*) \log \frac{N-N_1}{N} =

p^* \log N_1 + (1-p^*) \log (N-N_1) – \log N,

$$

where $p^*$ is the true ratio of heads and $N_1 < N$ is the number of heads from the $N$ trials.

let’s now consider tossing the coin $\Delta$ more times. i will use $\Delta_1 < \Delta$ as the number of additional heads after these additional trials. what’s the loss after $N+\Delta$ trials?

$$

-L(N+\Delta) = p^* \log (N_1 + \Delta_1) + (1-p^*)(N+\Delta – N_1 – \Delta_1) – \log (N+\Delta).

$$

so far so good. now, what kind of relationship between these two quantities $L(N)$ and $L(N+\Delta)$ do i want to get? in my mind, one way to say there’s a power law like structure behind $L$ is to show that the amount of improvement i get by running $\Delta$ more trials decreases as the number of existing trials $N$ increase. that is, there’s diminishing return from a unit effort as more efforts have been put.*

then, let’s look at their difference by starting from the loss at $N+\Delta$, while assuming that $\Delta \ll N$ (and naturally $\Delta_1 \ll N_1$ as well) so that i can use $\log (1+x) \approx x$ when $x$ is small:

$$

\begin{align*}

-L(N+\Delta) =& p^* \log (N_1 + \Delta_1) + (1-p^*)\log(N+\Delta – N_1 – \Delta_1) – \log (N+\Delta)

\\

=&

p^* \log N_1 (1+ \frac{\Delta_1}{N_1}) + (1-p^*) \log(N-N_1)(1 + \frac{\Delta – \Delta_1}{N-N_1}) – \log N(1+ \frac{\Delta}{N})

\\

\approx

&

\underbrace{p^* \log N_1 + (1-p^*) \log (N-N_1) – \log N}_{=-L(N)} + p^* \frac{\Delta_1}{N_1} + (1-p^*)\frac{\Delta – \Delta_1}{N-N_1} – \frac{\Delta}{N}.

\end{align*}

$$

The decrease in the loss by running $\Delta$ more trials can now be written as

$$

L(N) – L (N+\Delta) = p^* \frac{\Delta_1}{N_1} + (1-p^*)\frac{\Delta – \Delta_1}{N-N_1} – \frac{\Delta}{N}.

$$

since $\Delta_1 < \Delta$ and $N_1 < N$, let’s rewrite them as $\Delta_1 = \beta \Delta$ and $N_1 = \alpha N$, where $\alpha \in [0,1]$ and $\beta \in [0,1]$. then,

$$

L(N) – L (N+\Delta) = p^* \frac{\beta \Delta}{\alpha N} + (1-p^*) \frac{(1-\beta)\Delta}{(1-\alpha)N} -\frac{\Delta}{N} = \frac{\Delta}{N} \left(p^* \frac{\beta}{\alpha} + (1-p^*)\frac{1-\beta}{1-\alpha} – 1\right)

$$

this says that the change from the loss at $N$ to the loss at $N+\Delta$ is inversely proportional to $N$ itself, which is what i wanted to see from the beginning. although there were a few leaps of faith along the way, but it looks like more tosses I have made (i.e, large $N$), the change i can make to my loss with a constant number of extra tosses diminishes.

the second (multiplicative) term is more complicated, and i find it easier to think of two extreme cases; $p^*=1$ and $p^*=0$. these cases are reasonable if we think of this exercise as a proxy to studying classification, where it’s often assumed that a given input either belongs to one (positive) or the other (negative) class in an ideal world. when $p^*=1$, the second term reduces to

$$

\frac{\beta}{\alpha} – 1~~

\begin{cases}

> 0, & \text{if } \beta > \alpha \\

< 0, & \text{if } \beta < \alpha \\

= 0, & \text{if } \beta = \alpha

\end{cases}

$$

in other words, if the extra tosses reflected the true distribution better ($\beta > \alpha$, because the true positive rate is $1$,) the loss dropped. otherwise, the loss increases ($\alpha > \beta$) or stays same (i.e., no additional information has been added.) the other extreme case of $p^* = 0$ works similarly.

what’s important is that this second term largely dictates the sign of how the loss changes with the extra $\Delta$ tosses. since we are considering only the ratios of the heads within sets of trials and (suddenly!) assume both $N$ and $\Delta$ are reasonably large, the magnitude of change is instead largely determined by the ratio between $\Delta$ and $N$, with $N$ in the denominator.

so, this is how i arrived at my shallow take on twitter that these scaling laws may not have too much to do with whether we use neural net parameterization or not, whether we are solving language modeling, machine translation, etc., nor whether we are working with text, image or both. “i think” it arises naturally from the maximum entropy formulation (you can think of estimating the log-frequency of the heads above with sigmoid/softmax to turn it into the Bernoulli distribution) and the log loss.

of course, because i had to make a number of leaps of faith (or to put it another way, a few unreasonable assumptions,) it’s possible that this actually doesn’t make much sense at the end of the day. furthermore, i’m super insecure about my math in general, and i’m about 99.9% sure there’s something wrong in the derivation above 😱. hence, why “i think” the scaling law arises from log loss (cross-entropy) and maximum entropy models.

## The scaling law for Bernoulli w.r.t. the model size

it’s important for me to point out at this point that Heninghan et al. did much more than what i’ve discussed in this post and provide a much more extensive set of very interesting findings. they looked not only at the effect of the data size, but also the compute budget $C$ and model size $|\theta|$. in fact, they focus much more on the latter two than the former which was my focus here.

in the case of the model size, it’s quite trivial to map it to the argument above i made regarding the number $N$ of observations. let’s consider the model size $|\theta|$ in this context of recovering Bernoulli as the number of bits (with an arbitrary basis, including $e$) allowed to represent $N$ and $N_1$ (and consequently, $\Delta$ and $\Delta_1$.) then, the maximum $N$ a model can count up to is $\exp(|\theta|)$, and by increasing the model size by $\delta$ (i.e., $|\theta|+\delta$,) we can toss the coin

$$

\exp(|\theta|) \exp(\delta) – \exp(|\theta|) = \exp(|\theta|) (\exp(\delta) – 1)

$$

more. in other words, increasing the size of the model, while assuming that we can run as many tosses as we can to saturate the model capacity, is equivalent to setting $\Delta$ above to $\exp(|\theta|) (\exp(\delta) – 1)$.

in this case, the first term in the change in the loss above reduces to

$$

\frac{\Delta}{N} = \frac{\exp(|\theta|) (\exp(\delta) – 1)}{\exp(|\theta|)} = \exp(\delta) – 1,

$$

which is weird, because the dependence on $N = \exp(|\theta|)$ disappeared. that is, the change in the loss w.r.t. the increase in the model size (the number of bits) is not dependent on the number of existing bits used by the model.

what is happening here? in my opinion, this implies that the # of parameters in a neural net, or increasing it, is **not** optimally done in terms of compression.

what if we instead assume that only a polynomial number of trials can be compressed, i.e., $N=|\theta|^c$? in particular, for the sake of simplicity, let’s assume $c=2$. in this case,

$$

\frac{\Delta}{N} = \frac{(|\theta|+\delta)^2}{|\theta|^2} = 2\frac{\delta}{|\theta|} + \left(\frac{\delta}{|\theta|}\right)^2,

$$

and voila! we recovered the dependence on the model size $|\theta|$, and this dependence is inverse proportional, as expected. by further assuming that $\delta \ll |\theta|$, we end up with

$$

\frac{\Delta}{N} \approx 2 \frac{\delta}{|\theta|}.

$$

so, what does it say about the observation by Henighan et al. that there is a scaling law w.r.t. the model size? i suspect that their observation is telling us that deep nets we use are far from optimal in the sense of compressing data. it could be due to the choice of architectures, due to our choice of learning algorithms or even due to regularization techniques we use. it’ll be interesting to pinpoint what’s behind this sub-optimality will be interesting.

as i was writing the last paragraph, i was reminded of this earlier workshop paper by Yann Dauphin & Yoshua Bengio from the workshop track of ICLR’13, titled “Big Neural Networks Waste Capacity.” in this work, they observed the “rapidly decreasing return on investment for capacity in big networks” and conjectured this is due to the “failure of first order gradient descent.” perhaps, Yann was onto something, although i don’t think he’s followed up on this.

## The scaling law for Bernoulli w.r.t. the compute amount

in the case of the compute budget, i have absolutely no idea, but i wonder if a similar argument as the model size could be made. the number of SGD steps largely dictates the maximum magnitude of the weights in a neural net. the resolution (?) of the computed probability is largely determined by the maximum magnitude of (or the variance of individual weights in) the final weight matrix (that feeds into the final softmax). perhaps we can connect these two to show that more SGD updates allow our neural net to more precisely identify the target probability. of course, this suggests that different optimization strategies may result in radically different scaling laws.

## Final thoughts

assuming what i wrote above makes even slightest bits of sense, this raises two interesting question, in my opinion. first, is all a sophisticated neural net does counting examples? the strict answer is no, because it both counts and compresses. it however looks as if it’s compression without any interesting emergent property (such as systematic generalization). second, how does this property change when we move away from the maximum entropy formulation and log-loss? i’ve pointed out two directions that look promising in a tweet earlier: margin ranking loss by Collobert & Weston and entmax series by Martins and co. if so, will it be the change in a desirable direction?

let me wrap up by thanking Henighan et al. and Kaplan&McCandlish et al. for thought-provoking pieces that have made me think of these models and problems i’ve been working with all along from a very different angle.

(*) of course the other (more positive) way to look at it is that there’s always more to be learned if we are ready to invest as much as we have invested already.