I’ve been learning about the neural tangent kernel and some things confused me.
In the NTK paper, the network layers have the form $\tilde{\alpha}^{(\ell+1)}(x) = \frac{1}{\sqrt{n_\ell}} W^{(\ell)} \alpha^{(\ell)}(x) + \beta b^{(\ell)}$, where $n_\ell$ is the number of neurons in layer $\ell$. Why the square root?
So I worked it out.
Setup
Input: $x \in {}^*\mathbb{R}_{\mathrm{lim}}$, where ${}^*\mathbb{R}_{\mathrm{lim}}$ means a limited (hyper)real number. Limited just means it’s not infinitely big.
Output: $y \in {}^*\mathbb{R}_{\mathrm{lim}}$.
The inputs and outputs are scalar, but this is just for simple exposition. The analysis works for vector inputs/outputs.
Layers: $A, B$ of shapes $(H, 1)$ and $(1, H)$, where $H$ is hyperfinite (an infinitely big natural number). Every entry in the matrices is limited.
Activation function: $\sigma(z) = \max(0, z)$, the relu.
$y = B\,\sigma(Ax)$. I left out biases here, but the analysis doesn’t depend on them anyway.
Diagram:
x: (1,)
    |
    V
+-------+
|       |
|   A   |
| (H,1) |
|       |
|       |
|       |
|       |
+-------+
    |
    | ReLU
    V
+-----------------+
|     B: (1,H)    |
+-----------------+
    |
    V
y: (1,)
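Obviously a hyperfinite $H$ can’t literally go on a computer, but here’s a minimal NumPy sketch of the same architecture with a big finite $H$ standing in for it (variable names are mine, and there’s no normalization yet):

```python
import numpy as np

H = 100_000  # big finite stand-in for the hyperfinite width

rng = np.random.default_rng(0)
A = rng.standard_normal((H, 1))  # shape (H, 1), entries of ordinary (limited) size
B = rng.standard_normal((1, H))  # shape (1, H), entries of ordinary (limited) size

def relu(z):
    return np.maximum(z, 0.0)

def net(x):
    # y = B @ relu(A @ x): no biases, no normalizing constants (yet)
    x = np.atleast_2d(x)  # make the scalar a (1, 1) array
    return (B @ relu(A @ x)).item()

print(net(1.0))  # grows without bound as H grows; the rest of the post is about taming this
```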
Now what?
Brute force. Let’s run through the net. First is multiplying $x$ by $A$ to get $Ax$. Name it $z = Ax$.
$z$ has shape $(H,)$ (using NumPy notation), a vector of unlimited length. This is the first sign that something may be up: each of its entries is limited
(because $x$ is of limited length with limited entries), but their norm ($\|z\|$) may not be, since there are an unlimited number of entries.
Example: Let $z$ be all 1s. Then its norm is $\sqrt{1^2 + 1^2 + \cdots + 1^2}$ with $H$-many terms, which equals $\sqrt{H}$, an unlimited number.
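A quick numerical check (finite $H$, so only suggestive) that $\|z\|$ really does grow like $\sqrt{H}$ even though every entry stays around size 1:

```python
import numpy as np

rng = np.random.default_rng(0)
for H in [1_000, 10_000, 100_000, 1_000_000]:
    A = rng.standard_normal((H, 1))   # limited entries
    z = A @ np.array([[1.0]])         # z = A x with x = 1
    ratio = np.linalg.norm(z) / np.sqrt(H)
    print(f"H={H:>9,d}  ||z||={np.linalg.norm(z):10.2f}  ||z||/sqrt(H)={ratio:.3f}")
    # the last column hovers near 1: the norm scales like sqrt(H)
```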
We want our network to have limited (aka assignable) values, so what to do? We can renormalize by inserting a constant $c$ in front of the matrix $A$ so its output is limited. But which constant?
$c = 1/\sqrt{H}$.
Proof
$\|z\|$ is less than or equal to the norm it would have if its maximum element (by absolute value) were repeated $H$-many times. Let $m = \max_i |z_i|$. This maximum exists because $H$ is hyperfinite, and so $\{|z_1|, \dots, |z_H|\}$ has a maximum element because finite sets do (by transfer).
In math: $\|z\| \le \sqrt{\sum_{i=1}^{H} m^2} = m\sqrt{H}$. The magnitude of the right hand side is $m\sqrt{H}$. The factor of $m$ is irrelevant since we only care about the output being limited, and dividing out the $\sqrt{H}$ will do it. Any value of a lower order would still diverge, and a higher order (like $H$ or something) would make the output infinitesimal. Only $\sqrt{H}$ is of the correct order.
By similar reasoning, for each layer, we divide by the square root of the number of elements in the layer matrix. The activation functions don’t matter too much, since the relu never increases a magnitude: $|\sigma(t)| \le |t|$.
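A finite-$H$ sanity check on the all-1s worst case from above: without normalization the hidden norm and the output grow like $\sqrt{H}$ and $H$; dividing each layer by the square root of its element count (here $\sqrt{H}$ both times) pins them both down.

```python
import numpy as np

x = 1.0
for H in [1_000, 10_000, 100_000]:
    A = np.ones((H, 1))   # the all-1s worst case
    B = np.ones((1, H))

    z = A * x                                  # hidden pre-activations, shape (H, 1)
    y = (B @ np.maximum(z, 0.0)).item()        # unnormalized output

    z_n = z / np.sqrt(H)                                   # normalize layer A's output
    y_n = (B @ np.maximum(z_n, 0.0)).item() / np.sqrt(H)   # normalize layer B's output

    print(H, np.linalg.norm(z), y, "|", np.linalg.norm(z_n), y_n)
    # unnormalized: ~sqrt(H) and ~H.  normalized: 1.0 and 1.0 for every H.
```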
Another way
This cannot be directly implemented but is mathematically equivalent. We stop worrying about the latent variables having unlimited norm and just let them be unlimited since algebraically, they act the same as any other number.
In that case, let’s consider $y = B\,\sigma(Ax)$ and assume all the entries are positive so the relu is trivial and can be ignored, leaving $y = BAx$.
The norm is bounded above by $\|B\|\,\|A\|\,\|x\|$. AKA $\|BAx\| \le \|B\|\,\|A\|\,\|x\|$, which (using the reasoning above) is of order $\sqrt{H} \cdot \sqrt{H} \cdot 1 = H$, where the $1$ is the order of $\|x\|$ (because it’s limited), which doesn’t really matter. This suggests a different but equivalent renormalization strategy: divide by the product of the square roots of the number of elements in each layer, $\sqrt{H} \cdot \sqrt{H} = H$, but not at each step. Only once, at the very end.
In practice, we would want to make each step stay limited (so that it could even be put on a computer for one thing), but for theoretical analysis, leaving out normalizing constants until the very end is nice.
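Here’s the same two-layer net, finite $H$ again, with all-positive weights so the relu drops out; normalizing at every step and normalizing once at the very end give the same number, which matches the claim above.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 100_000
x = np.array([[1.0]])

# all-positive entries so the relu acts as the identity and scaling pulls through it
A = rng.uniform(0.1, 1.0, size=(H, 1))
B = rng.uniform(0.1, 1.0, size=(1, H))

relu = lambda z: np.maximum(z, 0.0)

# strategy 1: keep every intermediate limited by normalizing at each step
step_wise = (B @ relu(A @ x / np.sqrt(H)) / np.sqrt(H)).item()

# strategy 2: let the intermediates be "unlimited", divide once by sqrt(H) * sqrt(H) at the end
end_only = (B @ relu(A @ x)).item() / (np.sqrt(H) * np.sqrt(H))

print(step_wise, end_only)   # equal up to floating-point rounding
```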