Specifically, what does multiplying a matrix \(A\) by itself a bunch of times do?
If it’s diagonalizable¹, the answer is pretty simple.
Simplify, Simplify
Matrices are just a bunch of numbers anyway, so why not focus on just one number?
If you take some number \(a \geq 0\) and multiply it by itself a bunch of times, what happens?
Or in math, what’s \(\lim\limits_{n \to \infty} a^n\)?
If \(a = 0\), then it’s just 0.
If \(a = 1\), then it’s 1.
The other cases are more interesting. For any \(0 < a < 1\), you’re repeatedly shrinking the result, and will converge to 0. For \(a > 1\), you go off to infinity.
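A quick numerical check, if you want one (my own throwaway sketch, not part of the original argument):

```python
# Raise a few scalars to a big power and see where they land.
for a in [0.0, 0.5, 0.9, 1.0, 1.1, 2.0]:
    print(f"a = {a}: a^100 = {a ** 100:.3g}")
# a < 1 collapses toward 0, a = 1 stays put, a > 1 blows up.
```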
Turns out this simple reasoning can carry over to matrices pretty much unchanged.
What Does It Mean For A Matrix To Blow Up?
\(A^n\) blows up if, as \(n\) grows, it maps a vector \(v\) to vectors with larger and larger norm (aka length). Basically, \(A\) keeps stretching vectors.
Eigenvalues Are Your Friend
Let \(A:V \to V\) be diagonalizable, with \(V\) a real vector space. Then, since \(A\) is diagonalizable, its eigenvectors \(e_i\) form a basis of \(V\), so we can write any vector \(v \in V\) as \(v = \sum_i a_i e_i\). Applying \(A\) gives \(A v = \sum_i a_i \lambda_i e_i\), where \(\lambda_i\) is the eigenvalue of \(e_i\). These summed components can be looked at individually. That’s the whole point of finding eigenvectors.
Since we want to know how much \(A\) stretches \(v\), we only need its eigenvalues. Each one tells you how much \(v\) is stretched along the direction of its corresponding eigenvector.
This lets us reuse the single-number case almost directly. Applying \(A\) \(n\) times gives \(A^n v = \sum_i a_i \lambda_i^n e_i\), so along each eigendirection, exactly the same logic holds. Just replace \(a\) above with \(\lambda_i\), and it’s the same.
So if any eigenvalue’s magnitude is greater than 1, \(A\) will make \(v\) longer along that direction, and if any is less than 1, it will shrink it.
If even one eigenvalue has magnitude greater than 1, any vector with a nonzero component along the corresponding eigenvector will be stretched to be infinitely long.
If all the eigenvalues have magnitude less than 1, any vector will eventually shrink to 0.
If they all have magnitude 1, then \(v\) may move around, but it won’t be stretched or shrunk along any eigendirection. \(A\) acts like a rotation/reflection.
In fact, this analysis lets us think of \(A^n\) as essentially the same thing as \(a^n\).
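Here’s a small numpy sketch (mine, using an arbitrary symmetric matrix so the eigendecomposition is guaranteed to exist) that checks \(A^n v = \sum_i a_i \lambda_i^n e_i\) numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
A = A @ A.T          # symmetric, hence diagonalizable with real eigenvalues
v = rng.normal(size=3)
n = 5

# Eigendecomposition: columns of E are the eigenvectors e_i.
lam, E = np.linalg.eigh(A)

# Coordinates a_i of v in the eigenbasis.
a = np.linalg.solve(E, v)

# A^n v computed directly vs. via sum_i a_i * lam_i^n * e_i.
direct = np.linalg.matrix_power(A, n) @ v
via_eig = E @ (a * lam ** n)
print(np.allclose(direct, via_eig))   # True
```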
But here’s an observation that leads to something cool. Say \(A\) is \(2 \times 2\), with eigenvalues 3 and 4 and corresponding eigenvectors \(e_1,e_2\). Iterating \(A\) 20 times will stretch any vector \(v\) by a factor of \(3^{20}\) in direction \(e_1\) and by \(4^{20}\) in direction \(e_2\). But that means \(v\) will be stretched far more in one direction than the other, by a factor of \(\frac{4^{20}}{3^{20}} \approx 315\). Each multiplication by \(A\) pulls \(v\) towards the direction of its larger (in magnitude) eigenvalue.
You can see that by plotting a bunch of points on a sphere and their images under repeated multiplication by a fixed matrix. They get pulled along one direction more and eventually become lines in the limit. The narrower ellipses are later iterations.
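The plots were generated with something along these lines (a reconstruction, not the original script; the matrix and iteration counts here are made up):

```python
import numpy as np
import matplotlib.pyplot as plt

# Points on the unit circle.
theta = np.linspace(0, 2 * np.pi, 200)
points = np.stack([np.cos(theta), np.sin(theta)])  # shape (2, 200)

# Arbitrary 2x2 matrix with distinct real eigenvalues (~1.18 and ~0.82).
A = np.array([[1.05, 0.30],
              [0.10, 0.95]])

for n in [0, 2, 5, 10]:
    image = np.linalg.matrix_power(A, n) @ points
    image = image / np.abs(image).max()  # rescale so every iteration fits on the same axes
    plt.plot(image[0], image[1], label=f"n = {n}")

plt.legend()
plt.gca().set_aspect("equal")
plt.show()
```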
This is where one of my favorite things about linear algebra kicks in. It lets you take that geometric intuition about what happens in 1 or 2 dimensions and generalize to higher ones. Diagonal operators make this even easier since you can look at one direction at a time. This pulling phenomenon happens in \(n\) dimensions and you end up with a vector that’s a scalar multiple of the eigenvector corresponding to the largest eigenvalue.
Eigenapplication: Finding Eigenvectors
Much to my surprise, this stupid simple method is called power iteration and it is actually used to find eigenvectors. You’ll always end up with a vector proportional to the eigenvector in the direction with the most stretch (the largest magnitude eigenvalue).
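Here’s a minimal sketch of what power iteration looks like (my own implementation; the example matrix is made up):

```python
import numpy as np

def power_iteration(A, num_iters=50, seed=0):
    """Approximate a unit eigenvector for A's largest-magnitude eigenvalue."""
    v = np.random.default_rng(seed).normal(size=A.shape[0])
    for _ in range(num_iters):
        v = A @ v
        v = v / np.linalg.norm(v)  # renormalize so the iterate neither blows up nor vanishes
    return v

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])  # eigenvalues ~3.62 and ~1.38
v = power_iteration(A)
print(v, v @ A @ v)  # the Rayleigh quotient v^T A v estimates the dominant eigenvalue
```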
I was surprised at how fast the plots above converged to nearly straight lines. I ran 100 different simulations and all converged within about 7 iterations. While these matrices are only 2-dimensional, that’s still not bad. Though I’m sure there exist examples that will make me recant this later.
Eigenapplication: Vanishing/Exploding Gradients
This is basically why recurrent networks and anything with long chains of the same gradient suck to train.
Let’s keep it simple and say we have a recurrent net whose hidden state at each step has derivative \(D\) with respect to the hidden state at the previous step, with ReLU activations. We’ll also pretend that the pre-activations are all positive, so the ReLU is just the identity function and we can ignore it. Backpropagating through \(n\) steps then leaves us with the product of \(D\) with itself \(n\) times. Unless all the eigenvalues are pretty close to 1 in magnitude, \(D^n\) is likely to go to 0 or blow up. Have fun running gradient descent.
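A rough numerical illustration (random matrices of my own choosing, not gradients from an actual RNN):

```python
import numpy as np

rng = np.random.default_rng(1)
D = rng.normal(size=(10, 10))
radius = np.abs(np.linalg.eigvals(D)).max()  # spectral radius of the raw matrix

# Rescale D so its largest-magnitude eigenvalue hits a chosen target, then take the 50th power.
for target in [0.9, 1.0, 1.1]:
    Dn = np.linalg.matrix_power(D * (target / radius), 50)
    print(f"spectral radius {target}: ||D^50|| = {np.linalg.norm(Dn):.3g}")
# The norms roughly track target**50: vanishing below 1, exploding above 1.
```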
In feedforward networks, we can try to avoid this by tuning our weights to have eigenvalues of roughly 1. This is less easy in RNNs because every step uses the same weight matrix, so the same factor shows up over and over in the gradient product.
It’s cute that with a bit of linear algebra, we can state the basic point of this paper in a lot fewer words than they took.
¹ With singular values, we can extend all this, but that’s not for today.