This is a topic near to my heart, in the same way that an old bullet stuck in your chest is.
Matrix calculus is where learning calculus from a computational perspective starts to fail you: the sheer number of indices will crush you in their wake, leaving behind only a grad student who had the gall to try to fob me off with a bullshit explanation of why the derivative with respect to a matrix is what the Matrix Cookbook says it is.
Screw him, and the Matrix Cookbook. It’s OK to say you don’t know.
Here’s how you actually find matrix derivatives.
You can use the definition of the Frechet derivative in terms of the first-order Taylor approximation to actually find derivatives from first principles.
But once you’ve done that a few times, use the basic algebraic properties of the derivative to save yourself a lot of work.
I’m not going to explain the technical details of Frechet derivatives, but I will give an example. See this for a guide. Or one of a million real/functional analysis textbooks.
Also, read something like Sheldon Axler’s Linear Algebra Done Right. If you’re trying to compute matrix derivatives, you need to know what you’re doing, and if you don’t know the connection between matrices and linear maps, you don’t know what you’re doing.
After you’ve convinced yourself of the truth of all this, come back and use the definition below.
No, really. Go read one of the resources I mentioned first. Don’t waste 2 years like I did. There’s a reason I’m writing about this.
Frechet Derivatives in Terms of Taylor Expansion
To find the derivative of $f$ at $x$, take the increment $f(x + h)$ and massage it into the form

$$f(x + h) = f(x) + (Df(x))(h) + o(\lVert h \rVert).$$

The expression $Df(x)$ is the derivative: the unique linear map that makes this first-order Taylor approximation hold. Keep in mind that $o(\lVert h \rVert)$ stands for any remainder that dies faster than $\lVert h \rVert$, i.e. anything satisfying $o(\lVert h \rVert)/\lVert h \rVert \to 0$ as $h \to 0$.
A lot of the functions you want to differentiate aren't too hairy. For the ones that are, use the properties of the derivative, such as the chain rule. If even that fails you, my condolences.
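To see the definition in action before doing any algebra, here's a numerical sanity check (a sketch in numpy; the test function `f` and its claimed derivative `Df` are my picks for illustration, nothing canonical): if $Df(x)$ really is the derivative, the remainder $f(x+h) - f(x) - (Df(x))(h)$ should shrink faster than $\lVert h \rVert$.

```python
import numpy as np

# Toy example: f(x) = sum(sin(x)), with claimed derivative (Df(x))(h) = cos(x) . h
f = lambda x: np.sum(np.sin(x))
Df = lambda x, h: np.cos(x) @ h

rng = np.random.default_rng(0)
x = rng.standard_normal(5)
h = rng.standard_normal(5)

# If Df is really the derivative, the remainder is o(||h||):
# remainder / ||h|| should head to 0 as we shrink h.
for t in [1e-1, 1e-2, 1e-3, 1e-4]:
    hh = t * h
    remainder = f(x + hh) - f(x) - Df(x, hh)
    print(f"||h||={np.linalg.norm(hh):.1e}  remainder/||h||={abs(remainder)/np.linalg.norm(hh):.1e}")
```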
Worked Example
The function we want to differentiate (with respect to $x \in \mathbb{R}^n$) is

$$f(x) = x^\top A x,$$

where $A$ is a fixed $n \times n$ matrix.
We start with the definition

$$f(x + h) = f(x) + (Df(x))(h) + o(\lVert h \rVert).$$

Then we move stuff over to get

$$f(x + h) - f(x) = (Df(x))(h) + o(\lVert h \rVert).$$

Doing algebra gives

$$(x + h)^\top A (x + h) - x^\top A x = x^\top A h + h^\top A x + h^\top A h.$$
Whenever you see something like $h^\top A x$, remember that it's a scalar, and a scalar is its own transpose, so

$$h^\top A x = (h^\top A x)^\top = x^\top A^\top h.$$
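If you don't trust the flip, it's cheap to check numerically (a throwaway numpy snippet; the shapes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
x, h = rng.standard_normal(4), rng.standard_normal(4)

# h^T A x is a scalar, and a scalar equals its own transpose,
# so it must agree with x^T A^T h.
print(np.isclose(h @ A @ x, x @ A.T @ h))  # True
```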
Something Ani Adhikari once said about tricks like this stuck with me:
Once is a trick, twice is a technique.
There is a principled reason why this particular trick works (only in Hilbert spaces), but that's the subject of another blog post.
Transpose the second term to get

$$x^\top A h + x^\top A^\top h = x^\top (A + A^\top) h.$$
Now we'll put back the terms we left out to get the full expression:

$$f(x) + (Df(x))(h) + o(\lVert h \rVert) = f(x) + x^\top (A + A^\top) h + h^\top A h.$$

The $f(x)$ terms on the left and right hand side can be ignored because they're independent of $h$ and cancel, and $h^\top A h$ can be folded into the $o(\lVert h \rVert)$ because $|h^\top A h| \le \lVert A \rVert \lVert h \rVert^2$, which vanishes faster than $\lVert h \rVert$. Then

$$(Df(x))(h) = x^\top (A + A^\top) h,$$

or in gradient form, $\nabla f(x) = (A + A^\top) x$.
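As a sanity check on the result (a finite-difference sketch in numpy, again my own illustration), the gradient $(A + A^\top) x$ should match central differences even when $A$ is not symmetric:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 6
A = rng.standard_normal((n, n))   # deliberately not symmetric
x = rng.standard_normal(n)

f = lambda x: x @ A @ x
grad = (A + A.T) @ x              # the gradient we just derived

# Central differences along each coordinate direction.
eps = 1e-6
fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(n)])

print(np.allclose(fd, grad, atol=1e-6))  # True
```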
Worked Example, The Lazy Way
Or we could use the product rule and the fact that the derivative of the transpose is the transpose itself (since it's linear; you did read about Frechet derivatives, didn't you?). Writing $f(x) = x^\top \cdot (Ax)$, the product rule gives

$$(Df(x))(h) = (D(x^\top))(h) \cdot Ax + x^\top \cdot (D(Ax))(h).$$
Now we plug in the simple derivatives, which are intuitive. If you don’t believe me, compute them yourself.
The most interesting derivative here is the transpose's. The transpose takes a vector/matrix and returns its transpose; since it's linear, it is its own derivative at every point. In math,

$$(D(x^\top))(h) = h^\top,$$

and likewise $(D(Ax))(h) = Ah$. Plugging these in gives

$$(Df(x))(h) = h^\top A x + x^\top A h.$$
We can do the trick with flipping the transpose times something to get

$$(Df(x))(h) = x^\top A^\top h + x^\top A h = x^\top (A + A^\top) h.$$
That was much shorter, at least in algebra.
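One more numerical reassurance that the two routes agree (another throwaway numpy sketch): the raw product-rule pieces, the flipped form, and a finite difference along $h$ all give the same directional derivative.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 5))
x, h = rng.standard_normal(5), rng.standard_normal(5)

f = lambda x: x @ A @ x

product_rule = h @ A @ x + x @ A @ h    # h^T A x + x^T A h
flipped = x @ (A + A.T) @ h             # x^T (A + A^T) h
eps = 1e-6
fd = (f(x + eps * h) - f(x - eps * h)) / (2 * eps)

print(np.allclose([product_rule, flipped], fd))  # True
```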
Practice makes perfect.
- Technically, this approach only works if you already know that the derivative exists. But for almost everything you care about, it does: you're generally working with a composition of obviously differentiable functions, and the chain rule guarantees that their composition is differentiable.