Matrix Calculus

This is a topic near to my heart, in the same way that an old bullet stuck in your chest is.

Matrix calculus is when learning calculus from a computational perspective starts to fail you, because the sheer number of indices will crush you in their wake, leaving behind only a grad student who had the gall to try and fob me off with a bullshit explanation for why the derivative with respect to a matrix is what the Matrix Cookbook says it is.

Screw him, and the Matrix Cookbook. It’s OK to say you don’t know.

Here’s how you actually find matrix derivatives.

You can use the definition of the Frechet derivative in terms of the first-order Taylor approximation to actually find derivatives from first principles.

But once you’ve done that a few times, use the basic algebraic properties of the derivative to save yourself a lot of work.

I’m not going to explain the technical details of Frechet derivatives, but I will give an example. See this for a guide. Or one of a million real/functional analysis textbooks.

Also, read something like Sheldon Axler’s Linear Algebra Done Right. If you’re trying to compute matrix derivatives, you need to know what you’re doing, and if you don’t know the connection between matrices and linear maps, you don’t know what you’re doing.

After you’ve convinced yourself of the truth of all this, come back and use the definition below.

No, really. Go read one of the resources I mentioned first. Don’t waste 2 years like I did. There’s a reason I’m writing about this.

Frechet Derivatives in Terms of Taylor Expansion

To find the derivative of \(f\) with respect to its argument \(x\)¹, use

\[f(x + h) = f(x) + Dh + o(\cdot)\]

The expression \(Dh\) is called the total differential (\(df\) in some textbooks) because it approximates \(f(x + h) - f(x)\) with a linear function. We care about \(D\), which is the Frechet derivative at \(x\).

Keep in mind that \(x\) and \(h\) can be scalars, vectors, or matrices. It’s all the same conceptually. \(x\) and \(h\) just have to be of the same type/shape.

A lot of the functions you want to differentiate aren’t too hairy. The ones that are, use the properties of the derivative, such as the chain rule. If even that fails you, my condolences.

The \(o(\cdot)\) term collects all the expressions that vanish faster than \(\|h\|\) as \(h \to 0\). Basically anything with more than one \(h\) in it.
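To see the definition doing work numerically, here's a minimal sketch (my own toy example, plain NumPy, scalar case): for \(f = \sin\), the derivative at \(x\) is multiplication by \(\cos(x)\), and the leftover remainder shrinks faster than \(h\) does.

```python
import numpy as np

# First-order Taylor check: f(x + h) = f(x) + D*h + o(|h|).
# Here f = sin, so the Frechet derivative at x is multiplication by cos(x).
f, x = np.sin, 0.7
D = np.cos(x)  # the derivative at x

for h in (1e-1, 1e-2, 1e-3):
    remainder = f(x + h) - f(x) - D * h
    # o(|h|) means remainder / |h| -> 0 as h -> 0;
    # watch the printed ratio drop by a factor of ~10 per row
    print(h, remainder / abs(h))
```

If you picked the wrong \(D\), the ratio would stall at a nonzero constant instead of going to 0. That's the whole content of the definition.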

Worked Example

The function we want to differentiate (with respect to \(x\)) is \(f(x) = x^T A x\). \(x\) is a vector, \(A\) a matrix.

We start with the definition \(f(x + h) = f(x) + Dh + o(\cdot)\).

Then we move stuff over to get \(f(x + h) - f(x) = Dh + o(\cdot)\).

Doing the algebra gives \(f(x+h) - f(x) = (x + h)^{T}A(x + h) - x^{T}Ax\), which equals \(x^{T}Ah + h^{T}Ax + h^{T}Ah\). The last term belongs in our \(o(\cdot)\) (it's quadratic in \(h\), so dividing it by \(\|h\|\) and letting \(\|h\| \to 0\) sends the quotient to 0), so we'll forget about it and look at the other 2 terms.

Whenever you see something like \(x^{T}Ah\) (a vector transposed times a matrix times another vector), it’s an expression that evaluates to a scalar, so transposing it again leaves its value unchanged. This trick lets us simplify the expression further.
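A quick NumPy sanity check of the trick (the sizes and seed here are arbitrary, my choice):

```python
import numpy as np

# The scalar-transpose trick: h^T A x is 1x1, so it equals its own transpose,
# which lets us rewrite h^T A x as x^T A^T h.
rng = np.random.default_rng(0)
x, h = rng.standard_normal(3), rng.standard_normal(3)
A = rng.standard_normal((3, 3))

# h^T A x and x^T A^T h are the same scalar
print(np.allclose(h @ A @ x, x @ A.T @ h))
```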

Something Ani Adhikari once said about tricks like this stuck with me:

Once is a trick, twice is a technique.

There is a principled reason for why this particular trick works (only in Hilbert spaces), but that’s the subject of another blog post.

Transposing the second term \(h^{T}Ax\) turns it into \(x^{T}A^{T}h\), so the first two terms become \(x^{T}Ah + x^{T}A^{T}h\).

Now we’ll put back the terms we left out to get the full expression.

\(x^T A h + x^T A^T h + h^T A h = Dh + o(\cdot)\).

The last terms on the left and right hand sides match up: \(h^{T}Ah\) is exactly the kind of term \(o(\cdot)\) absorbs, so both drop out. Factoring the \(h\) out of the left hand side, we get \((x^T A + x^T A^T) h = Dh\).

Then \(D = x^T A + x^T A^T\). We can factor out \(x^T\) to get \(x^T(A+A^T)\). If \(A\) is symmetric, then the expression simplifies to \(2x^TA\).

Worked Example, The Lazy Way

Or we could use the product rule and the fact that the derivative of the transpose is the transpose itself (since it's linear; you did read about Frechet derivatives, didn't you?).

\(d(x^T Ax) = d(x^T )Ax + x^T d(A)x + x^T A d(x)\) by the product rule.

Now we plug in the simple derivatives, which are intuitive: \(d(A) = 0\) because \(A\) is a constant, and \(d(x) = dx\). If you don't believe me, compute them yourself.

The most interesting derivative is that of the transpose function. The transpose is linear, so its derivative is the transpose map itself: \(d(x^T) = (dx)^T\). In fact, any linear function commutes with the derivative like that.

Plugging everything in leaves \(D\,dx = (dx)^T A x + x^T A\,dx\). We can do the trick with flipping the transpose of a scalar to rewrite \((dx)^T A x\) as \(x^T A^T dx\), which gives \(D = x^T A + x^T A^T\), just like above.

That was much shorter, at least in algebra.
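And if you'd rather not trust the algebra at all, a central-difference gradient (again my own throwaway setup, arbitrary sizes) agrees with \(x^T(A + A^T)\) coordinate by coordinate:

```python
import numpy as np

# Cross-check D = x^T (A + A^T) against a central-difference gradient.
rng = np.random.default_rng(2)
n = 3
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)
f = lambda v: v @ A @ v

eps = 1e-6
# one coordinate direction at a time: (f(x + eps*e) - f(x - eps*e)) / (2*eps)
grad_fd = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(n)
])
print(np.allclose(grad_fd, x @ (A + A.T), atol=1e-6))
```

Central differences are exact for quadratics up to rounding, which is why such a crude check works here.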

Practice makes perfect.

  1. Technically, this approach only works if you already know that the derivative exists, but for almost everything you care about, it does, since generally you’re working with a composition of obviously differentiable functions. The Chain Rule guarantees their composition is differentiable. 
