Of Gradients and Matrices
This post contains additional information for readers of Programming Machine Learning. It focuses on a single line of code from the book: the code that calculates the gradient of the loss with respect to the weights.
The first time you see that line in the book, it looks like this:
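```python
import numpy as np

# A sketch of the book's one-input code, with the bias term
# left out to keep the focus on the gradient:
def predict(X, w):
    return X * w

def gradient_one_input(X, Y, w):
    return 2 * np.average(X * (predict(X, w) - Y))
```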
In the book, functions that calculate the gradient are called gradient(). Here, I wrapped the code in a function named gradient_one_input(). The name highlights the fact that this code works when we have one input variable. For example, we use it to predict pizza sales from reservations.
However, the code changes when we move from one to many input variables. In Chapter 4 (Hyperspace!), it becomes:
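```python
# A sketch of the Chapter 4 version; predict() is now
# a matrix multiplication:
def predict(X, w):
    return np.matmul(X, w)

def gradient_multiple_inputs(X, Y, w):
    return 2 * np.matmul(X.T, (predict(X, w) - Y)) / X.shape[0]
```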
The earlier version of the code, in gradient_one_input(), is the direct port of a formula: the derivative of the loss with respect to the weight. If you re-read A Sprinkle of Math, in Chapter 3, you can see how that code comes together. By contrast, the code in gradient_multiple_inputs() is less straightforward. With all those matrix operations, it’s hard to see how it works. Where does the matrix multiplication come from? What about the transpose operation on X?
To save space, I don’t answer those questions in the book. I drop the gradient code in your lap and move on. Indeed, the latter version of the gradient code might be the only line in the book that I don’t explain in detail.
This post makes up for that missing explanation. I’m going to show you that gradient_multiple_inputs() implements the same steps as gradient_one_input() — only it’s more generic. While gradient_one_input() works with a single input variable, gradient_multiple_inputs() removes that limitation.
Let’s begin with a review of gradient_one_input().
Breaking Down the One-Input Code
To understand gradient_multiple_inputs(), let’s start from gradient_one_input(). Here it is again:
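```python
def gradient_one_input(X, Y, w):
    return 2 * np.average(X * (predict(X, w) - Y))
```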
In this early version of the code, X and Y are one-dimensional arrays of examples, and w is a scalar weight. Let’s spell out what this function does:
- It calculates the differences between the system’s predictions and the labels: (predict(X, w) - Y). The result is an array of prediction errors.
- It multiplies each input in X by the matching prediction error.
- It averages the results.
- Finally, it multiplies everything by 2.
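Put together, those four steps are the port of this formula, the derivative of the mean squared error loss with respect to w (here, m is the number of examples, and $\hat{y}_i$ is the prediction for the i-th example):

$$\frac{\partial L}{\partial w} = \frac{2}{m} \sum_{i=1}^{m} x_i \, (\hat{y}_i - y_i)$$

If it helps to match the steps to the code, here is the same function unrolled, one step per line (the intermediate variable names are my own):

```python
def gradient_one_input(X, Y, w):
    errors = predict(X, w) - Y       # step 1: prediction errors
    products = X * errors            # step 2: inputs times matching errors
    mean = np.average(products)      # step 3: average the results
    return 2 * mean                  # step 4: multiply everything by 2
```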
Those are the steps that gradient_one_input() goes through. Now let’s see how gradient_multiple_inputs() goes through the same steps — only with multiple inputs.
Moving to Multiple Inputs
We’re about to examine the code in gradient_multiple_inputs():
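```python
def gradient_multiple_inputs(X, Y, w):
    return 2 * np.matmul(X.T, (predict(X, w) - Y)) / X.shape[0]
```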
To read the next few paragraphs, remember how matrix multiplication and transpose work. If you need a refresher, go review those operations in the book. Take the time you need — I’ll be waiting.
Welcome back! Let’s begin by looking at the parameters of gradient_multiple_inputs() and their shapes:
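Calling m the number of examples and n the number of input variables:

- X is the matrix of examples. It has one row per example and one column per input variable, so its shape is (m, n).
- Y is the matrix of labels, with one label per example: its shape is (m, 1).
- w is the matrix of weights, with one weight per input variable: its shape is (n, 1).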
Now let’s see what gradient_multiple_inputs() does. First, it calculates the prediction errors:
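In the snippets that follow, I’ll give this intermediate result a name, E, for “errors”:

```python
E = predict(X, w) - Y    # the prediction errors, one per example
```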
This calculation looks the same as the one in gradient_one_input(). However, its operands have a different shape, and so does its result. The result is a matrix with the same dimensions as Y: one row per example, and a single column.
Next, gradient_multiple_inputs() multiplies X by E. Except that, if you check those matrices’ dimensions, you’ll see that they don’t fit: X has shape (m, n), while E has shape (m, 1).
See? The inner dimensions of X and E (n and m, respectively) are different, so you cannot multiply them. On the other hand, if you transpose X, the dimensions fit just right: X.T has shape (n, m), and an (n, m) matrix times an (m, 1) matrix is a legal multiplication.
That’s what gradient_multiple_inputs() does. It transposes X and multiplies it by the errors:
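```python
np.matmul(X.T, E)    # the transposed inputs times the errors
```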
Let’s take a close look at the result. By the rules of matrix multiplication, it has as many rows as X.T (the number of input variables) and as many columns as E (just one). Each element in this matrix is the sum of the input variables in a row of X.T, each multiplied by the matching prediction error.
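To make that concrete, here is a tiny example with made-up numbers (three examples, two input variables):

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])   # 3 examples, 2 input variables: shape (3, 2)
E = np.array([[0.5],
              [-1.0],
              [2.0]])        # one prediction error per example: shape (3, 1)

print(np.matmul(X.T, E))     # shape (2, 1): one row per input variable
# [[7.5]    = 1.0 * 0.5 + 3.0 * (-1.0) + 5.0 * 2.0
#  [9. ]]   = 2.0 * 0.5 + 4.0 * (-1.0) + 6.0 * 2.0
```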
Finally, the function takes that matrix, divides its elements by X.shape[0], and multiplies them by 2. Note that X.shape[0] is the number of rows in X, that is, the number of examples:
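```python
2 * np.matmul(X.T, E) / X.shape[0]    # the gradient: one value per weight
```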
Let’s recap the steps that gradient_multiple_inputs() performs:
- It calculates the prediction errors. There are as many errors as there are examples.
- For each example in X, it multiplies the example’s input variables by the matching error.
- It adds the results together and divides them by the number of examples — that’s like saying that it takes their average.
- It multiplies everything by 2.
Go back, and compare the four steps above to the four steps of gradient_one_input(). Yup! They’re exactly the same.
Bottom line: gradient_one_input() and gradient_multiple_inputs() do the same thing. Only, gradient_multiple_inputs() does it with matrices. Thanks to those matrices, the latter function can deal with many input variables.
Wrapping It Up
It’s relatively easy to wrap your mind around gradient_one_input(). By contrast, gradient_multiple_inputs() is harder to grok. In particular, the averaging operation is hard to spot, because it’s muddled by a matrix multiplication. If you glance at the code in the two functions, you see that they look similar — but it’s hard to say how exactly.
In this post, I showed you that the two functions aren’t just similar: they follow the same exact steps. However, the second function is more capable, because it can deal with many input variables.
And with that, I covered the one missing explanation in the book. At least, I think it’s the only one. If you disagree, and you’d like more clarifications, ask!
This post is a spin-off of Programming Machine Learning, a zero-to-hero introduction for programmers, from the basics to deep learning. Go here for the eBook, here for the paper book, or come to the forum if you have questions and comments!