# Of Gradients and Matrices

This post contains additional information for readers of Programming Machine Learning. It focuses on a single line of code from the book: the code that calculates the gradient of the loss with respect to the weights.

The first time you see that line in the book, it looks like this:
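The book's code isn't reproduced in this post, so here's a sketch of that line, reconstructed from the steps described later in this post. The NumPy calls and the simple linear `predict()` are my assumptions, not necessarily the book's verbatim code:

```python
import numpy as np

def predict(X, w):
    # assumed one-input predictor: scale each input by the weight w
    return X * w

def gradient_one_input(X, Y, w):
    # derivative of the mean squared error with respect to w:
    # average the inputs times the prediction errors, then multiply by 2
    return 2 * np.average(X * (predict(X, w) - Y))
```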

In the book, functions that calculate the gradient are called *gradient()*. Here, I wrapped the code in a function named *gradient_one_input()*. The name highlights the fact that this code works when we have one input variable. For example, we use it to predict pizza sales from reservations.

However, the code changes when we move from one to many input variables. In Chapter 4 (*Hyperspace!*), it becomes:
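Here's a hedged reconstruction of that multi-input version, pieced together from the explanation in the rest of this post. Again, the NumPy calls and the linear `predict()` are assumptions of mine:

```python
import numpy as np

def predict(X, w):
    # assumed multi-input predictor: one prediction per row of X
    return np.matmul(X, w)

def gradient_multiple_inputs(X, Y, w):
    # same idea as gradient_one_input(), generalized to matrices
    return 2 * np.matmul(X.T, (predict(X, w) - Y)) / X.shape[0]
```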

The earlier version of the code, in *gradient_one_input()*, is the direct port of a formula: the derivative of the loss with respect to the weight. If you re-read *A Sprinkle of Math*, in Chapter 3, you can see how that code comes together. By contrast, the code in *gradient_multiple_inputs()* is less straightforward. With all those matrix operations, it’s hard to see how it works. Where does the matrix multiplication come from? What about the transpose operation on *X*?

To save space, I don’t answer those questions in the book. I drop the gradient code in your lap and move on. Indeed, the latter version of the gradient code might be the only line in the book that I don’t explain in detail.

This post makes up for that missing explanation. I’m going to show you that *gradient_multiple_inputs()* implements the same steps as *gradient_one_input()*, only it’s more generic. While *gradient_one_input()* works with a single input variable, *gradient_multiple_inputs()* removes that limitation.

Let’s begin with a review of *gradient_one_input()*.

# Breaking Down the One-Input Code

To understand *gradient_multiple_inputs()*, let’s start from *gradient_one_input()*. Here it is again:
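As before, this is a reconstruction from the steps listed below, with NumPy and a linear `predict()` assumed, rather than the book's exact code:

```python
import numpy as np

def predict(X, w):
    # assumed one-input predictor: scale each input by the weight w
    return X * w

def gradient_one_input(X, Y, w):
    # average the inputs times the prediction errors, then multiply by 2
    return 2 * np.average(X * (predict(X, w) - Y))
```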

In this early version of the code, *X* and *Y* are one-dimensional arrays of examples, and *w* is a scalar weight. Let’s spell out what this function does:

- It calculates the differences between the system’s predictions and the labels: *(predict(X, w) - Y)*.
- It multiplies each input in *X* by the matching prediction error.
- It averages the results.
- Finally, it multiplies everything by 2.
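To make those four steps concrete, here's a small worked example on made-up numbers (the data is purely illustrative, and the linear prediction *X · w* is an assumption):

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0])   # made-up inputs (say, reservations)
Y = np.array([3.0, 5.0, 9.0])   # made-up labels (say, pizzas sold)
w = 2.0

errors = X * w - Y              # step 1: predictions minus labels
weighted = X * errors           # step 2: each input times its error
mean = np.average(weighted)     # step 3: average the results
gradient = 2 * mean             # step 4: multiply everything by 2
print(gradient)                 # prints -8.0
```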

Those are the steps that *gradient_one_input()* goes through. Now let’s see how *gradient_multiple_inputs()* goes through the same steps, only with multiple inputs.

# Moving to Multiple Inputs

We’re about to examine the code in *gradient_multiple_inputs()*:
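Once more, here's my sketch of that function, reconstructed from the walkthrough that follows (NumPy and a linear `predict()` assumed):

```python
import numpy as np

def predict(X, w):
    # assumed multi-input predictor: one prediction per row of X
    return np.matmul(X, w)

def gradient_multiple_inputs(X, Y, w):
    # transpose X, multiply by the errors, average, and multiply by 2
    return 2 * np.matmul(X.T, (predict(X, w) - Y)) / X.shape[0]
```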

To read the next few paragraphs, remember how matrix multiplication and transpose work. If you need a refresher, go review those operations in the book. Take the time you need; I’ll be waiting.

Welcome back! Let’s begin by looking at the parameters of *gradient_multiple_inputs()* and their shapes:
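These are the shapes that the walkthrough below implies, with *m* examples and *n* input variables. The concrete numbers here are just an example:

```python
import numpy as np

m, n = 30, 3          # say, 30 examples with 3 input variables each
X = np.zeros((m, n))  # one row per example, one column per input variable
w = np.zeros((n, 1))  # one weight per input variable
Y = np.zeros((m, 1))  # one label per example
```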

Now let’s see what *gradient_multiple_inputs() *does. First, it calculates the prediction errors:
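In code, that first step might look like this (toy data, with `predict()` assumed linear):

```python
import numpy as np

def predict(X, w):
    # assumed linear predictor
    return np.matmul(X, w)

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])           # (3, 2): three examples, two inputs
w = np.array([[0.1], [0.2]])         # (2, 1): one weight per input
Y = np.array([[0.5], [1.0], [1.5]])  # (3, 1): one label per example

E = predict(X, w) - Y                # the prediction errors
print(E.shape)                       # same shape as Y: (3, 1)
```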

This calculation looks the same as the one in *gradient_one_input()*. However, its operands have a different shape, and so does its result. The result is a matrix with the same dimensions as *Y*.

Next, *gradient_multiple_inputs()* multiplies *X* by *E*—except that if you check those matrices’ dimensions, you’ll see that they don’t fit:
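You can verify the mismatch directly. With three examples and two input variables, the inner dimensions are 2 and 3, so NumPy refuses the multiplication (the shapes here are just an example):

```python
import numpy as np

X = np.zeros((3, 2))  # (m, n): m examples, n input variables
E = np.zeros((3, 1))  # (m, 1): one error per example

try:
    np.matmul(X, E)   # inner dimensions are 2 and 3: they don't match
except ValueError:
    print("shapes don't fit")
```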

See? The inner dimensions of *X* and *E* are different, so you cannot multiply them. On the other hand, if you transpose *X*, the dimensions fit just right.

That’s what *gradient_multiple_inputs()* does. It transposes *X* and multiplies it by the errors:
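Continuing with toy numbers, the transposed multiplication comes out like this (the data is illustrative):

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])           # (3, 2)
E = np.array([[1.0], [2.0], [3.0]])  # (3, 1): made-up errors

# (2, 3) times (3, 1) gives (2, 1): one row per input variable,
# each the sum of that variable's inputs times the matching errors
product = np.matmul(X.T, E)
print(product)
```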

Let’s take a close look at the result. By the rules of matrix multiplication, it has as many rows as *X.T* (the number of input variables) and as many columns as *E* (1). Each element in this matrix is the sum of the input variables in a row of *X.T*, multiplied by their matching errors.

Finally, the function takes that resulting matrix, divides its elements by *X.shape[0]*, and multiplies them by 2. Note that *X.shape[0]* is the number of rows in *X*, that is, the number of examples.

Let’s recap the steps that *gradient_multiple_inputs()* performs:

- It calculates the prediction errors. There are as many errors as there are examples.
- For each example in *X*, it multiplies the example’s input variables by the matching error.
- It adds the results together and divides them by the number of examples, which amounts to taking their average.
- It multiplies everything by 2.

Go back and compare the four steps above to the four steps of *gradient_one_input()*. Yup! They’re exactly the same.

Bottom line: *gradient_one_input()* and *gradient_multiple_inputs()* do the same thing, only *gradient_multiple_inputs()* does it with matrices. Thanks to those matrices, the latter function can deal with many input variables.
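If you want to double-check that claim, run both versions on the same single-variable data and confirm they produce the same number. Both functions here are my reconstructions from the steps above, with the linear `predict()` inlined:

```python
import numpy as np

def gradient_one_input(X, Y, w):
    # one-input version: X and Y are 1-D arrays, w is a scalar
    return 2 * np.average(X * (X * w - Y))

def gradient_multiple_inputs(X, Y, w):
    # matrix version: X, Y, and w are 2-D arrays
    return 2 * np.matmul(X.T, (np.matmul(X, w) - Y)) / X.shape[0]

X = np.array([1.0, 2.0, 3.0])
Y = np.array([3.0, 5.0, 9.0])
w = 2.0

g1 = gradient_one_input(X, Y, w)
# reshape the same data into single-column matrices for the matrix version
g2 = gradient_multiple_inputs(X.reshape(-1, 1), Y.reshape(-1, 1), np.array([[w]]))
print(g1, g2[0, 0])  # both print -8.0
```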

# Wrapping It Up

It’s relatively easy to wrap your mind around *gradient_one_input()*. By contrast, *gradient_multiple_inputs()* is harder to grok. In particular, the averaging operation is hard to spot, because it’s muddled by a matrix multiplication. If you glance at the code in the two functions, you see that they look similar, but it’s hard to say how exactly.

In this post, I showed you that the two functions aren’t just similar: they follow the same exact steps. However, the second function is more capable, because it can deal with many input variables.

And with that, I covered the one missing explanation in the book. At least, I *think* it’s the only one. If you disagree and you’d like more clarification, ask!

*This post is a spin-off of *Programming Machine Learning*, a zero-to-hero introduction for programmers, from the basics to deep learning. Go **here** for the eBook, **here** for the paper book, or come to the **forum** if you have questions and comments!*