In this post, we’ll walk through a small neural network and see how it classifies data. We won’t spend much time on formulae. Instead, we’ll focus on visualizing the functions in the network’s nodes. In the end, you’ll have an intuitive sense of what happens at every step of a three-layered neural network.

To read this post, you should be familiar with neural networks and perceptrons. You should know about concepts like “sigmoid”, “activation function” and “bias node”. If you’re reading Programming Machine Learning, or another ML introduction, then you know everything you need already. Let’s jump in!

The Function Machine

Things are…

The derivative of the softmax and the cross entropy loss, explained step by step.

Take a glance at a typical neural network — in particular, its last layer. Most likely, you’ll see something like this:

The softmax and the cross entropy loss fit together like bread and butter. Here is why: to train the network with backpropagation, you need to calculate the derivative of the loss. In the general case, that derivative can get complicated. But if you use the softmax and the cross entropy loss, that complexity fades away. …
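To make the "complexity fades away" claim concrete, here is a minimal sketch in plain Python. It relies on a standard result: when a softmax feeds into the cross entropy loss, the derivative of the loss with respect to the inputs of the softmax is the predicted probabilities minus the one-hot label. The function names and the sample values are illustrative assumptions, not code from the post. The sketch compares that shortcut with a brute-force numerical derivative.

```python
import math

def softmax(logits):
    # Subtract the maximum for numerical stability; the result is unchanged.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, one_hot):
    # Loss for one example with a one-hot encoded label.
    return -sum(y * math.log(p) for p, y in zip(probs, one_hot))

def combined_gradient(logits, one_hot):
    # Derivative of cross_entropy(softmax(logits)) with respect to the logits:
    # predicted probabilities minus the label.
    return [p - y for p, y in zip(softmax(logits), one_hot)]

def numerical_gradient(logits, one_hot, eps=1e-6):
    # Central differences, used only to check the shortcut above.
    grads = []
    for i in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[i] += eps
        down[i] -= eps
        loss_up = cross_entropy(softmax(up), one_hot)
        loss_down = cross_entropy(softmax(down), one_hot)
        grads.append((loss_up - loss_down) / (2 * eps))
    return grads

# Hypothetical values, chosen only for illustration.
logits = [2.0, 1.0, 0.1]
label = [1, 0, 0]

analytic = combined_gradient(logits, label)
numeric = numerical_gradient(logits, label)
```

The two lists should agree to several decimal places, which suggests that the one-line subtraction really does stand in for the longer chain-rule calculation.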

In the first post of this two-post series, I told you the basic idea behind machine learning. Now let’s get a bit more concrete and talk about a specific approach to machine learning, one that has produced impressive results. It’s called supervised learning. Let’s see how supervised learning solves hairy problems like recognizing images.

To do supervised learning, we need to start from a set of examples, each carrying a label that the computer can learn from. For instance:

As you can see, examples can be a lot of different things: data, text, sound, video, and so on. Also, labels can…

This is the first of two posts adapted from Chapter 1 of Programming Machine Learning. The second post is here.

Software developers like to share war stories. As soon as a few of us sit down in a pub, somebody asks: “What project are you working on?” Then we nod our heads off as we listen to each other’s amusing, and sometimes horrible, tales.

In the mid-90s, during one of those evenings of bantering, a friend told me about the impossible mission she was on. …

This post contains additional information for readers of Programming Machine Learning. It focuses on a single line of code from the book: the code that calculates the gradient of the loss with respect to the weights.

The first time you see that line in the book, it looks like this:

In the book, functions that calculate the gradient are called gradient(). Here, I wrapped the code in a function named gradient_one_input(). The name highlights the fact that this code works when we have one input variable. For example, we use it to predict pizza sales from reservations.
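As a rough illustration of what a one-input gradient can look like, here is a sketch in plain Python. It is not the book's code. It assumes a linear model without a bias (prediction = input * weight) and a mean squared error loss, and the reservation and pizza numbers are made up. Only the name gradient_one_input() comes from the text above.

```python
def predict(X, w):
    # Assumed model: one input variable, one weight, no bias.
    return [x * w for x in X]

def loss(X, Y, w):
    # Mean squared error between predictions and labels.
    predictions = predict(X, w)
    return sum((p - y) ** 2 for p, y in zip(predictions, Y)) / len(X)

def gradient_one_input(X, Y, w):
    # Derivative of the mean squared error with respect to w:
    # 2 * average(x * (x * w - y))
    predictions = predict(X, w)
    return 2 * sum(x * (p - y) for x, p, y in zip(X, predictions, Y)) / len(X)

# Hypothetical data: reservations and pizzas sold.
reservations = [1.0, 2.0, 3.0]
pizzas = [2.0, 4.0, 6.0]

gradient_at_zero = gradient_one_input(reservations, pizzas, 0.0)
gradient_at_best = gradient_one_input(reservations, pizzas, 2.0)
```

With this made-up data, a weight of 2 fits the examples exactly, so the gradient there is zero. At a weight of 0 the gradient is negative, which tells gradient descent to increase the weight.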

However, the code…

There are many ways to calculate the loss of a classifier. In Programming Machine Learning, we use at least three different loss formulae. This post explains the intuitive meaning of one of them: the cross entropy loss.
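Before getting to the intuition, a tiny sketch may help fix the idea. For a single example, the cross entropy loss is the negative logarithm of the probability that the classifier assigned to the correct class. The function name and the probabilities below are assumptions made up for illustration, with three classes and the correct class at index 0.

```python
import math

def cross_entropy_loss(predicted_probs, true_index):
    # Negative log of the probability given to the correct class.
    return -math.log(predicted_probs[true_index])

# Hypothetical predictions for one example whose correct class is index 0.
confident_right = cross_entropy_loss([0.9, 0.05, 0.05], 0)  # about 0.105
unsure = cross_entropy_loss([0.4, 0.3, 0.3], 0)             # about 0.916
confident_wrong = cross_entropy_loss([0.05, 0.9, 0.05], 0)  # about 2.996
```

The loss stays small when the classifier is confident and right, grows when it hedges, and grows much faster when it is confident and wrong.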

Three Birds in the Hand

Let’s say that we’re building a classifier that recognizes birds from their song. We have three species of birds:

We have already collected a dataset of a few hundred bird songs, labeled with the birds that produced them:

Imagine that we’re building a platypus detector. This system takes animal pictures, and identifies platypuses:

A machine learning system classifies pictures of animals.

This system is a binary classifier, because it returns either a positive or a negative result. Let’s say that we get these results for the images above:

Paolo Perrotta

Author of Programming Machine Learning (
