fastai - Chapter 1 - Deep Learning in Practice [DRAFT]
A top-down overview of the chapter:
- Deep Learning is for Everyone
- What is Machine Learning?
- What is a Deep Neural Network?
- How Our Image Recognizer Works
- Training, Validation, and Test Sets
- Questionnaires
- Bibliography
As we learned earlier, machine learning is the science of writing programs that learn. Machine learning could therefore allow you to recognize dogs and cats without telling the program all the characteristics of each one (which is tricky), since the program can learn them. This learning is possible through the following model training loop:
![The model training loop]()
Let's break down this figure:
- The model receives inputs, which are the data (the images of dogs and cats).
- The model outputs predictions, which look like "Dog" or "Cat".
- The performance of the predictions is measured.
- The model is updated according to that performance so that it improves.
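The loop above can be sketched in a few lines of plain Python. This toy is purely illustrative (none of it is fastai code): the "model" is a single number `m`, predictions come from multiplying it by the inputs, and the update rule is simple trial and error.

```python
# A toy version of the training loop: the "model" is one number m,
# and the update rule keeps a random nudge only if it improves performance.
import random

inputs = [1, 2, 3, 4]
labels = [3, 6, 9, 12]                 # the pattern to learn: label = 3 * input

def predictions(m):
    return [m * x for x in inputs]     # the model makes its predictions

def performance(m):                    # lower is better (sum of squared errors)
    return sum((p - y) ** 2 for p, y in zip(predictions(m), labels))

random.seed(0)                         # fixed seed for a reproducible run
m = 0.0
for step in range(200):
    candidate = m + random.uniform(-0.5, 0.5)    # propose an updated model
    if performance(candidate) < performance(m):  # keep it only if it helps
        m = candidate

print(m)  # ends up close to 3.0, the pattern hidden in the data
```

Real training replaces the trial-and-error update with gradient-based updates, but the loop structure (predict, measure, update) is the same.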
![The training loop, with the model split into architecture and parameters]()
You can notice that we split Model into Architecture and Parameters. The architecture is the functional form of the model, and the parameters are the variables that define how the architecture operates. For example, $y = ax + b$ is an architecture with parameters $a$ and $b$ that change the behavior of the function.
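To make the distinction concrete, here is a small, hypothetical sketch: one architecture (a straight line), two sets of parameters, two different behaviors.

```python
# The same architecture (a straight line) behaves differently
# depending on its parameters a and b.
def line(a, b):
    return lambda x: a * x + b   # architecture: y = a*x + b

f = line(2, 1)    # parameters a=2, b=1
g = line(-1, 5)   # different parameters, same architecture

print(f(3))  # 7
print(g(3))  # 2
```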
![The training loop, with performance split into labels and loss]()
We can see that Performance is now split into Labels and Loss. The labels are the ground truth: if an image shows a dog, its label is Dog. The labels and the predictions can therefore be compared to measure the performance of the model. Indeed, if the prediction for an image is Cat but the label is Dog, we know the model did poorly. The loss is this measure of performance that compares the labels and the predictions, so that we can update the parameters to perform better.
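As a rough sketch of the idea (real losses such as cross-entropy are smoother, but the principle is the same), a "loss" can be as simple as the fraction of predictions that disagree with the labels:

```python
# A very simple loss: the fraction of predictions that disagree with
# the labels. Lower is better; 0.0 means every prediction was right.
def error_rate(predictions, labels):
    wrong = sum(p != l for p, l in zip(predictions, labels))
    return wrong / len(labels)

preds  = ["Cat", "Dog", "Dog", "Cat"]
labels = ["Cat", "Dog", "Cat", "Cat"]
print(error_rate(preds, labels))  # 0.25: one of four predictions is wrong
```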
![Using a trained model as a regular program]()
Once a model is trained, you can treat it as a regular program.
```python
def add(a, b):
    return a + b

add(2, 3)
```
As you can see, this program takes some inputs and outputs a result: here the inputs are $2$ and $3$, and the result is $5$.
As we learned earlier, a deep neural network is a kind of machine learning model, and "deep" refers to having more than one hidden layer (1 input layer → 1+ hidden layers → 1 output layer). According to the universal approximation theorem, such a model can in theory solve any problem by varying its parameters. Therefore, we need a general mechanism to adjust these parameters for each problem. This mechanism already exists: it is called stochastic gradient descent (SGD).
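As a bare-bones illustration of the mechanism (not fastai's actual optimizer), here is SGD fitting the $y = ax + b$ architecture from earlier to a handful of points; all the data and names below are made up for the sketch:

```python
# Stochastic gradient descent fitting y = a*x + b to points drawn
# from the true line y = 2x + 1.
import random

random.seed(42)                               # fixed seed for reproducibility
points = [(x, 2 * x + 1) for x in range(5)]   # data from the line y = 2x + 1

a, b, lr = 0.0, 0.0, 0.01
for step in range(3000):
    x, y = random.choice(points)   # "stochastic": one random example per step
    pred = a * x + b
    # gradients of the squared error (pred - y)**2 with respect to a and b
    grad_a = 2 * (pred - y) * x
    grad_b = 2 * (pred - y)
    a -= lr * grad_a               # take a small step "down" the gradient
    b -= lr * grad_b

print(a, b)  # both end up close to the true parameters a=2, b=1
```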
Let's break down the first lines of code of our image recognizer:
```python
from fastai.vision.all import *
```
This allows us to use all the tools we will need to code a variety of computer vision models.
```python
PATH = untar_data(URLs.PETS)/'images'
PATH
```
This line downloads a dataset from the fast.ai datasets collection (if not previously downloaded), extracts it (if not previously extracted), and returns a `Path` object pointing to the extracted location.
```python
def is_cat(x):
    return x[0].isupper()
```
Here we define the function `is_cat` to get the label of an image: it returns `True` if the image contains a cat, since the dataset's creators gave cat images filenames that begin with an uppercase letter.
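For instance (the filenames below are made up, but follow the dataset's convention):

```python
# The labeling rule in action: cat breeds get capitalized filenames,
# dog breeds lowercase ones.
def is_cat(x):
    return x[0].isupper()

print(is_cat("Bengal_101.jpg"))  # True  (cat)
print(is_cat("beagle_32.jpg"))   # False (dog)
```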
```python
dls = ImageDataLoaders.from_name_func(path=PATH,
                                      fnames=get_image_files(PATH),
                                      valid_pct=0.2,
                                      seed=42,
                                      label_func=is_cat,
                                      item_tfms=Resize(224))
```
Our model needs to know the kind and structure of the dataset it's working with, so we create a dataloader. Since we are working with images, this is an `ImageDataLoaders`. We use `from_name_func` because we label the images using their filenames.
Let's explain the parameters:
- `path`: where the data is stored
- `fnames`: an object containing the `Path` objects of the images' filenames
- `valid_pct`: the percentage of data randomly held out for the validation set (we will talk about this later)
- `seed`: makes your code reproducible by always generating the same validation set
- `label_func`: the function used to get the label of each image
- `item_tfms`: a transformation applied to each item (in this case, each image is resized to a 224-pixel square)
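To make `valid_pct` and `seed` concrete, here is roughly what such a split does, sketched in plain Python with made-up filenames (an illustration, not fastai's internal code):

```python
# Shuffle the filenames with a fixed seed, then hold out 20% for validation.
import random

fnames = [f"img_{i}.jpg" for i in range(10)]  # stand-in for real filenames

random.Random(42).shuffle(fnames)             # same seed -> same shuffle every run
n_valid = int(len(fnames) * 0.2)
valid, train = fnames[:n_valid], fnames[n_valid:]

print(len(train), len(valid))  # 8 2
```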
In machine learning and deep learning, we can’t do anything without data. So, the people that create datasets for us to train our models on are the (often underappreciated) heroes. Most datasets used in this book took the creators a lot of work to build.
Some of the most useful and important datasets are those that become important academic baselines; that is, datasets that are widely studied by researchers and used to compare algorithmic changes, such as MNIST, CIFAR-10, and ImageNet.
The datasets used in this book have been selected because they provide great examples of the kinds of data that you are likely to encounter, and the academic literature has many examples of model results using these datasets to which you can compare your work.
In order to evaluate the performance of our models, we need to split our data into sets to prevent "cheating" (overfitting). The split is based on how fully we want to hide data from the model and from ourselves: training data is fully exposed, validation data is less exposed, and test data is fully hidden.
![Training, validation, and test sets]()
If we train a model with all our data and evaluate the model using that same data, we would not be able to tell how well our model can perform on data it hasn’t seen since the model already has all the answers in the training set. Indeed, it could be overfitting.
To avoid this, we split our dataset into two sets: the training set and the validation set, which is used only for evaluation (not for training). This lets us test whether the model learns lessons from the training data that generalize to new data: the validation data.
However, we as humans can also cheat! In realistic scenarios, we are likely to explore many versions of a model by choosing various hyperparameters (parameters that control the training process): network architecture, learning rates, data augmentation strategies, and other factors we will discuss in upcoming chapters. So, just as the automatic training process is in danger of overfitting the training data, we are in danger of overfitting the validation data through human trial and error and exploration.
The solution is to introduce another level of even more highly reserved data: the test set. Just as we hold back the validation data from the training process, we must hold back the test set data even from ourselves. It cannot be used to improve the model; it can only be used to evaluate the model at the very end of our efforts.
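The three-way split can be sketched as follows (a plain illustration with made-up data and a hypothetical 60/20/20 ratio, not fastai's API):

```python
# Hold back both a validation set and a test set.
data = list(range(100))  # stand-in for a real dataset

n = len(data)
train = data[: int(0.6 * n)]              # fully exposed to training
valid = data[int(0.6 * n): int(0.8 * n)]  # used to compare model versions
test  = data[int(0.8 * n):]               # touched only once, at the very end

print(len(train), len(valid), len(test))  # 60 20 20
```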
A key property of the validation and test sets is that they must be representative of the new data you will see in the future. Therefore, you shouldn't always choose a random subset of your data.
For the following situations, how should you split the data between the training set and the validation set?
You are using historical data to build a model to predict the future sales in a chain of Ecuadorian grocery stores as you can see below.
![Historical sales data for a chain of Ecuadorian grocery stores]()
In the Kaggle distracted-driver competition, the independent variables are pictures of drivers at the wheel of a car, and the dependent variables are categories such as texting, eating, or safely looking ahead. Lots of pictures are of the same drivers in different positions, as we can see in this figure.
![The same drivers photographed in different positions]()
You are trying to create an algorithm to distinguish dogs and cats for the Kaggle Dogs vs. Cats competition.
The goal of the Kaggle fisheries competition was to identify the species of fish caught by fishing boats in order to reduce illegal fishing of endangered populations. The test set of Kaggle on which you'll do your predictions consisted of boats that didn't appear in the training data.
![Images from the fisheries competition]()
You are using historical data to build a model to predict the future sales in a chain of Ecuadorian grocery stores.
Therefore, we should put the most recent data in our validation set, so that it is representative of the new data the model will see in production.
Indeed, a random subset is a poor choice (too easy to fill in the gaps, and not indicative of what you'll need in production), as we can see:
![A random split versus a time-based split]()
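A time-based split like this can be sketched in a few lines (the dates and sales figures below are made up):

```python
# Sort records chronologically and hold out the most recent ones
# as the validation set.
records = [("2017-05-01", 120), ("2017-03-01", 90),
           ("2017-04-01", 110), ("2017-06-01", 150),
           ("2017-01-01", 80),  ("2017-02-01", 95)]

records.sort(key=lambda r: r[0])           # ISO dates sort chronologically
train, valid = records[:-2], records[-2:]  # the newest records go to validation

print([d for d, _ in valid])  # ['2017-05-01', '2017-06-01']
```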
Lots of pictures are of the same drivers in different positions.
The validation data should consist of images of people who don't appear in the training set, so that it is representative of the new data you will see in the future.
Indeed, if you used all the people in training your model, your model might be overfitting to particularities of those specific people, and not just learning the states (texting, eating, etc.).
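Such a split by person can be sketched like this (the driver IDs and image names are made up):

```python
# Every image of a given driver goes to exactly one set, so the
# validation drivers are never seen during training.
images = [("driver_a", "img1"), ("driver_a", "img2"),
          ("driver_b", "img3"), ("driver_c", "img4"),
          ("driver_c", "img5"), ("driver_b", "img6")]

valid_drivers = {"driver_c"}
train = [img for d, img in images if d not in valid_drivers]
valid = [img for d, img in images if d in valid_drivers]

print(train)  # ['img1', 'img2', 'img3', 'img6']
print(valid)  # ['img4', 'img5']
```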
A random split is a good answer here (it will keep a good ratio between the two classes in each set).
The test set consisted of boats that didn't appear in the training data.
This means that you'd want your validation set to include boats that are not in the training set in order to be representative of the new data you will see in the future.
Go to this link and learn the flash cards. #TODO
Bibliography
This post is based on Deep Learning for Coders [1].
- [1] J. Howard and S. Gugger, *Deep Learning for Coders with fastai and PyTorch: AI Applications Without a PhD*. O'Reilly Media, 2020.