# Deep Learning: Which Loss and Activation Functions should I use?

The purpose of this post is to provide guidance on which combination of final-layer activation function and loss function should be used in a neural network depending on the business goal.

This post assumes that the reader has knowledge of activation functions. An overview on these can be seen in the prior post: Deep Learning: Overview of Neurons and Activation Functions

### What are you trying to solve?

Like all machine learning problems, the business goal determines how you should evaluate it’s success.

#### Are you trying to predict a numerical value?

*Examples: Predicting the appropriate price of a product, or predicting the number of sales each day*

If so, see the section **Regression: Predicting a numerical value**

#### Are you trying to predict a categorical outcome?

*Examples: Predicting objects seen in an image, or predicting the topic of a conversation*

If so, you next need to think about how many classes there are and how many labels you wish to find.

If your data is binary, it is or isn’t a class (e.g. fraud, diagnosis, likely to make a purchase), see the section **Categorical: Predicting a binary outcome**

If you’ve multiple classes (e.g. objects in an image, topics in emails, suitable products to advertise) and they are exclusive — each item only has one label — see **Categorical: Predicting a single label from multiple classes**. If there are multiple labels in your data then you should look to section **Categorical: Predicting multiple labels from multiple classes**.

### Regression: Predicting a numerical value

*E.g. predicting the price of a product*

The final layer of the neural network will have one neuron and the value it returns is a continuous numerical value.

To understand the accuracy of the prediction, it is compared with the true value which is also a continuous number.

#### Final Activation Function

**Linear** — This results in a numerical value which we require

#### Loss Function

**Mean squared error (MSE)** — This finds the average squared difference between the predicted value and the true value

### Categorical: Predicting a binary outcome

*E.g. predicting a transaction is fraud or not*

The final layer of the neural network will have one neuron and will return a value between 0 and 1, which can be inferred as a probably.

To understand the accuracy of the prediction, it is compared with the true value. If the data is that class, the true value is a 1, else it is a 0.

#### Final Activation Function

**Sigmoid** — This results in a value between 0 and 1 which we can infer to be how confident the model is of the example being in the class

#### Loss Function

**Binary Cross Entropy** — Cross entropy quantifies the difference between two probability distribution. Our model predicts a model distribution of {p, 1-p} as we have a binary distribution. We use binary cross-entropy to compare this with the true distribution {y, 1-y}

### Categorical: Predicting a single label from multiple classes

*E.g. predicting the document’s subject*

The final layer of the neural network will have one neuron for each of the classes and they will return a value between 0 and 1, which can be inferred as a probably. The output then results in a probability distribution as it sums to 1.

To understand the accuracy of the prediction, each output is compared with its corresponding true value. True values have been one-hot-encoded meaning a 1 appears in the column corresponding to the correct category, else a 0 appears

#### Final Activation Function

**Softmax** — This results in values between 0 and 1 for each of the outputs which all sum up to 1. Consequently, this can be inferred as a probability distribution

#### Loss Function

**Cross Entropy** — Cross entropy quantifies the difference between two probability distribution. Our model predicts a model distribution of {p1, p2, p3} (where p1+p2+p3 = 1). We use cross-entropy to compare this with the true distribution {y1, y2, y3}

### Categorical: Predicting multiple labels from multiple classes

*E.g. predicting the presence of animals in an image*

The final layer of the neural network will have one neuron for each of the classes and they will return a value between 0 and 1, which can be inferred as a probably.

To understand the accuracy of the prediction, each output is compared with its corresponding true value. If 1 appears in the true value column, the category it corresponds to is present in the data, else a 0 appears.

#### Final Activation Function

**Sigmoid** — This results in a value between 0 and 1 which we can infer to be how confident it is of it being in the class

#### Loss Function

**Binary Cross Entropy** — Cross entropy quantifies the difference between two probability distribution. Our model predicts a model distribution of {p, 1-p} (binary distribution) for each of the classes. We use binary cross-entropy to compare these with the true distributions {y, 1-y} for each class and sum up their results

### Summary Table

The following table summarizes the above information to allow you to quickly find the final layer activation function and loss function that is appropriate to your use-case

I hope this post was valuable! For further information on neural networks and final activation functions, please see the prior post: