ReLU, Sigmoid & Tanh Activation Functions

The activation function is a really important and really easy-to-understand concept. There are many activation functions, but only a few are widely used. ReLU is the most important one because it is the foundation of modern deep neural networks. We rarely use the sigmoid or tanh activation functions in deep neural networks anymore, simply because we don’t have to. We will understand why in this blog.


If you want to understand the activation function concept and the reason behind using a non-linear activation function in neural networks, read the following blog before reading this one.

Part 1 of this blog

Let’s lay out the structure of this blog first: we will start with the reasons for not using the age-old sigmoid or tanh in modern networks. Then we will go through these functions one by one, understanding the reasons along the way. After that, we will jump to a modern activation function, ReLU. We will spend a little extra time on ReLU because it is important to understand it well.

By the end of this blog, we will know about the sigmoid, tanh and ReLU activation functions.

So, let’s start with the first step. Here are a few reasons why we don’t use sigmoid or tanh in modern neural networks:

Reasons why we don’t use tanh and sigmoid

  1. Sigmoid and tanh contain exponential terms, which are harder to compute.
  2. They both squish the output into a small range, which makes it harder to train deeper neural networks.
  3. Both suffer from the vanishing gradient problem.
  4. Sigmoid is not zero-centered (though this doesn’t matter that much).

Keep these reasons aside for a while and let’s start understanding the functions one by one; we will cover the reasons along the way.

Sigmoid activation function

Let’s see the graph first,

Sigmoid is an activation function with an S-shaped graph. It takes any value x and squishes it into the range between 0 and 1.

The sigmoid formula: sigmoid(x) = 1 / (1 + e^(-x))

It contains an exponential term, which is harder for our computers to calculate (compute) and thus makes it slower to train a model. Yes, that’s one of the reasons we don’t use sigmoid.

Another problem is that sigmoid is not zero-centered: its output is always positive. Because of that, the weight updates tend to follow a zigzag path during training, which slows things down. (Although, as reason 4 above says, this doesn’t matter that much in practice.)
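As a quick illustration (a minimal NumPy sketch of my own, not code from the original post), here is sigmoid applied to a few sample inputs; every output is positive and squashed into (0, 1):

```python
import numpy as np

def sigmoid(x):
    # sigmoid(x) = 1 / (1 + e^(-x)); the output always lies in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(x))
# roughly [0.000045, 0.12, 0.5, 0.88, 0.99995] -- all positive, never reaching 0 or 1
```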

Before discussing the other problems, let’s have a look at tanh.

Tanh activation function

The tanh function also contains exponential terms, like sigmoid. The main difference between tanh and sigmoid is that tanh is zero-centered. We can see this in the following graph,

The tanh graph is zero-centered, i.e. its range is -1 to 1. Because its outputs can be negative as well as positive, it avoids the zigzag-update problem we saw with sigmoid. But as we said, that issue doesn’t matter much.

The tanh formula: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))

At each neuron we first apply a linear function, adding up the weighted inputs and a bias, and only then apply the activation function. Note that tanh is really just a rescaled and shifted sigmoid, tanh(x) = 2·sigmoid(2x) - 1, which is exactly what makes it zero-centered.
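Here is a tiny sketch (my own check, not from the original post) verifying both points: tanh stays within (-1, 1), and it is the same curve as a rescaled, shifted sigmoid:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-6.0, 6.0, 121)
print(np.tanh(x).min(), np.tanh(x).max())            # stays inside (-1, 1)

# tanh(x) = 2 * sigmoid(2x) - 1
print(np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0))   # True
```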

The most important reason for not using sigmoid and tanh is the following:

Vanishing gradient problem

In the graph above, we can see that the derivative of the sigmoid function gets closer to zero as |x| increases.

We use this derivative in the back-propagation step, where the weights get updated to reduce the loss (the difference between the actual and predicted output). Since the activation function is applied in every layer, the chain rule multiplies one such derivative into the gradient for every layer it passes through.

Hence, the deeper the network gets, the more of these small derivatives are multiplied together, and the gradient that reaches the early layers gets closer and closer to zero. This makes it nearly impossible to update the weights of those layers.

Both sigmoid and tanh functions suffer from this same problem.

We call this the vanishing gradient problem: it becomes almost impossible to update the weights of the layers far from the output.
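To see how fast this happens, here is a rough, simplified sketch (my own illustration; it ignores the weight terms in the chain rule and just multiplies the best-case sigmoid derivative once per layer):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # never larger than 0.25 (its value at x = 0)

# Best case: every layer contributes a factor of 0.25 to the gradient.
for depth in (1, 5, 10, 20):
    print(depth, 0.25 ** depth)
# 1  0.25
# 5  0.0009765625
# 10 ~9.5e-07
# 20 ~9.1e-13   -> essentially nothing reaches the earliest layers
```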

To solve all of these problems we use the ReLU function, which stands for Rectified Linear Unit.

ReLU activation function

ReLU, or the rectified linear unit, can be written as ReLU(x) = max(0, x).

ReLU is a piecewise-linear function (in formal terms it’s still non-linear): it is the identity for values above zero, and it outputs zero for everything below zero (see the graph above).

There are no exponential terms in ReLU like we had in sigmoid/tanh.

It does not bound the output to a fixed range, i.e. the output is free to take any positive value, because the function is linear after zero.

The derivative of ReLU is 1 for all values above zero. These gradients do not shrink as they are multiplied through the layers, which helps the model converge faster and lets even the deeper layers keep training.
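A minimal sketch (my own, not from the original post) of ReLU and its derivative:

```python
import numpy as np

def relu(x):
    # identity for positive inputs, zero otherwise
    return np.maximum(0.0, x)

def relu_derivative(x):
    # 1 where x > 0, 0 elsewhere
    return (x > 0).astype(float)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))             # [0.  0.  0.  0.5 3. ]
print(relu_derivative(x))  # [0. 0. 0. 1. 1.]
# Along a path of active (positive) units the gradient factor is 1 * 1 * ... = 1,
# so it does not shrink with depth the way sigmoid's does.
```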

Sparsity

For inputs below zero, ReLU outputs zero (and its derivative is zero too), so neurons that receive negative values simply output nothing. In simple terms, only a subset of neurons is active at a time, i.e. the network becomes sparsely active, and this is desirable.

Sparsity is important because it makes the network easier to train. Suppose we are trying to identify whether an image shows a cat or a dog. With ReLU, one set of neurons can become active for dog identification while a different set becomes active for cat identification, and the remaining neurons simply don’t contribute to the final output.

In simple words, only a certain set of neurons is used to produce a certain output. This makes it easier to get an accurate result, because only that specific set of neurons has to learn the features for that specific output.
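A toy sketch (my own illustration, with random numbers standing in for pre-activations) showing how ReLU produces this sparsity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-activations of one hidden layer: roughly half are negative.
pre_activations = rng.normal(size=1000)
activations = np.maximum(0.0, pre_activations)   # ReLU

print(f"{np.mean(activations == 0.0):.0%} of the neurons output exactly zero")
# prints roughly "50% of the neurons output exactly zero"
```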

Leaky-ReLU

While there are many advantages of ReLU over sigmoid or tanh, there is one disadvantage too.

In ReLU, the slope for negative values is zero, i.e. the derivative is zero. If a neuron gets stuck producing negative values, its gradient is always zero, so it can never recover and contribute to the output again: it becomes a dead neuron.

Hence, when we have a large negative bias, we end up with many dead neurons in the network, which is not a good idea.

To solve this problem we use Leaky-ReLU. Essentially, it has a small “leak” before zero: for negative inputs it outputs a small fraction of the input (for example 0.01x) instead of zero, so the slope (derivative) there is non-zero. This avoids the problem of dying neurons.
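A minimal sketch of Leaky-ReLU (my own, with the commonly used slope 0.01 as an assumed default):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # identity for x > 0, a small slope alpha for x <= 0
    return np.where(x > 0, x, alpha * x)

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(leaky_relu(x))   # [-0.05 -0.01  0.    1.    5.  ]
# The slope for negative inputs is alpha instead of zero, so some gradient
# still flows and a "dead" neuron can recover.
```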

Conclusion

We use ReLU as the activation function in our neural networks almost all the time. ReLU is easier to compute and does not suffer from the vanishing gradient problem the way sigmoid and tanh do.

Do you have any questions or suggestions? Please feel free to reach out on [email protected] or hit me anytime on Twitter, LinkedIn!
