The astounding success of artificial neural networks can be attributed in part to their ability to approximate the complex, non-linear functions that are often present in real-world data. This is achieved through activation functions, which introduce non-linearity into the networks and allow them to find a better fit for the data. Activation functions are therefore crucial to the effectiveness of neural networks.
This article will try to provide a relatively comprehensive but not overly technical overview of activation functions in neural networks. By the end, you'll have a firm grasp of the following:
- What activation functions are and why to use them
- How activation functions help neural networks learn
- What the most widely used activation functions are, along with their pros and cons
- How to choose an activation function when training a neural network
- Why activation functions need to be differentiable
Why use an activation function
While not all activation functions are non-linear, the overwhelming majority are, and for a good reason. Non-linear activation functions introduce additional complexity into neural networks and allow them to “learn” to approximate a much larger swathe of functions. Without non-linear activation functions, neural networks would only be able to learn linear and affine functions: a composition of affine layers is itself affine, so the whole network would amount to one glorified affine function, no matter how many layers it had. Another important aspect of activation functions is that many of them map inputs of unknown scale and distribution to a known range (e.g., the sigmoid function maps any input to a value between 0 and 1). This helps stabilize the training of neural networks and also helps map the values to the desired output in the last layer (for non-regression tasks).
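To make the "glorified affine function" point concrete, here is a minimal sketch in plain Python. The weights, biases, and the helper names `affine` and `matmul` are made up for illustration; the only claim is that two stacked layers without an activation give the same output as one suitably chosen layer.

```python
def affine(W, b, x):
    """Compute W @ x + b for a list-of-rows matrix W and vectors b, x."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def matmul(A, B):
    """Multiply two list-of-rows matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Two "layers" with arbitrary, made-up weights and no activation function.
W1, b1 = [[1.0, 2.0], [0.0, -1.0]], [1.0, 0.0]
W2, b2 = [[3.0, 1.0]], [2.0]

x = [4.0, 5.0]
two_layers = affine(W2, b2, affine(W1, b1, x))

# The same mapping written as a single affine layer:
# W = W2 @ W1 and b = W2 @ b1 + b2.
W = matmul(W2, W1)
b = affine(W2, b2, b1)
one_layer = affine(W, b, x)

# Putting a non-linearity (here ReLU) between the layers changes the result,
# so the network no longer matches that single collapsed layer.
hidden = [max(0.0, h) for h in affine(W1, b1, x)]
with_relu = affine(W2, b2, hidden)

print(two_layers, one_layer, with_relu)
```

The first two results are identical for any input, which is why depth alone adds nothing without a non-linearity in between.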
How do activation functions work?
Before discussing modern and widely used activation functions, it’s a good idea to get a solid understanding of how they work in a neural network. Regardless of network architecture, an activation function will take the values generated by a given layer of the network (in a fully connected network, this would be the weighted sum of the layer’s inputs plus a bias) and apply a certain transformation to these values, often mapping them to a specific range.
Here's a useful illustration of the role an activation function plays in a neural network.
After taking a weighted sum of the inputs plus the bias (W₁X₁ + W₂X₂ + … + WₙXₙ + b), we pass this value to the activation function f, which then gives us the output of the given neuron. In this case, each of the Xᵢ values is the output of a neuron from the previous layer, while Wᵢ are our neuron’s weights assigned to each input Xᵢ.
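The computation of a single neuron can be sketched in a few lines of Python. The function name `neuron`, the example numbers, and the choice of sigmoid as the activation are assumptions made purely for illustration.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Made-up outputs of three neurons from the previous layer,
# plus this neuron's own weights and bias.
inputs = [0.5, -1.0, 2.0]
weights = [0.4, 0.3, -0.2]
bias = 0.1

output = neuron(inputs, weights, bias, sigmoid)
print(output)
```

Swapping `sigmoid` for any other function of one number changes the neuron's activation without touching the rest of the code.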
Simple activation functions
With this in mind, what does a real-world activation function look like? Perhaps the simplest activation function one could think of is the identity function, which corresponds to not changing the inputs in any way. Of course, this wouldn’t be of much use as it literally doesn’t do anything, and so we would still face the aforementioned problem of an unpredictable distribution of values, destabilizing the training of our network. A somewhat more effective, but still super simple way to tackle this problem is the step activation function, an illustration of which can be found below.
As one can see, all the step activation function does is take the input and map it to either 0 or 1, depending on whether the input is smaller or larger than 0. While this gives a more predictable distribution of values, it’s almost never used, because you lose a lot of information by squishing all nuance out of the network, so to speak.
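A minimal sketch of the step function shows how much information it throws away. Treating an input of exactly 0 as "on" is an assumption here, since conventions vary.

```python
def step(z):
    """Binary step activation.

    Assumption: an input of exactly 0 maps to 1; conventions differ on this.
    """
    return 1.0 if z >= 0 else 0.0

values = [-3.0, -0.001, 0.001, 3.0]
outputs = [step(v) for v in values]
print(outputs)
```

Inputs as different as -3.0 and -0.001 produce the same output, so the next layer cannot tell them apart.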
Popular activation functions
Now that we have a solid grasp of what activation functions do, let’s discuss some functions that are actually used in practice. There has been a hefty amount of research regarding non-linear activation functions in recent years, which has introduced new and improved activation functions and, thus, affected the popularity of old ones. However, tried-and-true activation functions are still used often and have their place.
In no particular order, a few of the most widely used and known ones are:
- Sigmoid - This activation function maps any real-valued input to the range (0, 1). Unlike the step function, sigmoid doesn’t just output 0 or 1, but instead numbers in that range (not including 0 and 1 themselves). Here’s an illustration and the formula:
sigmoid(x) = 1 / (1 + e^(-x))
While the sigmoid function is better than the ones discussed before and does have its place (especially in tasks like binary classification), it has some major drawbacks. For inputs that are large in magnitude, whether positive or negative, the function saturates: its output flattens out near 0 or 1, its gradient becomes close to zero, and these saturated neurons “kill” the gradients during backpropagation. Another drawback is that since the range is (0, 1), the output of the sigmoid function is not 0-centered, which also causes problems during backpropagation (a detailed discussion of these phenomena is out of this article’s scope, though). Finally, the exponential is somewhat expensive to compute, which can slow down the network.
- Tanh - This activation function is somewhat similar to the sigmoid in the sense that it also maps the input values to an s-shaped curve, but in this case, the range is (-1, 1) and the output is centered at 0, which solves one of the problems with the sigmoid function. Tanh stands for the hyperbolic tangent, which is just the hyperbolic sine divided by the hyperbolic cosine, much like the regular tangent is the sine divided by the cosine. Here’s an illustration along with the formula:
tanh(x) = sinh(x) / cosh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
While tanh can be more effective than the sigmoid, it shares the sigmoid’s other problems: it saturates for inputs that are large in magnitude, causing the same issues during backpropagation, and it also relies on exponentials.
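The saturation problem shared by sigmoid and tanh is easy to see numerically. The sketch below uses the standard derivative identities sigmoid'(x) = sigmoid(x)(1 - sigmoid(x)) and tanh'(x) = 1 - tanh(x)²; the helper names are illustrative, and the plain `math.exp` version of sigmoid is only meant for moderate inputs like the ones used here.

```python
import math

def sigmoid(x):
    # Fine for moderate x; very negative x would overflow math.exp(-x).
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(
        f"x={x:6.1f}  sigmoid={sigmoid(x):.5f}  sigmoid'={sigmoid_grad(x):.2e}  "
        f"tanh={math.tanh(x):+.5f}  tanh'={tanh_grad(x):.2e}"
    )
```

Both gradients peak at x = 0 and shrink toward zero as the input grows in magnitude, which is what "killing" the gradient means in practice.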
- ReLU - This is a more modern and widely used activation function. It stands for Rectified Linear Unit and looks like this:
ReLU(x) = max(0, x)
The beauty of ReLU lies partly in its simplicity. As one can see, all it does is replace negative values with 0 and keep positive ones as is. Because it doesn’t saturate for positive inputs, it avoids “killing” the gradients of large values, while also being much cheaper to compute. In practice, networks using ReLU also tend to converge faster than those using sigmoid or tanh; the AlexNet paper (Krizhevsky et al., 2012), for example, reported a ReLU network reaching a given training error on CIFAR-10 about six times faster than an equivalent tanh network.
However, ReLU still faces some problems. First off, it’s not 0-centered, which can cause problems during training. Most importantly though, it does not deal with negative inputs in a particularly meaningful way: every negative input is mapped to 0 and has a gradient of 0, so a neuron that only ever receives negative inputs stops learning. More recent activation functions tend to take ReLU and try to fix these problems. Some of them are a bit too complex for the purposes of this article, but one that warrants discussion is Leaky ReLU.
- Leaky ReLU - This activation function builds on top of ReLU by trying to handle negative inputs in a more meaningful way. More specifically, instead of replacing negative values with 0, it multiplies them by some user-defined number between 0 and 1 (typically a small value such as 0.01), which essentially lets some of the information contained in the negative inputs “leak” into the model, hence the name.
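ReLU and Leaky ReLU are simple enough to write out in full, along with their gradients. This is a sketch: the parameter name `negative_slope` and its default of 0.01 are assumptions for illustration, and the gradient at exactly 0 is set by convention.

```python
def relu(x):
    """Replace negative values with 0, keep positive values as is."""
    return max(0.0, x)

def relu_grad(x):
    # Convention: use 0 as the gradient at x == 0.
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, negative_slope=0.01):
    """Like ReLU, but scale negative values instead of zeroing them."""
    return x if x > 0 else negative_slope * x

def leaky_relu_grad(x, negative_slope=0.01):
    return 1.0 if x > 0 else negative_slope

for x in (-2.0, 3.0):
    print(x, relu(x), relu_grad(x), leaky_relu(x), leaky_relu_grad(x))
```

For the negative input, ReLU returns 0 with a gradient of 0, while Leaky ReLU keeps a small, non-zero output and gradient, so some signal still flows backward.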
How to choose an activation function
Here’s the million-dollar question: how do you actually choose an activation function when training a network from scratch? The answer is simpler than it may seem, at least in my opinion. Generally, it’s a safe bet to choose one of the ReLU-based activation functions (including ReLU itself), since they have empirically proven to be very effective for almost any task. However, some architectures call for specific activation functions. For example, recurrent architectures such as LSTMs rely on sigmoid and tanh: their gates need outputs between 0 and 1 to control how much information passes through, which ReLU cannot provide. Other than such specific cases, one need not overthink the choice, at least during the initial stages of experimentation.
Why should an activation function be differentiable?
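As mentioned before, neural networks learn using an algorithm called backpropagation. This algorithm essentially uses the model’s incorrect predictions to adjust the network in a way that will make it less incorrect, thereby improving the network’s predictive capabilities. This is done through differentiation: the gradient of the error with respect to each weight is computed with the chain rule, which passes through the derivative of every activation function along the way. Therefore, in order for a network to be trainable, all its elements need to be differentiable, including the activation function. (In practice, being differentiable almost everywhere is enough: ReLU has a kink at 0, where its derivative is simply set to a fixed value such as 0.) The question of what constitutes differentiability warrants its own discussion, which unfortunately is out of this article’s scope.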
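To see where the derivative of the activation function enters training, consider a single neuron with one input. The sketch below assumes a sigmoid activation, a squared-error loss, and made-up numbers; it compares the chain-rule gradient to a finite-difference estimate and then takes one gradient descent step.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, target):
    """Squared error of a one-input neuron with a sigmoid activation."""
    return (sigmoid(w * x + b) - target) ** 2

def grad_w(w, b, x, target):
    """dLoss/dw via the chain rule: dL/dy * dy/dz * dz/dw."""
    y = sigmoid(w * x + b)
    dL_dy = 2.0 * (y - target)
    dy_dz = y * (1.0 - y)  # derivative of the activation function
    dz_dw = x
    return dL_dy * dy_dz * dz_dw

# Made-up weight, bias, input, and target.
w, b, x, target = 0.5, 0.0, 2.0, 1.0

analytic = grad_w(w, b, x, target)

# Central finite difference as an independent check of the gradient.
eps = 1e-6
numeric = (loss(w + eps, b, x, target) - loss(w - eps, b, x, target)) / (2 * eps)

# One gradient descent step: move the weight against the gradient.
learning_rate = 0.1
w_new = w - learning_rate * analytic

print(analytic, numeric)
print(loss(w, b, x, target), loss(w_new, b, x, target))
```

The `dy_dz` factor is the activation's derivative: if it were zero everywhere, as with the step function, the gradient would vanish and the weight would never move.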
To recap, activation functions are crucial for modern neural networks because they facilitate the learning of a much wider set of functions and thus allow the model to learn to perform all sorts of complex tasks. The impressive advances in computer vision, natural language processing, time series analysis, and many other fields would be nearly impossible without the opportunities created by non-linear activation functions. While exponential-based activation functions like the sigmoid and tanh have been used for decades and can yield good results, more modern ones like ReLU work better in most applications. As a rule of thumb, when training a neural network from scratch, one can simply use ReLU, Leaky ReLU, or GELU and expect decent results.
Author: Davit Grigoryan, Independent Machine Learning Engineer