Activation Function: Top 9 Most Popular Explained & When To Use Them

by | Jun 16, 2023 | Machine Learning

What is an activation function?

In artificial neural networks, an activation function is a mathematical function that introduces non-linearity to the output of a neuron or a neural network layer. It determines whether a neuron should be activated (fire) based on its inputs’ weighted sum.

The activation function takes the sum of the weighted inputs and applies a transformation to produce the output. This output is then passed to the next layer or used as the network’s final output. The activation function is a crucial neural network component, allowing it to learn complex patterns and make non-linear decisions.
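As a minimal sketch of the step described above, the snippet below computes a weighted sum for a single neuron and passes it through an activation. The input values, the weights, and the helper name `neuron_output` are hypothetical and chosen only for illustration; any bias term is left out to match the description.

```python
import math

def neuron_output(inputs, weights, activation):
    """Apply an activation to the weighted sum of a neuron's inputs."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return activation(weighted_sum)

# Hypothetical inputs and weights, chosen only for illustration.
inputs = [1.0, 2.0]
weights = [0.5, 0.25]

linear_out = neuron_output(inputs, weights, lambda z: z)  # no activation applied
tanh_out = neuron_output(inputs, weights, math.tanh)      # squashed into (-1, 1)
```

Swapping the `activation` argument is all it takes to change the neuron's behaviour, which is why the choice of function matters so much.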

Top 9 types of activation functions in neural networks

1. Sigmoid function

The sigmoid function is a commonly used activation function in neural networks. Its smooth, S-shaped curve maps the input to a value between 0 and 1, making it useful for binary classification problems or when we want to represent probabilities.

The sigmoid function is defined as:

f(x) = 1 / (1 + e^(-x))

In this equation, x represents the weighted sum of the inputs to a neuron or a layer of neurons. The exponential term in the denominator ensures that the output is always between 0 and 1.
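The formula can be sketched in a few lines of Python. This version is one possible implementation, not the only one: it branches on the sign of x so that `math.exp` is never called with a large positive argument, which would otherwise raise an overflow error in plain Python.

```python
import math

def sigmoid(x):
    """Sigmoid, written to avoid overflow in exp() for inputs of large magnitude."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # safe: x < 0, so e lies in (0, 1)
    return e / (1.0 + e)
```

Both branches are algebraically identical to 1 / (1 + e^(-x)); the second simply multiplies numerator and denominator by e^x.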

Figure: The sigmoid function. Source: Wikipedia

Properties of the Sigmoid function

  1. Output Range: The output of the sigmoid function is bounded between 0 and 1. When the input is large and positive, the output approaches 1. Similarly, when the input is large and negative, the output approaches 0.
  2. Non-linearity: The sigmoid function introduces non-linearity to the network’s decision-making process. This non-linear property allows neural networks to model complex relationships between inputs and outputs.
  3. Smoothness: The sigmoid function is a smooth and differentiable function which facilitates efficient gradient-based optimization algorithms during the training of neural networks.

Limitations of the Sigmoid function

  1. Vanishing Gradients: The gradient of the sigmoid function becomes very small for inputs of large magnitude, whether positive or negative, leading to the problem of vanishing gradients. This can hinder the learning process, especially in deep neural networks.
  2. Output Saturation: The sigmoid function saturates at the extremes (0 and 1), meaning that when the input is very negative or very positive, the output becomes close to 0 or 1, respectively. In these flat regions the network becomes less sensitive to changes in the input, which slows down learning.
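To make the two limitations concrete, the sketch below evaluates the sigmoid derivative, f'(x) = f(x) * (1 - f(x)), at a few points. The sample inputs and the ten-layer depth are arbitrary illustrative choices, and the product ignores the weight matrices that also scale gradients in a real network.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: f(x) * (1 - f(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The slope peaks at 0.25 (at x = 0) and shrinks quickly as |x| grows.
grads = {x: sigmoid_grad(x) for x in (0.0, 2.0, 5.0, 10.0)}

# Best case for ten stacked sigmoid layers: each contributes at most 0.25.
best_case_product = 0.25 ** 10
```

Even in the best case, the activation terms alone shrink the gradient by roughly a factor of a million over ten layers.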

Due to these limitations, alternative activation functions like ReLU, Leaky ReLU, and variants have gained popularity, especially in deep learning architectures. However, the sigmoid function still finds applications in specific scenarios, such as the output layer of binary classification models or when the goal is to obtain probabilities.

2. Rectified Linear Unit (ReLU) function

The Rectified Linear Unit (ReLU) is a widely used activation function in neural networks, particularly deep learning models. It is known for its simplicity and effectiveness in overcoming the limitations of other activation functions like the sigmoid and tanh functions.

The ReLU function is defined as follows:

f(x) = max(0, x)

In this equation, x represents the weighted sum of the inputs to a neuron or a layer of neurons. The ReLU function returns 0 for negative inputs and the input value itself for positive inputs.
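A minimal Python sketch of ReLU and its derivative follows. The derivative is not defined at exactly x = 0; returning 0 there is a common convention and is an assumption of this sketch.

```python
def relu(x):
    """ReLU: 0 for negative inputs, the input itself otherwise."""
    return max(0.0, x)

def relu_grad(x):
    # Undefined at x = 0; this sketch follows the convention of using 0 there.
    return 1.0 if x > 0 else 0.0
```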

Properties of the ReLU activation function

  1. Non-linearity: Like other activation functions, ReLU introduces non-linearity to the network, enabling it to learn and model complex relationships in the data. With enough hidden units, networks built from ReLU neurons can approximate a very wide range of non-linear functions.
  2. Sparsity: ReLU encourages sparsity in neural networks. Since it outputs 0 for negative inputs, ReLU neurons can be completely inactive for specific inputs. This sparsity can lead to more efficient and concise representations of the data.
  3. Avoiding Vanishing Gradient: ReLU helps alleviate the vanishing gradient problem encountered in deep neural networks. The derivative of ReLU is 1 for positive inputs and 0 for negative inputs, so gradients passing through active neurons are not shrunk as the network gets deeper. This property facilitates more effective backpropagation and faster training.
  4. Computational Efficiency: The ReLU function is computationally efficient to evaluate compared to more complex functions like the sigmoid and tanh. It involves simple thresholding and avoids costly exponential calculations.

Limitations of ReLU

  1. Dead Neurons: ReLU neurons can sometimes become “dead” or non-responsive, where the neuron’s output is 0 for every input it receives. Because the gradient is also 0 in that region, the neuron’s weights stop being updated, and it no longer contributes to the learning process. This issue can be addressed using variants of ReLU, such as Leaky ReLU or Parametric ReLU (PReLU).
  2. Unbounded Output: ReLU does not saturate for positive inputs; it returns the input value itself, so its output has no upper bound. Activations can therefore grow very large, which may require careful weight initialization or normalization to keep training stable. ReLU saturates only on the negative side, where both the output and the gradient are 0.
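The dead-neuron case can be illustrated with a single hypothetical neuron whose bias has drifted far into the negative range. The weight, bias, and inputs below are made-up values for illustration, not taken from any real model.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

# Hypothetical neuron whose bias has been pushed far negative during training.
weight, bias = 1.0, -10.0
inputs = [-1.0, -0.5, 0.0, 0.5, 1.0]

pre_activations = [weight * x + bias for x in inputs]
outputs = [relu(z) for z in pre_activations]

# Gradient of the output with respect to the weight is relu_grad(z) * x.
weight_grads = [relu_grad(z) * x for z, x in zip(pre_activations, inputs)]
```

Every output and every weight gradient is zero, so gradient descent has no signal with which to move this neuron back into its active range.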

Due to its simplicity and effectiveness, ReLU is widely used in deep learning architectures, particularly in convolutional neural networks (CNNs). It has performed excellently in various computer vision and natural language processing tasks.

3. Hyperbolic tangent (tanh) function

The Hyperbolic Tangent (tanh) function is an activation function commonly used in neural networks. It is similar to the sigmoid function but is centred around zero and ranges between -1 and 1. The tanh function is defined as follows:

f(x) = (e^x - e^(-x)) / (e^x + e^(-x))

In this equation, ‘x’ represents the weighted sum of the inputs to a neuron or a layer of neurons.

Properties of the Tanh activation function

  1. Non-linearity: Like other activation functions, tanh introduces non-linearity to the network, enabling it to learn and model complex relationships in the data. With enough hidden units, networks built from tanh neurons can approximate a very wide range of non-linear functions.
  2. Symmetry: The tanh function is an odd function, symmetric about the origin (0, 0): tanh(-x) = -tanh(x). It produces negative outputs for negative inputs and positive outputs for positive inputs, resulting in a smooth S-shaped curve.
  3. Output Range: The output of the tanh function is bounded between -1 and 1. When the input is large and positive, the output approaches 1. Similarly, when the input is large and negative, the output approaches -1.
  4. Zero-Centred: Unlike the sigmoid function, which is centred around 0.5, the tanh function is centred around zero. This can be advantageous in some cases, such as when the input data is zero-centred or when the model benefits from negative and positive activations.

The tanh function shares some similarities with the sigmoid function, but it has a steeper gradient around zero (a maximum slope of 1, compared with 0.25 for the sigmoid), which makes it more sensitive to changes in the input. However, like the sigmoid function, it can still suffer from the vanishing gradient problem for inputs of large magnitude, whether positive or negative.
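The relationship between tanh and the sigmoid can be checked numerically. The sketch below uses the identity tanh(x) = 2 * sigmoid(2x) - 1 and the derivative 1 - tanh(x)^2; the helper names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_from_sigmoid(x):
    # Identity: tanh(x) = 2 * sigmoid(2x) - 1
    return 2.0 * sigmoid(2.0 * x) - 1.0

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2
```

The derivative equals 1 at x = 0 and falls towards 0 as the magnitude of x grows, which is the same saturation behaviour the sigmoid shows.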

The tanh function is used in various neural network architectures, particularly in recurrent neural networks (RNNs) and long short-term memory (LSTM) networks, where its zero-centred property and non-linear characteristics can be beneficial. However, in recent years, the popularity of the ReLU and its variants has increased, mainly due to their simplicity and better performance in deep learning models.

4. Softmax activation function

The Softmax function is an activation function commonly used in the output layer of a neural network for multi-class classification problems. It takes a vector of real numbers as input and normalizes it into a probability distribution, where the sum of the probabilities equals 1. The Softmax function is defined as follows:

f(x_i) = e^(x_i) / sum_j e^(x_j)

In this equation, x_i represents the input value for a particular class i, and the sum in the denominator runs over the input values x_j for all the classes.

Properties of the Softmax activation function

  1. Probability Distribution: The Softmax function transforms the input values into a probability distribution, where each value represents the probability of the corresponding class. The output values are positive and sum up to 1, making it suitable for multi-class classification problems.
  2. Sensitivity to Input Differences: The Softmax function amplifies the differences between the input values, which means that larger input values will correspond to higher probabilities. This property allows the neural network to make more confident predictions for higher-score classes.
  3. Differentiability: The Softmax function is differentiable, which is crucial for backpropagation and gradient-based optimization algorithms used during the training process of neural networks.

The Softmax function is typically used in the final layer of a neural network for multi-class classification tasks, where the goal is to assign an input to one of several possible classes. The class with the highest probability outputted by the Softmax function is usually considered the predicted class.

It is important to note that the Softmax function involves exponentials, so large input values can lead to numerical instability. This issue can be mitigated by subtracting the maximum input value from every element of the input vector before applying the Softmax function. Because Softmax is unchanged when the same constant is added to all inputs, this shift does not alter the output, and it prevents numerical overflow in the exponentials.
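A minimal sketch of Softmax with the maximum subtracted first is shown below. The input vectors are arbitrary examples; the second one would overflow if `math.exp` were applied to it directly.

```python
import math

def softmax(logits):
    """Softmax with the maximum subtracted first to avoid overflow."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
large = softmax([1000.0, 1001.0, 1002.0])  # a naive exp(1000.0) would overflow
```

Because only the differences between inputs matter, `softmax([0.0, 1.0, 2.0])` and `softmax([1000.0, 1001.0, 1002.0])` give the same probabilities.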

Overall, the Softmax function is a fundamental tool in multi-class classification tasks, enabling the neural network to provide a probabilistic interpretation of its predictions.

5. Leaky ReLU activation function

The Leaky ReLU (Rectified Linear Unit) is a variant of the ReLU activation function that addresses the “dead ReLU” problem. The Leaky ReLU introduces a small slope for negative inputs, allowing the neurons to have a non-zero output even when the input is negative. This helps mitigate the “dead” or non-responsive neurons in regular ReLU.

The Leaky ReLU function is defined as:

f(x) = max(ax, x)

In this equation, x represents the input to a neuron or a layer of neurons, and a is a small positive constant (usually a small fraction like 0.01). If x is positive, the function behaves like a regular ReLU, outputting x. However, if x is negative, the function returns ax instead of 0.
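As a short sketch, Leaky ReLU and its derivative can be written as follows. The default slope of 0.01 follows the value mentioned above, and the `max` form is only valid when a lies between 0 and 1.

```python
def leaky_relu(x, a=0.01):
    # For 0 < a < 1, max(a * x, x) picks x when x >= 0 and a * x when x < 0.
    return max(a * x, x)

def leaky_relu_grad(x, a=0.01):
    # Slope is 1 on the positive side and a on the negative side.
    return 1.0 if x > 0 else a
```

Unlike ReLU, the gradient on the negative side is small but never exactly zero, so the neuron's weights can still be updated.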

Properties of the Leaky ReLU activation function

  1. Non-linearity: Similar to ReLU, the Leaky ReLU introduces non-linearity to the network, enabling it to learn and model complex relationships in the data.
  2. Avoiding Dead Neurons: Introducing a non-zero slope for negative inputs helps mitigate the problem of dead or non-responsive neurons encountered in regular ReLU. With a small positive slope, even neurons that receive negative inputs can still contribute to the learning process.
  3. Continuous and Piecewise Linear: The Leaky ReLU function is continuous and piecewise linear, so it has a well-defined derivative everywhere except at x = 0 (1 for positive inputs and a for negative inputs). In practice, one of these two values is used at x = 0. This property facilitates backpropagation and efficient gradient-based optimization algorithms during the training process.

The choice between ReLU and Leaky ReLU depends on the specific problem and the characteristics of the data. Leaky ReLU is often preferred over regular ReLU when the risk of dead neurons is high or when having a more diverse range of activations is desirable by allowing negative values.

In recent years, other variants of ReLU, such as Parametric ReLU (PReLU), have also been developed. PReLU generalizes the concept of Leaky ReLU by allowing the a parameter to be learned during the training process rather than being predefined. This enables the network to determine the slope based on the data adaptively.

Leaky ReLU is a popular choice in neural networks, especially in scenarios where regular ReLU may lead to dead neurons or a broader range of activation values is desired.

6. Parametric ReLU (PReLU) activation function

Parametric ReLU (PReLU) is an activation function that extends the Rectified Linear Unit (ReLU) functionality by introducing a learnable parameter. In PReLU, the slope for negative inputs is not fixed but is learned during the training process.

The PReLU activation function is defined as follows:

f(x) = x if x >= 0

f(x) = ax if x < 0

In this equation, x represents the input to a neuron or a layer of neurons, and a is a learnable parameter that controls the slope for negative inputs.
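To show what "learnable" means here, the sketch below performs a single gradient-descent update on a. The initial slope, the learning rate, the squared-error loss, and the sample input and target are all assumptions made for illustration.

```python
def prelu(x, a):
    return x if x >= 0 else a * x

def prelu_grad_a(x):
    # Derivative of f with respect to a: 0 for x >= 0, x for x < 0.
    return 0.0 if x >= 0 else x

# One illustrative gradient-descent step on a, using a squared-error loss.
a = 0.25              # assumed initial slope
x, target = -2.0, -1.0
learning_rate = 0.1   # assumed value

y = prelu(x, a)                    # a * x for this negative input
loss_grad_y = 2.0 * (y - target)   # derivative of (y - target)^2 with respect to y
a_new = a - learning_rate * loss_grad_y * prelu_grad_a(x)
```

After the update, `prelu(x, a_new)` is closer to the target than before. Only negative inputs contribute to the gradient of a, since a has no effect on the positive side.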

Properties of PReLU

  1. Non-linearity: Similar to ReLU and Leaky ReLU, PReLU introduces non-linearity to the network, enabling it to model complex relationships in the data.
  2. Adaptive Slope: The main advantage of PReLU over Leaky ReLU is that the negative slope is learned from the data, either per neuron or shared across a group of neurons, rather than fixed in advance. This adaptability can improve the flexibility and expressiveness of the network.
  3. Mitigating Dead Neurons: By allowing negative values to pass through with an adaptive slope, PReLU helps prevent the issue of dead or non-responsive neurons encountered in regular ReLU.

The choice between ReLU, Leaky ReLU, and PReLU depends on the specific problem and the characteristics of the data. PReLU is often used when there is a concern about dead neurons or when it is desirable to have a learnable slope that can better capture the nuances of the data.

It’s worth noting that PReLU introduces additional parameters to be learned, which increases the model’s complexity and computational requirements. Consequently, PReLU might be more suitable for larger datasets and more complex models.

PReLU has been successfully applied in various deep learning architectures, including convolutional neural networks (CNNs) and recurrent neural networks (RNNs), and has demonstrated improved performance in specific scenarios compared to other activation functions.

7. Exponential Linear Unit (ELU) activation function

The Exponential Linear Unit (ELU) is an activation function that aims to overcome some of the limitations of the traditional Rectified Linear Unit (ReLU) function, such as the dying ReLU problem and the saturation of negative values. ELU introduces a differentiable function that smoothly saturates negative inputs and gives negative values a non-zero output.

The ELU function is defined as follows:

f(x) = x if x >= 0

f(x) = a(e^x - 1) if x < 0

In this equation, x represents the input to a neuron or a layer of neurons, and a is a positive constant (often set to 1) that controls the value, -a, at which the function saturates for large negative inputs.
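A minimal Python sketch of ELU and its derivative follows. The default a = 1.0 is an assumption of this sketch.

```python
import math

def elu(x, a=1.0):
    """ELU: identity for x >= 0, a * (e^x - 1) for x < 0."""
    return x if x >= 0 else a * (math.exp(x) - 1.0)

def elu_grad(x, a=1.0):
    # Slope is 1 on the positive side and a * e^x on the negative side.
    return 1.0 if x >= 0 else a * math.exp(x)
```

For very negative inputs the output levels off at -a and the gradient shrinks towards 0, so ELU also saturates on the negative side, only more gradually than ReLU.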

Properties of the ELU activation function

  1. Non-linearity: Like other activation functions, ELU introduces non-linearity to the network, enabling it to model complex relationships in the data.
  2. Smooth Saturation: The ELU function smoothly saturates negative inputs, avoiding the abrupt saturation of ReLU. This helps to alleviate the issue of dead or non-responsive neurons encountered in ReLU.
  3. Continuity and Differentiability: ELU is a continuous function, and with a = 1 it is also differentiable everywhere, including at x = 0, allowing for efficient backpropagation and gradient-based optimization algorithms during training.
  4. Negative Output: ELU allows negative values to have a non-zero output, which can be helpful in cases where capturing negative activations is essential for the task at hand.
  5. Exponential Decay: For negative inputs, the output approaches -a exponentially fast as x approaches negative infinity, providing a bounded and well-behaved response to extreme negative inputs.

It’s important to note that ELU introduces additional computational complexity compared to ReLU and other more straightforward activation functions due to the exponential function. However, the improved performance and mitigated limitations make it a popular choice, especially in deep learning architectures.

ELU has been used in various applications and has been reported to improve learning efficiency and generalization compared to ReLU in some deep neural networks.

When choosing an activation function, it is vital to consider the specific problem and the characteristics of the data and experiment with different options to find the most suitable activation function for optimal performance.

8. Gaussian Error Linear Unit (GELU) activation function

The Gaussian Error Linear Unit (GELU) is an activation function that weights each input by the probability that a standard Gaussian random variable falls below it: f(x) = x * Phi(x), where Phi is the standard Gaussian cumulative distribution function (CDF). It provides a smooth approximation to the rectifier.

The GELU activation function is commonly computed with the following tanh approximation:

f(x) = 0.5 * x * (1 + tanh(sqrt(2 / pi) * (x + 0.044715 * x^3)))

In this equation x represents the input to a neuron or a layer of neurons.
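The formula above is a tanh-based approximation of x * Phi(x), where Phi is the standard Gaussian CDF. The sketch below compares the two forms using the standard library's `math.erf`; the function names and the sampled range of -5 to 5 are illustrative choices.

```python
import math

def gelu_exact(x):
    # x * Phi(x), with Phi the standard normal CDF written via erf.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # The tanh approximation given in the formula above.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Largest absolute difference between the two forms on a grid from -5 to 5.
max_gap = max(abs(gelu_exact(v / 10.0) - gelu_tanh(v / 10.0)) for v in range(-50, 51))
```

On this grid the two versions stay close together, which is why the approximation is often used in place of the exact form.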

Properties of the GELU activation function

  1. Non-linearity: GELU introduces non-linearity to the network, allowing it to model complex relationships in the data.
  2. Smoothness: GELU is a smooth function that transitions from negative to positive inputs. It is differentiable everywhere, facilitating backpropagation and gradient-based optimization.
  3. Gaussian Weighting: GELU multiplies the input by the Gaussian cumulative distribution function (CDF) evaluated at that input. This lets GELU exhibit similar behaviour to the rectifier function while providing smoother gradients.
  4. Asymptotic Behaviour: For large negative inputs, the GELU output approaches 0; for large positive inputs, it approaches x and is unbounded, much like ReLU. Between these regimes, it dips slightly below zero for small negative inputs.

GELU has gained popularity in deep learning models, particularly in natural language processing tasks and transformer architectures. It has shown improved convergence speed and generalization performance compared to other activation functions, such as ReLU.

However, it is worth noting that GELU introduces additional computational complexity compared to ReLU, because it requires evaluating a hyperbolic tangent (or the error function). Therefore, it might have a slight impact on the overall computational efficiency of the model.

When choosing an activation function, it is essential to consider the specific requirements of the problem and experiment with different options to find the most suitable activation function for optimal performance.

9. Linear activation function

The Linear activation function, also known as the identity function, is one of the simplest activation functions used in neural networks. It applies a linear transformation to the input, meaning the output equals the input without any non-linear mapping. The linear activation function is defined as:

f(x) = x

In this equation, ‘x’ represents the input to a neuron or a layer of neurons.

The Linear activation function has the following properties:

  1. Linearity: As the name suggests, the Linear activation function keeps the neuron's output linear in its input. It passes the weighted sum through unchanged, without introducing any non-linear transformation.
  2. No Activation: Unlike other activation functions that introduce non-linearities to capture complex relationships, the Linear activation function does not alter the input values. It is essentially a pass-through function.
  3. Limited Representation Power: Since the Linear activation function does not introduce non-linearity, it has limited representation power. Neural networks with only linear activation functions can only learn linear relationships between the input and output.
  4. Gradient Stability: The gradient of the Linear activation function is constant (equal to 1) and does not depend on the input. This can be advantageous in some cases, as the activation itself neither shrinks nor amplifies gradients during backpropagation.

The Linear activation function is often used in the output layer of regression problems, where the goal is to predict continuous values. It allows the neural network to directly output values without any transformation.

However, non-linear activation functions are typically used for most other tasks, such as classification or complex pattern recognition. Non-linear activation functions enable neural networks to learn and represent more complex relationships in the data.

It’s important to note that even though individual layers of a neural network may use linear activation functions, stacking multiple linear layers does not increase the model’s representative capacity beyond that of a single linear layer. To model complex relationships, non-linear activation functions are essential in intermediate layers of the network.
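The collapse of stacked linear layers can be verified directly. The sketch below uses two small hypothetical weight matrices with no biases and shows that applying them in sequence gives the same result as applying their product once.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

# Hypothetical weights for two stacked layers with identity activation, no biases.
W1 = [[1.0, 2.0], [0.0, -1.0]]
W2 = [[0.5, 1.0], [2.0, 0.0]]
x = [3.0, 4.0]

two_layers = matvec(W2, matvec(W1, x))  # layer 1, then layer 2
one_layer = matvec(matmul(W2, W1), x)   # a single layer with weights W2 * W1
```

Both computations produce the same vector, so the second layer adds no representational capacity unless a non-linear activation sits between the two.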

Conclusion

Activation functions play a critical role in neural networks by introducing non-linearity, enabling the models to learn and represent complex relationships in the data. Several commonly used activation functions have been discussed:

  1. Sigmoid Function: The sigmoid function maps the input to a range between 0 and 1, suitable for binary classification tasks and providing smooth outputs. However, it suffers from the vanishing gradient problem and is less commonly used in deep neural networks.
  2. Rectified Linear Unit (ReLU): ReLU sets negative inputs to zero and keeps positive inputs unchanged. It is widely used due to its simplicity, computational efficiency, and effective mitigation of the vanishing gradient problem. However, it can suffer from dead neurons for negative inputs.
  3. Hyperbolic Tangent (tanh) Function: The tanh function is similar to the sigmoid function but centred around zero, ranging from -1 to 1. It is often used in recurrent neural networks (RNNs) and can capture both positive and negative activations. However, it also suffers from the vanishing gradient problem.
  4. Softmax Function: The softmax function is used in the output layer for multi-class classification, transforming a vector of real numbers into a probability distribution. It assigns probabilities to each class, facilitating selection of the most probable class.
  5. Leaky ReLU: Leaky ReLU introduces a small slope for negative inputs, preventing dead neurons and allowing negative values to have a non-zero output. It is a variant of ReLU that addresses some of its limitations.
  6. Parametric ReLU (PReLU): PReLU is an extension of Leaky ReLU where the slope for negative inputs is a learnable parameter. This allows each neuron to determine the most appropriate slope adaptively.
  7. Exponential Linear Unit (ELU): ELU smoothly saturates negative inputs, avoiding the dying ReLU problem. It introduces a non-zero output for negative inputs and decays exponentially for highly negative values.
  8. Gaussian Error Linear Unit (GELU): GELU provides a smooth approximation to the rectifier by weighting each input by the Gaussian cumulative distribution function evaluated at that input. It has shown promising results in specific applications, notably transformer architectures.
  9. Linear activation function: The linear activation function is one of the simplest activation functions used in neural networks. It applies a linear transformation to the input, meaning the output equals the input without any non-linear mapping.

Each activation function has its advantages, limitations, and use cases. The choice of activation function depends on the specific problem, the characteristics of the data, and the performance requirements. Experimentation and evaluation of different activation functions can help determine the most suitable one for a given task.

About the Author

Neri Van Otten

Neri Van Otten is the founder of Spot Intelligence, a machine learning engineer with over 12 years of experience specialising in Natural Language Processing (NLP) and deep learning innovation. Dedicated to making your projects succeed.
