A Simplified View On Neural Network Activation Functions

Rajesh R. · Published in ILLUMINATION · Sep 11, 2021 · 4 min read

Understand Nonlinear Activation Functions in Neural Networks.


In a neural network architecture, a neuron is either active, i.e., fired, or inactive.

Let y be the pre-activation value of a neuron at any point in the neural network.
Then y = Σ (weight × input) + bias.

This ‘y’ value can range anywhere from -infinity to +infinity. That is acceptable if all we are modeling is linear regression; in that case, the activation function produces an output proportional to the input, as shown below.

Thus, Y = c * y, for linear transformation.

Here, c is a constant multiplier. Because the gradient of a linear function is a constant that carries no information about the input, a network built only from linear activations gains nothing from back-propagation and collapses into an equivalent single-layer network of limited ability.
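To see why, here is a minimal sketch (with arbitrary random weights) showing that two stacked linear layers compute exactly the same function as one suitably chosen linear layer:

import numpy as np

rng = np.random.default_rng(0)

x_in = rng.normal(size=3)                                # example input vector
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)     # layer 1 weights and bias
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)     # layer 2 weights and bias

# Two stacked layers with a linear (identity) activation
out_stacked = W2 @ (W1 @ x_in + b1) + b2

# The same computation collapsed into a single linear layer
W, b = W2 @ W1, W2 @ b1 + b2
out_single = W @ x_in + b

print(np.allclose(out_stacked, out_single))   # True: the extra layer adds nothing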

Most neural networks model complex data such as images, video, audio, and other data sets that are nonlinear or high-dimensional. Nonlinear activation functions allow back-propagation and the stacking of multiple hidden layers, so the network can learn such data with a high level of accuracy. The nonlinear transformation applied to ‘y’ can be written as:

Y = Activation Function(y), for nonlinear transformation.

Thus, in the above equation, the activation function transforms the summed, weighted input of a node into that node's activation (its output). Nonlinear activation functions turn this linear combination into a nonlinear output, which is what lets deep neural networks capture complex, higher-order relationships.
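As a small illustration with made-up numbers, a single neuron first computes the weighted sum y and then passes it through an activation function (here a sigmoid, covered below):

import numpy as np

inputs = np.array([0.5, -1.2, 3.0])     # values arriving at the neuron
weights = np.array([0.4, 0.7, -0.2])    # one weight per input
bias = 0.1

y = np.dot(weights, inputs) + bias      # linear part: the weighted sum plus bias
Y = 1 / (1 + np.exp(-y))                # nonlinear part: sigmoid activation
print(y, Y)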

The well-known nonlinear activation functions in neural networks are Sigmoid, Tanh, ReLU, and Leaky ReLU, among many others.

Sigmoid Activation Function

The sigmoid is the logistic function, sigmoid(x) = 1 / (1 + e^(-x)), with an output ranging between 0 and 1. Because the output flattens out for inputs of large magnitude, the sigmoid suffers from the vanishing-gradient problem: a significant change in the input causes only a slight change in the output, so the derivative becomes very small, making learning slow and ineffective for large networks.

import matplotlib.pyplot as plt
import numpy as np

# 100 evenly spaced points between -10 and 10, reused for all plots below
x = np.linspace(-10, 10, 100)

# Sigmoid: squashes any real input into the range (0, 1)
sigmoid = lambda x: 1 / (1 + np.exp(-x))

plt.plot(x, sigmoid(x))
plt.xlabel("x")
plt.ylabel("Sigmoid(x)")

plt.show()
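To see the saturation numerically, here is a small sketch that evaluates the sigmoid derivative, sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)), reusing the sigmoid defined above:

sigmoid_grad = lambda x: sigmoid(x) * (1 - sigmoid(x))   # derivative of the sigmoid

print(sigmoid_grad(0.0))    # 0.25, the largest value the derivative can take
print(sigmoid_grad(10.0))   # ~0.000045, practically zero: the gradient has vanished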

Tanh Activation Function

The tanh function is just a rescaled version of the logistic sigmoid, with a range of -1 to 1. It is worth noting that with both sigmoid and tanh, a neuron saturates whenever large weights push its output toward the extremes of its range; at those extremes the gradient is nearly zero, so learning slows down or stalls.

# Tanh: squashes any real input into the range (-1, 1); np.tanh(x) gives the same result
tanh = lambda x: (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

plt.plot(x, tanh(x))
plt.xlabel("x")
plt.ylabel("Tanh(x)")

plt.show()
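A quick numerical check of the "rescaled sigmoid" relationship, tanh(x) = 2 * sigmoid(2x) - 1, reusing the functions defined above:

rescaled = 2 * sigmoid(2 * x) - 1
print(np.allclose(tanh(x), rescaled))   # True: tanh is a shifted, stretched sigmoid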

ReLU Activation Function

The rectified linear unit (ReLU) is the most widely used activation function in deep learning. ReLU applies a simple threshold: any input less than or equal to 0 is mapped to 0, while positive inputs pass through unchanged. Because the output is 0 over the entire non-positive range, the gradient there is also 0. If a neuron's inputs keep landing in that range during training, its weights stop updating and the neuron ends up always outputting 0. This "dead neuron" problem reduces the capability of the network. ReLU is used mainly in hidden layers.

# ReLU: passes positive inputs through unchanged and clips negative inputs to 0
relu = lambda x: x * (x > 0)

plt.plot(x, relu(x))
plt.xlabel("x")
plt.ylabel("ReLU(x)")

plt.show()
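The gradient of ReLU is 1 for positive inputs and 0 everywhere else, which is exactly what makes dead neurons possible. A minimal sketch (the helper relu_grad is just for illustration):

relu_grad = lambda x: (x > 0).astype(float)   # derivative of ReLU: 1 if x > 0, else 0

print(relu_grad(np.array([-3.0, -0.5, 0.0, 2.0])))   # [0. 0. 0. 1.]
# If a neuron's input stays non-positive, its gradient stays 0 and its weights never change.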

Leaky ReLU Activation Function

The Leaky ReLU is an improvement over ReLU. First, ReLU is not differentiable at 0. More importantly, ReLU sets all negative values to zero, so a neuron that is pushed to large negative pre-activations cannot recover from being stuck at 0; it effectively dies, which is the dying-ReLU problem described above, and the network may essentially stop learning and underperform. Leaky ReLU comes to the rescue in such cases: it is a variant of ReLU that, for x < 0, applies a small, non-zero, constant slope (typically 0.01) instead of 0. Because its gradient never drops all the way to 0, Leaky ReLU also helps with vanishing gradients.

# Leaky ReLU: like ReLU, but negative inputs keep a small slope (0.01) instead of 0
leaky_relu = lambda x: np.where(x > 0, x, 0.01 * x)

plt.plot(x, leaky_relu(x))
plt.xlabel("x")
plt.ylabel("Leaky ReLU(x)")

plt.show()
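By contrast, Leaky ReLU keeps a small constant gradient (0.01 here) on the negative side, so the weights can still receive an update. A small sketch, reusing relu_grad from the previous snippet:

leaky_relu_grad = lambda x: np.where(x > 0, 1.0, 0.01)   # derivative of Leaky ReLU

neg_inputs = np.array([-3.0, -0.5])
print(relu_grad(neg_inputs))         # [0. 0.]     -> no learning signal at all
print(leaky_relu_grad(neg_inputs))   # [0.01 0.01] -> small but non-zero signal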

Quick Comparison Chart
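As a quick visual comparison, a minimal sketch that plots the four activations discussed above on a single figure, reusing the functions defined earlier:

# Plot all four activations on a single figure
for name, fn in [("Sigmoid", sigmoid), ("Tanh", tanh),
                 ("ReLU", relu), ("Leaky ReLU", leaky_relu)]:
    plt.plot(x, fn(x), label=name)

plt.xlabel("x")
plt.ylabel("Activation(x)")
plt.ylim(-2, 2)   # zoom in so the saturating functions stay visible next to ReLU
plt.legend()
plt.show()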

Conclusion

The choice of activation function is vital in a neural network. A good activation function should be differentiable (at least almost everywhere) so that gradients can flow during back-propagation, and it should allow training to converge quickly to good values of the weights.

