The SWISH function and Instance Normalization are components used in deep learning to enhance model performance by improving activation behavior and normalizing data distributions, respectively. Here’s a detailed look at both concepts:
1. SWISH Activation Function:
The SWISH function is an activation function defined by the formula:
$$
\text{SWISH}(x) = x \cdot \sigma(\beta x)
$$
where:
- $x$ is the input to the activation function.
- $\sigma(x)$ is the sigmoid function: $\sigma(x) = \frac{1}{1 + e^{-x}}$ .
- $\beta$ is a constant or learnable parameter that controls the shape of the function.
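As a concrete illustration (not part of the original definition), a minimal NumPy sketch of the formula might look like this; `swish` and `sigmoid` are hypothetical helper names:

```python
import numpy as np

def sigmoid(z):
    # Standard logistic sigmoid.
    return 1.0 / (1.0 + np.exp(-z))

def swish(x, beta=1.0):
    # SWISH(x) = x * sigmoid(beta * x)
    return x * sigmoid(beta * x)

# SWISH is close to zero for large negative inputs, close to the identity for
# large positive inputs, and dips slightly below zero in between.
x = np.linspace(-5.0, 5.0, 11)
print(swish(x))            # beta = 1 (the SiLU case)
print(swish(x, beta=2.0))  # sharper transition around zero
```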
Properties of SWISH:
- Smooth and Non-monotonic: Unlike ReLU (Rectified Linear Unit), which is monotonic and has a kink at zero, SWISH is smooth everywhere and non-monotonic: it dips slightly below zero for small negative inputs. This can lead to better gradient flow and improved optimization during training.
- Range: For $\beta > 0$ the function is unbounded above but bounded below (for $\beta = 1$ the minimum is roughly $-0.28$), so it passes small negative values through rather than clipping them to zero as ReLU does.
- Derivative: The derivative of SWISH is more involved than that of ReLU but is still cheap to compute for backpropagation (see the expression below).
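For reference, applying the product rule together with $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$ gives:
$$
\frac{d}{dx}\,\text{SWISH}(x) = \sigma(\beta x) + \beta x\,\sigma(\beta x)\bigl(1 - \sigma(\beta x)\bigr) = \beta\,\text{SWISH}(x) + \sigma(\beta x)\bigl(1 - \beta\,\text{SWISH}(x)\bigr)
$$
so the gradient can be written in terms of quantities already available from the forward pass.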
Advantages of SWISH:
- Improved Performance: SWISH has been shown to match or outperform traditional activation functions such as ReLU and Leaky ReLU in some models, particularly deep neural networks, largely because its smoothness and small negative outputs preserve more gradient information than a hard zero cutoff.
- Flexible: The parameter $\beta$ allows for a range of behaviors. When $\beta = 1$, SWISH becomes $x \cdot \sigma(x)$ (also called SiLU); as $\beta \rightarrow 0$, it approaches the linear function $x/2$; and as $\beta \rightarrow \infty$, it approaches ReLU. When $\beta$ is learnable, this lets the activation adapt during training (a sketch with a learnable $\beta$ follows this list).
- Gradient Flow: SWISH helps mitigate issues with vanishing gradients in deep networks, which can improve convergence rates and final model accuracy.
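As a sketch of the learnable-$\beta$ variant mentioned above (assuming a PyTorch setting; the module name `SwishLearnable` is hypothetical):

```python
import torch
import torch.nn as nn

class SwishLearnable(nn.Module):
    """SWISH with a trainable beta, updated by backpropagation like any other weight."""
    def __init__(self, beta_init: float = 1.0):
        super().__init__()
        self.beta = nn.Parameter(torch.tensor(beta_init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x * sigmoid(beta * x); beta is learned during training.
        return x * torch.sigmoid(self.beta * x)

# Usage: drop it in wherever ReLU would otherwise go.
layer = nn.Sequential(nn.Linear(64, 64), SwishLearnable())
out = layer(torch.randn(8, 64))
```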
2. Instance Normalization:
Instance Normalization (IN) is a normalization technique used primarily in style transfer and generative models. It normalizes each feature map (channel) of each sample independently, computing statistics over the spatial dimensions only, rather than across the whole batch as in batch normalization.
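To make the per-instance behaviour concrete, here is a minimal PyTorch sketch (the tensor shape and `eps` value are illustrative assumptions); the formula it implements follows below:

```python
import torch
import torch.nn as nn

# A feature map of size N x C x H x W.
x = torch.randn(4, 3, 8, 8)

# Manual instance normalization: mean and variance per (sample, channel),
# computed over the spatial dimensions H and W only.
eps = 1e-5
mean = x.mean(dim=(2, 3), keepdim=True)
var = x.var(dim=(2, 3), unbiased=False, keepdim=True)
x_hat = (x - mean) / torch.sqrt(var + eps)

# This matches PyTorch's built-in layer (which uses affine=False by default).
ref = nn.InstanceNorm2d(num_features=3, eps=eps)(x)
print(torch.allclose(x_hat, ref, atol=1e-5))  # True
```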
Formula for Instance Normalization:
Given a feature map $x$ of size $N \times C \times H \times W$ (where $N$ is the batch size, $C$ is the number of channels, $H$ is the height, and $W$ is the width), the normalized output $\hat{x}$ for each instance is given by: