The SWISH function and Instance Normalization are components used in deep learning to enhance model performance by improving activation behavior and normalizing data distributions, respectively. Here’s a detailed look at both concepts:
1. SWISH Activation Function:
The SWISH function is an activation function defined by the formula:
$$
\text{SWISH}(x) = x \cdot \sigma(\beta x)
$$
where:
- $x$ is the input to the activation function.
- $\sigma(x)$ is the sigmoid function: $\sigma(x) = \frac{1}{1 + e^{-x}}$ .
- $\beta$ is a constant or learnable parameter that controls the shape of the function.
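As a concrete illustration (not part of the original definition), a minimal NumPy sketch of the formula might look like this; `swish` and `sigmoid` are hypothetical helper names:

```python
import numpy as np

def sigmoid(z):
    # Standard logistic sigmoid.
    return 1.0 / (1.0 + np.exp(-z))

def swish(x, beta=1.0):
    # SWISH(x) = x * sigmoid(beta * x)
    return x * sigmoid(beta * x)

# SWISH is close to zero for large negative inputs, close to the identity for
# large positive inputs, and dips slightly below zero in between.
x = np.linspace(-5.0, 5.0, 11)
print(swish(x))            # beta = 1 (the SiLU case)
print(swish(x, beta=2.0))  # sharper transition around zero
```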
Properties of SWISH:
- Smooth and Non-monotonic: Unlike ReLU (Rectified Linear Unit), which is monotonic and has a kink at zero, SWISH is smooth everywhere and non-monotonic: it dips slightly below zero for small negative inputs. This can lead to better gradient flow and improved optimization during training.
- Range: For $\beta > 0$ the function is unbounded above but bounded below (for $\beta = 1$ the minimum is roughly $-0.28$), so it passes small negative values through rather than clipping them to zero as ReLU does.
- Derivative: The derivative of SWISH is more involved than that of ReLU but is still cheap to compute for backpropagation (see the expression below).
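For reference, applying the product rule together with $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$ gives:
$$
\frac{d}{dx}\,\text{SWISH}(x) = \sigma(\beta x) + \beta x\,\sigma(\beta x)\bigl(1 - \sigma(\beta x)\bigr) = \beta\,\text{SWISH}(x) + \sigma(\beta x)\bigl(1 - \beta\,\text{SWISH}(x)\bigr)
$$
so the gradient can be written in terms of quantities already available from the forward pass.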
Advantages of SWISH:
- Improved Performance: SWISH has been shown to match or outperform traditional activation functions such as ReLU and Leaky ReLU in some models, particularly deep neural networks, largely because its smoothness and small negative outputs preserve more gradient information than a hard zero cutoff.
- Flexible: The parameter $\beta$ allows for a range of behaviors. When $\beta = 1$, SWISH becomes $x \cdot \sigma(x)$ (also called SiLU); as $\beta \rightarrow 0$, it approaches the linear function $x/2$; and as $\beta \rightarrow \infty$, it approaches ReLU. When $\beta$ is learnable, this lets the activation adapt during training (a sketch with a learnable $\beta$ follows this list).
- Gradient Flow: SWISH helps mitigate issues with vanishing gradients in deep networks, which can improve convergence rates and final model accuracy.
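As a sketch of the learnable-$\beta$ variant mentioned above (assuming a PyTorch setting; the module name `SwishLearnable` is hypothetical):

```python
import torch
import torch.nn as nn

class SwishLearnable(nn.Module):
    """SWISH with a trainable beta, updated by backpropagation like any other weight."""
    def __init__(self, beta_init: float = 1.0):
        super().__init__()
        self.beta = nn.Parameter(torch.tensor(beta_init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x * sigmoid(beta * x); beta is learned during training.
        return x * torch.sigmoid(self.beta * x)

# Usage: drop it in wherever ReLU would otherwise go.
layer = nn.Sequential(nn.Linear(64, 64), SwishLearnable())
out = layer(torch.randn(8, 64))
```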
2. Instance Normalization:
Instance Normalization (IN) is a normalization technique used primarily in style transfer and generative models. It normalizes each feature map (channel) of each sample independently, computing statistics over the spatial dimensions only, rather than across the whole batch as in batch normalization.
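To make the per-instance behaviour concrete, here is a minimal PyTorch sketch (the tensor shape and `eps` value are illustrative assumptions); the formula it implements follows below:

```python
import torch
import torch.nn as nn

# A feature map of size N x C x H x W.
x = torch.randn(4, 3, 8, 8)

# Manual instance normalization: mean and variance per (sample, channel),
# computed over the spatial dimensions H and W only.
eps = 1e-5
mean = x.mean(dim=(2, 3), keepdim=True)
var = x.var(dim=(2, 3), unbiased=False, keepdim=True)
x_hat = (x - mean) / torch.sqrt(var + eps)

# This matches PyTorch's built-in layer (which uses affine=False by default).
ref = nn.InstanceNorm2d(num_features=3, eps=eps)(x)
print(torch.allclose(x_hat, ref, atol=1e-5))  # True
```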
Formula for Instance Normalization:
Given a feature map $x$ of size $N \times C \times H \times W$ (where $N$ is the batch size, $C$ is the number of channels, $H$ is the height, and $W$ is the width), the normalized output $\hat{x}$ for each instance is given by: