Batch Normalization and Activation Function Sequence Confusion

Nihar Kanungo
Sep 11, 2019

The process of building a neural network involves many methodologies and techniques. A data scientist should carefully observe the impact of each feature on the outcome of the network before deciding which choices to make for that network. Data scientists and researchers perform enormous amounts of analysis and study to map each technique to the situations in which it performs best.

An activation function is the core of any neural network; without it, the network is essentially just a linear regression model, and we know that in reality there is hardly any set of data that is completely linear in nature. The activation function applies a non-linear transformation to the input, making the network capable of learning and performing more complex operations.

Similarly, batch normalization has, since its inception in 2015, been one of the most preferred generalization methods for neural networks. For quite some time people were confused between standardization, equalization, and normalization (many are confused even now). However, we will not discuss that topic here, as the intent of this article is to discuss the sequence in which batch normalization and the activation function should be applied.

Batch normalization is said to solve the problem of internal covariate shift (internal: within the network; covariate: a feature; shift: a change in its distribution), which essentially asks: if the inputs to a layer differ in amplitude and keep changing in distribution, how does the network learn a stable relationship between them? (A recent paper argues that batch normalization does not actually reduce internal covariate shift; for now we will set that aside, as we are yet to get a full understanding of it.) Having said that, the effect of batch normalization on the generalization of neural networks is very much evident.

Batch normalization is a very simple concept to understand. As the name suggests, it differs from image normalization in that the statistics are computed over a batch rather than over the entire image. The batch size is the number of images the network looks at in one step while optimizing and learning textures, patterns, parts of objects, objects, and so on.

For the sake of discussion, let's say our model has 15 convolution layers before the final prediction, layer 10 of the network consists of 64 channels (Image 1 below shows an example with 3 channels), and the batch size is 32. Batch normalization here works by calculating the mean and standard deviation of every channel over the batch, then subtracting the mean from the channel and dividing it by the standard deviation (as every channel is the output of a specific filter). Because the mean and standard deviation are computed from the batch rather than learned, they are non-trainable statistics, unlike the parameters the model trains during optimization.
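As a concrete sketch, the per-channel normalization described above can be written in a few lines of NumPy. The shapes, random values, and `eps` constant below are illustrative assumptions, not taken from the article's network:

```python
import numpy as np

# Hypothetical activations: batch of 32 images, 64 channels, 8 x 8 feature maps.
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(32, 64, 8, 8))

# Per-channel mean and standard deviation, computed over the batch
# and spatial dimensions (axes 0, 2, 3): one statistic per channel.
mean = x.mean(axis=(0, 2, 3), keepdims=True)
std = x.std(axis=(0, 2, 3), keepdims=True)

eps = 1e-5  # small constant to avoid division by zero
x_norm = (x - mean) / (std + eps)

# Every channel of the normalized output now has mean ~0 and std ~1.
print(x_norm.mean(axis=(0, 2, 3)).round(4))
print(x_norm.std(axis=(0, 2, 3)).round(4))
```

Note that the statistics here are computed, not trained, which is the "non-trainable" point made above.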

Image-1 : Channels of an Image and the relu function

Similarly, one of the most preferred and computation-friendly activation functions is relu, which works on the simple principle of setting negative values to 0 and keeping positive values as they are (refer to Image 1 above).
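The relu rule is a single NumPy expression; a minimal sketch with made-up sample values:

```python
import numpy as np

def relu(x):
    # Set negative values to 0; keep positive values as they are.
    return np.maximum(x, 0)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))  # the negatives become 0: values 0, 0, 0, 1.5, 3.0
```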

Once data scientists started using batch normalization together with activation functions, the obvious question arose: should batch normalization be used before the activation function or after it? Many research papers, blogs, and posts claim different results when using one ordering versus the other, but no concrete evidence has emerged that clearly shows the benefit of one over the other.

Here we will take an example of both scenarios and try to analyze the outcome. Let's say we are looking at a layer with an image size of 5 x 5 (we use a small image size for simplicity of calculation and presentation).

Image of Size 5 x 5

Please note: for ease of understanding, we have used the flattened version of the image data in all the examples below. In fact, this is how the GPU processes the data.

Activation Function after Batch Normalization

Relu after Batch Normalization

If we carefully observe the charts above, it is evident that the distribution of the input to the batch normalization layer and the distribution of its output are the same; the only difference is the scale at which they are represented. The distribution after the relu activation function shows that the negative values have been eliminated (we will look at the third chart again in a while).

Batch Normalization After Activation Function

Batch Normalization after “relu”

Similarly, in the two charts above it is clearly visible that batch normalization has only changed the scale, ensuring that the distribution of the data is exactly the same as before.

Now let's try to examine both the output charts:

  1. “relu after Batch Normalization” and
  2. “Batch Normalization after relu”.

If we carefully analyze them, we will find that the distribution of the data has not changed; only the values have. This should not impact the processing of the next layers, as the kernel (filter) that works on each output will simply learn different weights. For example, one kernel may end up with weights in the range of -2 to +1.5, whereas the other may settle at -1.8 to +1.7.
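The two orderings can be compared directly on random data. Below is a minimal single-channel sketch (no learnable scale/shift in the batch norm, and the 25 input values are randomly generated, so it mirrors the flattened 5 x 5 example only in spirit):

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Single-channel batch norm without learnable scale/shift.
    return (x - x.mean()) / (x.std() + eps)

def relu(x):
    return np.maximum(x, 0)

rng = np.random.default_rng(1)
x = rng.normal(size=25)  # a hypothetical flattened 5 x 5 feature map

a = relu(batch_norm(x))  # relu after Batch Normalization
b = batch_norm(relu(x))  # Batch Normalization after relu

# Ordering 1 clips at exactly 0; ordering 2 only shifts and rescales
# the relu output, so its "clipped" values sit at a negative constant.
print(a.min())           # 0.0
print(round(b.min(), 3)) # negative: -mean(relu(x)) / std(relu(x))

# b is an exact affine transform of relu(x), so the shape of its
# distribution is identical; only the range of values differs.
print(np.corrcoef(b, relu(x))[0, 1])
```

This matches the charts: the distributions agree, and the difference is only in the range of values each ordering produces.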

We also tried adding a 2 x 2 max pooling layer on top and analyzed its output. Let's observe the shape of the output in both cases.

Did you observe any difference? Yes, the curve of “relu + Batch Normalization + Max pool” has slightly larger values on the Y axis than “Batch Normalization + relu + Max pool”. However, the distribution and the behavior of both sets of data are the same. As we stated earlier, the kernels that need to work on them will be slightly different.
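To mirror the max pooling experiment, here is a small sketch. The 6 x 6 input is random and hypothetical; 2 x 2 pooling with stride 2 halves each side, so both orderings produce a 3 x 3 output:

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    return (x - x.mean()) / (x.std() + eps)

def relu(x):
    return np.maximum(x, 0)

def max_pool_2x2(img):
    # 2 x 2 max pooling with stride 2 on a 2-D array (even sides assumed).
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

rng = np.random.default_rng(2)
x = rng.normal(size=(6, 6))  # small even-sized map so 2 x 2 windows tile it

p1 = max_pool_2x2(relu(batch_norm(x)))  # BN -> relu -> max pool
p2 = max_pool_2x2(batch_norm(relu(x)))  # relu -> BN -> max pool

# Both BN and relu are non-decreasing pointwise maps, so max pooling
# commutes with them: each pooled value is just the transform applied
# to the window's maximum input. Only the value ranges differ.
print(p1.shape, p2.shape)  # both (3, 3)
```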


From the above analysis we could not find any significant difference between using one ordering or the other. We also ran two separate networks on the same input data:

  1. one with batch normalization after the activation function, and
  2. one with the activation function after batch normalization.

We saw no large difference in either the accuracy or the loss between the two networks. So, at this time, we may conclude that the difference between the two orderings is all about the range of the data and nothing else. Having said that, we are open to studying any claim made by researchers on this topic and updating our knowledge.

Thanks for reading the article.

Please write to me for any queries/concerns/corrections/ideas. I would be glad to connect with you.