Convolutional neural networks (CNNs) are to deep learning what a magnifying glass is to an investigator. Just as a loupe brings the tiniest details into view, CNN layers analyze an image part by part by sliding a “window” over the input. Widely applied in image processing, classification, and similar tasks, a convolutional neural network is a network with one or more convolutional layers, and it owes its effectiveness to that structure.

We put together a beginner’s guide to convolutional neural networks to give you a head start:

  • Brief history of convolutional neural networks
  • Layers explained
  • Must-know architectures
  • Key learning points

Brief history of convolutional neural networks

Convolutional neural networks are among the most popular deep neural networks and have taken a giant stride forward in the last decade. They were first developed and used around the 1980s. At the time, their potential was limited to handwritten digit recognition, so, from an industry standpoint, they were predominantly used in the postal sector to spot and identify zip codes.

Deep learning models require huge amounts of data and massive computing resources to achieve the desired level of prediction accuracy. Given those constraints, early CNN models did not become part of mainstream machine learning right away. Today, by contrast, convolutional neural networks offer a scalable approach to classification and object detection tasks.

Layers explained

As a sub-category of neural networks, the CNN is distinctive in that it works on two-dimensional input images. A typical CNN architecture is composed of the following blocks:

  • Convolutional layer
  • Activation function
  • Dropout
  • Pooling layer
  • Fully connected layer

We’ll go through each one moving forward:

Convolutional layer

The convolutional layer serves as a feature extractor. At this level, a CNN performs a convolution: a two-dimensional array of weights, also referred to as a filter or kernel, is multiplied element-wise with a same-sized patch of the input data, and the products are summed. By sliding the filter over the input image, this dot product is taken between the filter and each filter-sized part of the image in turn.

The output is termed a feature map, which gives us information about the image, such as its corners and edges. Later, this feature map is fed to other layers, which learn further features of the input image.
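The sliding-window dot product described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming no padding and a stride of 1; the function name `conv2d`, the 4x4 image, and the 2x2 edge filter are made up for the example. Like most deep learning libraries, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot product between the kernel and the image patch under it
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        feature_map.append(row)
    return feature_map


# Hypothetical 4x4 image: dark left half (0), bright right half (1)
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical 2x2 filter that responds to a dark-to-bright vertical edge
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = conv2d(image, kernel)
# feature_map == [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The middle column of the feature map lights up exactly where the filter straddles the edge, which is the sense in which a feature map "gives us information about the image".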

Activation function

An activation function is applied to the output of a neuron, either between layers or at the end of the network, and helps decide whether the neuron fires in the forward direction, if at all. In other words, it introduces the non-linearity the network needs to model relationships between its variables. There are many activation functions (sigmoid, tanh, Leaky ReLU, Maxout, ELU, etc.), of which the Rectified Linear Unit (ReLU) is used most often. That is because ReLU needs no heavy calculations, unlike many other activation functions: it takes a single comparison of whether the incoming value is larger than 0. So, ReLU is often preferred since it’s computationally cheaper and faster. Yet, just like any function, ReLU doesn’t come without setbacks. The other side of the coin is that it saturates for negative inputs (gradient = 0), so the weights feeding a neuron stuck in that region are not updated during backpropagation.
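To make the trade-off concrete, here is a small sketch of ReLU next to Leaky ReLU, together with their gradients. The function names and the slope of 0.01 are illustrative choices, not values taken from this guide.

```python
def relu(x):
    # A single comparison: pass positive values through, clamp the rest to 0
    return x if x > 0 else 0.0


def relu_grad(x):
    # Gradient is 0 for non-positive inputs, so no update flows back through them
    return 1.0 if x > 0 else 0.0


def leaky_relu(x, slope=0.01):
    # Keeps a small slope for negative inputs instead of clamping to 0
    return x if x > 0 else slope * x


def leaky_relu_grad(x, slope=0.01):
    return 1.0 if x > 0 else slope


inputs = [-2.0, -0.5, 0.0, 1.5]
relu_out = [relu(x) for x in inputs]                # [0.0, 0.0, 0.0, 1.5]
relu_grads = [relu_grad(x) for x in inputs]         # [0.0, 0.0, 0.0, 1.0]
leaky_grads = [leaky_relu_grad(x) for x in inputs]  # [0.01, 0.01, 0.01, 1.0]
```

With plain ReLU, the gradient for every non-positive input is exactly 0, while Leaky ReLU keeps a small gradient there.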

Dropout layer

Dropout is not a mandatory CNN layer; however, when all features are connected to a fully connected layer, the model can overfit the training set. Overfitting crops up when a model performs extremely well on the data used to train it and then performs poorly on new data. To tackle this issue, the dropout layer is introduced. As the name implies, a share of the neurons is randomly dropped out at each step of the training process, which effectively trains a smaller network each time. Dropout can be applied per layer and can be an effective yet computationally affordable way to avoid or reduce overfitting.

Note: Despite the effectiveness of the dropout layer, overfitting may still occur when the training set is too small. So, it’s crucial to understand the reasons behind overfitting first when developing a CNN model.
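As a rough sketch of the mechanism, the snippet below implements the common "inverted dropout" variant, which is an assumption on our part since this guide does not name a specific variant. Surviving activations are scaled up during training so that nothing needs to change at inference time. The function name, the rate of 0.5, and the seed are illustrative.

```python
import random


def dropout(activations, rate, training=True, rng=random):
    """Randomly zero a share `rate` of activations during training (inverted dropout)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        # At inference time every neuron stays active
        return list(activations)
    keep = 1.0 - rate
    # Survivors are scaled by 1/keep so the expected activation stays the same
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the run is repeatable
activations = [1.0] * 10
dropped = dropout(activations, rate=0.5, rng=rng)           # each entry is 0.0 or 2.0
unchanged = dropout(activations, rate=0.5, training=False)  # identical to the input
```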

Pooling layer

The pooling function reduces the size of the convolved feature map, which in turn lowers the parameter count of the layers that follow and the computational cost, and helps control overfitting. There are two general types of pooling: max and average pooling. Just as it sounds, max pooling retains the most prominent value in each window of the feature map and discards the rest, while average pooling takes the average value of each window, extracting features more smoothly. In short, average pooling takes everything into account, including things that may or may not be important, while max pooling is restricted to the strongest features. The choice of pooling layer therefore depends on what you expect the convolutional features to convey: sharp, localized responses favor max pooling, while smoother summaries favor average pooling.
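The difference between the two pooling types is easy to see on a small example. The sketch below assumes non-overlapping 2x2 windows; `pool2d` and the 4x4 feature map are hypothetical names and values chosen for illustration.

```python
def pool2d(feature_map, size, reduce):
    """Apply `reduce` to each non-overlapping size x size window of `feature_map`."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            window = [
                feature_map[i + di][j + dj]
                for di in range(size)
                for dj in range(size)
            ]
            row.append(reduce(window))
        pooled.append(row)
    return pooled


def mean(values):
    return sum(values) / len(values)


feature_map = [
    [1, 3, 2, 0],
    [5, 7, 1, 1],
    [0, 2, 9, 3],
    [4, 2, 1, 3],
]

max_pooled = pool2d(feature_map, 2, max)   # [[7, 2], [4, 9]]
avg_pooled = pool2d(feature_map, 2, mean)  # [[4.0, 1.0], [2.0, 4.0]]
```

Both versions shrink the 4x4 map to 2x2. Max pooling keeps only the strongest response per window, whereas average pooling blends all four values.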

Fully connected layer

While the convolutional and pooling layers extract features and reduce the dimensions of feature maps, the fully connected (FC) layer acts more as a go-between: FC layers form the last few layers of the convolutional neural network, preceding the output layer. The feature maps are flattened into a vector before they reach the FC layer, which connects every neuron in one layer to every neuron in the next through weights and biases and combines the extracted features for the final classification. So, the FC layer is the essential glue of a well-rounded trainable model.
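A minimal sketch of that last stage might look as follows: flatten the feature maps, apply one fully connected layer, and turn the scores into class probabilities with softmax. All names, weights, and inputs here are made up for illustration; a trained network would learn the weights and biases.

```python
import math


def flatten(feature_maps):
    # Turn a stack of 2D feature maps into one flat vector
    return [value for fmap in feature_maps for row in fmap for value in row]


def dense(inputs, weights, biases):
    # Every output neuron is connected to every input value
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def softmax(scores):
    # Subtract the max for numerical stability before exponentiating
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


feature_maps = [[[1.0, 0.0], [0.0, 1.0]]]  # one hypothetical 2x2 feature map
features = flatten(feature_maps)            # [1.0, 0.0, 0.0, 1.0]

# Hypothetical weights for two output classes
weights = [[0.5, 0.0, 0.0, 0.5], [0.0, 0.5, 0.5, 0.0]]
biases = [0.0, 0.0]

scores = dense(features, weights, biases)   # [1.0, 0.0]
probabilities = softmax(scores)             # sums to 1, first class more likely
```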

Must-know architectures

Since their advent, convolutional neural network architectures have evolved significantly, driving impressive progress in the field. Over time, they have steadily lowered error rates, setting a solid foundation for the future of the state of the art.


AlexNet

Proposed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton in 2012 and trained on the ImageNet dataset of high-resolution RGB images (the full dataset holds about 14 million images; the ILSVRC subset used for training has roughly 1.2 million images in 1,000 classes, fed to the network at a size of 227x227x3), AlexNet became a groundbreaking architecture for image classification, winning ILSVRC-2012. Here, padding was used to avoid reducing the dimensions of the feature maps. The model encompassed five convolutional layers, some followed by max pooling, three fully connected layers, and two dropout layers to prevent overfitting. The hidden layers used ReLU activation, which sped up training considerably (about 6x compared with tanh units). The total number of learnable parameters is upwards of 60 million.
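The effect of padding on feature map size follows from the standard output-size formula for a convolution, sketched below. The helper name is ours. The first call uses the commonly reported configuration of AlexNet's first layer (11x11 kernel, stride 4), which is an assumption not stated in this guide, and the 27x27 example is hypothetical.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial size of a convolution output along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


# Assumed first-layer setup: 227-pixel input, 11x11 kernel, stride 4, no padding
first_layer = conv_output_size(227, 11, stride=4)  # 55

# Hypothetical 27x27 feature map with a 5x5 kernel
unpadded = conv_output_size(27, 5)                 # 23: the map shrinks
padded = conv_output_size(27, 5, padding=2)        # 27: padding preserves the size
```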



VGG

VGG is another convolutional neural network classic, developed by K. Simonyan and A. Zisserman. As a combination of 13 convolutional layers (3x3 convolutional filters, interleaved with 2x2 max pooling) and three fully connected layers with softmax activation at the output, the network is indeed huge, totaling 138 million parameters. All of VGG’s hidden layers use ReLU. The model achieved ~93% top-5 test accuracy in ILSVRC-2014, classifying input images of 224x224x3 into a thousand classes. VGG-16 addressed AlexNet’s assortment of large, hand-tuned kernel sizes by applying small (3x3) filters throughout; the downsides, however, were the computation cost, longer training, and the vanishing/exploding gradient problem.
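The 138 million figure can be reproduced with a short calculation. The layer widths below come from the commonly cited VGG-16 configuration (an assumption, since this guide does not list them), and the helper names are ours.

```python
def conv_params(in_channels, out_channels, kernel=3):
    # kernel x kernel weights per input/output channel pair, plus one bias per filter
    return kernel * kernel * in_channels * out_channels + out_channels


def fc_params(in_features, out_features):
    # One weight per input/output pair, plus one bias per output neuron
    return in_features * out_features + out_features


# Assumed VGG-16 layout: 13 convolutional layers, then 3 fully connected layers
conv_widths = [64, 64, 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, 512]
fc_widths = [4096, 4096, 1000]

total = 0
channels = 3  # RGB input
for width in conv_widths:
    total += conv_params(channels, width)
    channels = width
conv_total = total

features = 7 * 7 * 512  # 224 halved by five 2x2 poolings gives 7x7 maps
for width in fc_widths:
    total += fc_params(features, width)
    features = width

# conv_total == 14_714_688 and total == 138_357_544
```

Under these assumptions, the three fully connected layers account for the vast majority of the parameters, even though the network has more than four times as many convolutional layers.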



ResNet

Ever since the success of AlexNet, neural networks have gone deeper in the race to achieve impressive performance, stacking up to hundreds of layers across multiple computer vision applications. With this, however, came the pressing problem of vanishing and exploding gradients, which caused unwanted training error:

  • Vanishing gradient: The gradient becomes so small that the weights barely update, and the network doesn't learn.
  • Exploding gradient: The weight updates are so large that the weights blow up in value and become far larger than needed, resulting in training failure.
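A toy calculation shows why depth makes both problems worse. During backpropagation, the gradient is multiplied by one factor per layer; the sketch below assumes, purely for illustration, that every layer contributes the same factor across 50 layers.

```python
def gradient_after(layers, factor):
    """Multiply a starting gradient of 1.0 by `factor` once per layer."""
    gradient = 1.0
    for _ in range(layers):
        gradient *= factor
    return gradient


vanishing = gradient_after(50, 0.5)  # roughly 8.9e-16: updates effectively stop
exploding = gradient_after(50, 1.5)  # roughly 6.4e8: updates blow up
```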

The issues above were addressed with the introduction of ResNet and its use of skip connections: the input of a block bypasses a few layers and is added directly to their output, so gradients have a short path back to earlier layers. If the bypassed layers turn out to be unhelpful, the block can fall back to passing its input through unchanged, so adding layers should not raise the training error. ResNet thus marked a major leap in the state of the art, coming from Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. The first ResNet architecture was a 34-layer network, influenced by VGG-16 (except ResNet had fewer filters). The image below demonstrates ResNet shortcut/skip connections in action.
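In code, a skip connection amounts to adding a block's input to the output of its layers. The sketch below is a simplified stand-in that uses two small linear layers instead of convolutions and omits the activation that usually follows the addition. All names and weights are illustrative.

```python
def relu(vector):
    return [max(0.0, x) for x in vector]


def linear(vector, weights):
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def residual_block(x, w1, w2):
    """Compute F(x) + x, where F is two linear layers with a ReLU in between."""
    fx = linear(relu(linear(x, w1)), w2)
    # Skip connection: the input is added back to the block's output
    return [f + xi for f, xi in zip(fx, x)]


x = [1.0, 2.0]

# With all-zero weights, F(x) is 0 and the block passes x through unchanged
zeros = [[0.0, 0.0], [0.0, 0.0]]
identity_out = residual_block(x, zeros, zeros)  # [1.0, 2.0]

# With non-zero (hypothetical) weights, the block learns a correction on top of x
w1 = [[1.0, 0.0], [0.0, 1.0]]
w2 = [[0.5, 0.0], [0.0, 0.5]]
adjusted_out = residual_block(x, w1, w2)        # [1.5, 3.0]
```

The zero-weight case is what makes very deep stacks trainable in this picture: a block that has nothing useful to add can simply behave like the identity.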



DenseNet

DenseNet is a direct response to the observation that CNN models are more efficient to train when there are shorter connections between layers, which explains why, in DenseNet, every layer is connected to every other layer in a feed-forward manner. The network was first published in 2016 by Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. In a typical DenseNet architecture, each layer receives the feature maps of all preceding layers as inputs, which strengthens feature propagation, reduces the number of parameters, and mitigates the vanishing gradient problem, as in ResNet. Architecture-wise, the two are similar, except the authors swapped in the dense block as the repeated unit. Set side by side, DenseNet achieves the same level of performance (and even beyond) without needless complexity.
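One way to picture dense connectivity is to track how many channels each layer in a dense block receives. In the sketch below, every layer adds a fixed number of new feature maps (often called the growth rate), which are concatenated to everything that came before. The function name and the numbers 64, 32, and 4 are illustrative assumptions.

```python
def dense_block_channels(input_channels, growth_rate, num_layers):
    """Return the input channel count seen by each layer, and the block's output count."""
    per_layer_inputs = []
    channels = input_channels
    for _ in range(num_layers):
        per_layer_inputs.append(channels)
        # New feature maps are concatenated to the existing ones, not added
        channels += growth_rate
    return per_layer_inputs, channels


layer_inputs, block_output = dense_block_channels(64, 32, 4)
# layer_inputs == [64, 96, 128, 160], block_output == 192
```

Because each layer only has to produce a small number of new feature maps and can reuse all earlier ones, the layers themselves can stay narrow.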

Key learning points

While early CNNs were little more than stacked convolutional layers, modern convolutional neural network architectures introduce novel structures to foster effective learning. As observed, the overarching strength of convolutional neural networks is their ability to identify relevant features automatically, without human intervention. Above, we explored the history of CNNs, their layers, and some common architectures. Sign up for our newsletter to receive up-to-date guides on specific architecture types and more.
