Semantic segmentation is one of the fundamental tools for fine-grained inference in computer vision (CV). To achieve the desired precision, models must understand the context of the environment in which they operate, and semantic segmentation provides that understanding through pixel-level accuracy. In this post, we will cover the following:

  • What is semantic segmentation?
  • Semantic segmentation vs. instance segmentation
  • Use cases for semantic segmentation
  • Popular semantic segmentation architectures
  • Image segmentation datasets
  • Image segmentation frameworks
  • Key takeaways

What is semantic segmentation?

Semantic segmentation is the process of classifying and labeling an image at the pixel level. It is easily confused with instance segmentation; the overarching distinction is that in semantic segmentation, all pixels that fall under a particular class share the same label value.

original image vs. semantic segmentation

Semantic segmentation vs. instance segmentation

To give a broader overview, semantic segmentation detects the category each object belongs to, whereas instance segmentation, as the name suggests, additionally identifies individual instances by giving them unique labels. Let’s break this down: suppose there are sheep in an image you’re about to annotate. Semantic segmentation will assign every sheep in the image to the same “sheep” class, while instance segmentation will also distinguish one sheep from another. Both segmentation methods make an impact across varied industries by reliably identifying objects of interest.
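The distinction is easy to see in the label maps themselves. Here is a minimal illustration with toy 4x4 masks (the values are made up for the example): in the semantic mask, both sheep blobs share the same class id, while the instance mask gives each blob its own id.

```python
import numpy as np

# Toy 4x4 label maps for an image containing two sheep.
# Semantic mask: every "sheep" pixel gets the same class id (1); background is 0.
semantic = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
])

# Instance mask: each sheep gets a unique instance id (1 and 2); background is 0.
instance = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 2],
    [0, 0, 2, 2],
])

print(np.unique(semantic))  # [0 1]   -> one "sheep" class
print(np.unique(instance))  # [0 1 2] -> two separate sheep
```

Both masks cover exactly the same pixels; only the labeling granularity differs.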

semantic segmentation vs. instance segmentation

Use cases for semantic segmentation

The applications of semantic segmentation for CV cover an array of disciplines:

  • Facial recognition
  • Handwriting recognition
  • Virtual image search
  • Self-driving cars
  • Fashion industry and virtual try-on
  • Mapping for satellite and aerial imagery
  • Medical imaging and diagnostics

Overall, semantic segmentation is used for more complex tasks than other image annotation alternatives, as it allows machines to develop higher-level judgment. Moving forward, we will explore common semantic segmentation architectures for further comprehension.

Popular semantic segmentation architectures

After the huge success of deep convolutional neural networks in the ImageNet challenge, the CV community gradually found applications for them in more sophisticated tasks, such as object detection, semantic segmentation, keypoint detection, and panoptic segmentation.


The evolution of semantic segmentation networks started with a small modification of the state-of-the-art (SOTA) classification models. The usual fully connected layers at the end of these networks were replaced with 1x1 convolutional layers, and a transposed convolution (interpolation followed by a convolution) was added as the last layer to project back to the original input size. These simple fully convolutional networks (FCNs) were the first successful semantic segmentation networks. The next major evolutionary step, made by U-Net, was the introduction of encoder-decoder architectures that also employed skip connections between encoder and decoder, which resulted in more fine-grained and crisp segmentation maps. These major architectural proposals were followed by various smaller improvements, resulting in a vast number of architectures, each with its own pros and cons.

Besides these foundational architectural changes, there are some enhancements to them that are also worth mentioning.

Hybrid CNN-CRF methods (DeepLab)

The initial DeepLab version brings some improvements to the existing encoder-decoder architectures to make the boundaries of the segmentation masks crisper. Besides a conditional random field (CRF) post-processing step, it makes use of dilated (atrous) convolutions, which counter reduced feature resolution. With dilated convolutions, the same receptive field can be achieved in shallower layers, making it possible to drop some of the pooling layers. Next, as is common in object detectors, DeepLab also utilizes pyramid pooling to handle objects at different scales. Finally, the CRF is applied separately from the neural network to improve the localization of edges. The main drawback of this approach is that there is no end-to-end training.
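The receptive-field gain from dilation follows from a simple formula: spreading the k taps of a kernel apart by dilation d gives an effective footprint of k + (k - 1)(d - 1) without adding any parameters. A quick sketch:

```python
# Effective kernel size of a dilated (atrous) convolution: inserting d-1 gaps
# between the taps of a k-wide kernel enlarges its footprint while keeping the
# same number of weights. k_eff = k + (k - 1) * (d - 1).

def effective_kernel(k, dilation):
    return k + (k - 1) * (dilation - 1)

for d in (1, 2, 4):
    print(f"3x3 kernel, dilation {d} -> {effective_kernel(3, d)}x{effective_kernel(3, d)} footprint")
# dilation 1 -> 3x3, dilation 2 -> 5x5, dilation 4 -> 9x9
```

This is why a stack of dilated 3x3 convolutions can match the receptive field of a deeper network with pooling, letting DeepLab keep higher-resolution feature maps.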


Following the success of DeepLab, FastFCN addresses the speed of the DeepLab model, which suffered from the dilated convolutions and the CRF. As mentioned before, DeepLab increases the receptive field in the shallower layers and gets rid of strided convolutions in the last blocks of ResNet. This increases the spatial dimensions of the feature maps, which becomes a significant speed bottleneck. FastFCN successfully addresses this issue by approximating the outputs of these blocks with its Joint Pyramid Upsampling (JPU) module.


DeepLabv3 addresses the speed limitations of its predecessor while also significantly improving the IoU score. The main drawback of the previous model was the CRF, which made it quite slow and, since it was trained separately from the neural network, potentially suboptimal score-wise. To address these issues, the CRF block was removed entirely, and the poor localization of edges was instead handled by combining dilated convolutions with atrous spatial pyramid pooling. This solves, within a single network, all three problems that DeepLab addressed separately. To boost speed even further, the DeepLab family (in the DeepLabv3+ variant) also adopted depthwise separable convolutions, which perform significantly fewer operations than standard ones.
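The savings from depthwise separable convolutions are easy to quantify. A standard k x k convolution with c_in input and c_out output channels uses k·k·c_in·c_out weights; the separable version splits it into a depthwise k x k convolution (k·k·c_in weights) followed by a 1x1 pointwise convolution (c_in·c_out weights). The channel counts below are illustrative:

```python
# Parameter-count comparison: standard conv vs. depthwise separable conv.

def standard_params(k, c_in, c_out):
    # One k x k filter per (input channel, output channel) pair.
    return k * k * c_in * c_out

def separable_params(k, c_in, c_out):
    # Depthwise: one k x k filter per input channel.
    # Pointwise: a 1x1 conv mixing channels.
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 256, 256
print(standard_params(k, c_in, c_out))   # 589824
print(separable_params(k, c_in, c_out))  # 67840  (~8.7x fewer weights)
```

The multiply-accumulate count shrinks by roughly the same factor, which is where the speedup comes from.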

Image segmentation datasets

To create better and more reliable machine learning (ML) models, they need to be trained on decent quantities of data. It’s not always realistic, feasible, or cost-effective to annotate hundreds or thousands of images by yourself or with a team. Besides, odds are you will have to go back and retrain the model if its performance does not meet your project requirements. In that case, you may need extra training and testing data, and this is where open-source datasets step in.

At this point, you may be wondering which datasets are best to get started with. We recommend checking out the COCO dataset, Pascal VOC, Cityscapes, and BDD100K, and visiting SuperAnnotate’s CV datasets section for more.

Image segmentation frameworks

For the purposes of this article, we also jotted down a list of frameworks you can use to level up your computer vision project:

  • FastAI library: creates masks of the objects in an image and enables quick, easy delivery of state-of-the-art results.
  • OpenCV: an open-source CV and ML library with 2,500+ optimized algorithms.
  • Sefexa image segmentation tool: a free tool for semi-automatic segmentation, image analysis, and ground truth creation for testing new segmentation algorithms.
  • MIScnn: an open-source Python library for medical image segmentation.
  • Fritz: provides a variety of image segmentation tools for AR experiences on mobile devices.

Key takeaways

In semantic segmentation, we go a step further and cluster together image parts that represent the same object class. The image is thus divided into multiple segments, which helps ML models better contextualize the input data and make predictions on it. We hope this article has expanded your understanding of the matter. Don’t hesitate to contact us should you need more information at any stage of your annotation pipeline. Enjoy the ride!
