We’re surrounded by thousands of objects at any given moment, and our eyes trace their boundaries in 3D, in real time, without effort. Thanks to image segmentation, computer vision can now do more than detect and label the objects in a visual: it can precisely outline their entire form, whatever their shape. That leap in accuracy and precision has expanded what image annotation can support, powering advancements in fields such as medical imaging, autonomous vehicles, agriculture, and many more. To understand why image segmentation matters in those use cases, we first need a firm grasp of what the process entails and how it differs from other forms of image annotation.
In this article, you’ll gain insight into:
- What is image segmentation?
- Types of image segmentation
- Image segmentation vs object detection & others
- Image segmentation techniques
- Image segmentation with deep learning
- Key takeaways

What is image segmentation?
Image segmentation is a vital computer vision task that builds on the premise of object detection. We will return to the similarities and differences between image segmentation, object detection, and other related processes in a moment. Before that, let’s pin down exactly what the task involves. As the name suggests, image segmentation divides an image into segments and assigns a label to each of them. This happens at the pixel level, which is what allows the model to define the precise outline of every object within the frame along with its class. Those outlines, which form the output, are highlighted in one or more colors depending on the type of segmentation.
To streamline image segmentation in machine learning, the system is trained on labeled datasets, whether manually collected or open-source, so that it can accurately identify and label the visuals it receives as input.
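To make the pixel-level nature of those labels concrete, here is a minimal sketch of what one training pair might look like; this is a toy NumPy example, and the class indices are illustrative, not from any real dataset:

```python
# A toy image/label pair for segmentation training: every pixel in the
# image has a corresponding class index in the mask (0 = background).
import numpy as np

image = np.zeros((6, 6, 3), dtype=np.uint8)   # 6x6 RGB image
image[2:5, 2:5] = (255, 0, 0)                 # a red object in the middle

mask = np.zeros((6, 6), dtype=np.int64)       # per-pixel class labels
mask[2:5, 2:5] = 1                            # class 1 marks the object's pixels

print(mask)  # the mask outlines the object exactly, pixel by pixel
```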
Types of image segmentation
Image segmentation can further be divided into two primary categories: semantic segmentation and instance segmentation. Semantic segmentation is what comes to mind for most people, as it matches the foundational definition we discussed above: the identification, grouping, and labeling of the pixels that together form a whole object. How does instance segmentation differ, then? Objects and their boundaries are detected in the same way, but each new object is labeled as a separate instance, even within the same class. There are billions of people in the world, yet no two are exactly alike. For simplicity, picture four people in one image: all of them are human, but each is a different individual varying in height, race, age, gender, and so on. Where semantic segmentation would label all of those objects simply as “person”, instance segmentation defines each individual as a new instance within that general category.

Instance segmentation, in other words, accounts for the variation between objects during the segmentation process. That subtlety between semantic and instance segmentation makes a noteworthy difference. The primary limitation of semantic segmentation is that it isn’t suited to cases where more precise labeling is necessary: to tell apart species of animals or plants, for example, the model must distinguish those differences in order to label accurately with minimal discrepancies.
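To see the two types side by side in code, here is a minimal sketch using pretrained models from torchvision; the weight names, preprocessing, and output handling are assumptions based on recent torchvision releases, so check the docs for your version:

```python
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn
from torchvision.models.segmentation import deeplabv3_resnet50

img = torch.rand(3, 256, 256)  # stand-in for a preprocessed RGB image

# Semantic segmentation: one class score per pixel, no notion of instances.
semantic_model = deeplabv3_resnet50(weights="DEFAULT").eval()
with torch.no_grad():
    out = semantic_model(img.unsqueeze(0))["out"]   # [1, num_classes, H, W]
semantic_mask = out.argmax(dim=1)                   # every "person" pixel gets the same label

# Instance segmentation: a separate mask, label, and score per detected object.
instance_model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()
with torch.no_grad():
    preds = instance_model([img])[0]                # masks: [num_instances, 1, H, W]
print(semantic_mask.shape, preds["masks"].shape)    # four people -> four separate masks
```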
Image segmentation vs object detection & others
Image segmentation is often conflated with other image annotation processes such as image classification, localization, and object detection. While seemingly similar, they are innately different. Now that we have established what image segmentation is, let’s briefly define the characteristics of each; a short code sketch after the list illustrates how their outputs differ.
- Image classification: A single class is assigned to an image, commonly for the main object portrayed in it. If an image contains a cat, the image will be classified as ‘cat’. We do not know the precise location of the cat in the image, nor can we identify its boundaries, as we can with object localization, detection, or segmentation.
- Object detection: The objects within an image or video are detected, marked with a bounding box, and labeled. The primary difference between object detection and image segmentation is the final output. With object detection, the key signifier is the bounding box, a square or rectangle drawn around the limits of each object. With image segmentation, the precise outline of the object itself is captured, without any background content inside the annotation.
- Localization: With image/object localization, we identify the location of the main subject of an image. However, localization typically does not assign a class to the localized object, and it considers only the main subject rather than every object present in the frame.
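Here is a minimal sketch of what each task’s output might look like for the same cat photo; the coordinates and values are illustrative toy data, not the output of any real model:

```python
import numpy as np

# Illustrative outputs for the same cat photo under each annotation task.
classification_output = "cat"                         # one label, no location

localization_output = (40, 30, 180, 160)              # main subject's box, no class

detection_output = [{"label": "cat",                  # a labeled box per object
                     "box": (40, 30, 180, 160)}]      # (x_min, y_min, x_max, y_max)

segmentation_output = np.zeros((200, 200), np.uint8)  # per-pixel class mask
segmentation_output[30:160, 40:180] = 1               # in practice: the cat's exact silhouette
```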

Image segmentation techniques
There are many techniques for achieving image segmentation, both classical and more modern. Each proposes a unique route to the final segmented output for an image or video. A few of the most common techniques include, but aren’t limited to, region-based segmentation, edge detection, thresholding, and clustering. Let’s take a closer look at how some of them work.

Region-based
With this first technique, similarities are detected among pixels in direct proximity to one another. Pixels nearest to each other are more likely to belong to the same object, which is why this technique analyzes the similarities and differences of adjacent pixels to determine an object’s boundaries. One shortcoming of this technique is that lighting and contrast within the image can lead to inaccurately defined object boundaries.
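As an illustration, here is a minimal region-growing sketch in NumPy, one common region-based approach; `grow_region` and the tolerance value are illustrative choices, not a library API:

```python
from collections import deque

import numpy as np

def grow_region(img, seed, tolerance=10):
    """Grow a region from a seed pixel, adding 4-connected neighbors
    whose intensity is within `tolerance` of the seed's intensity."""
    h, w = img.shape
    mask = np.zeros((h, w), dtype=bool)
    seed_value = int(img[seed])
    queue = deque([seed])
    mask[seed] = True
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and not mask[ny, nx]:
                if abs(int(img[ny, nx]) - seed_value) <= tolerance:
                    mask[ny, nx] = True
                    queue.append((ny, nx))
    return mask

# A toy grayscale image: a bright square on a dark background.
img = np.zeros((8, 8), dtype=np.uint8)
img[2:6, 2:6] = 200
region = grow_region(img, seed=(3, 3))
print(region.astype(int))  # 1s cover the bright square only
```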
Edge detection
With the aim of resolving the shortcomings of region-based techniques, edge detection algorithms emphasize object edges in order to achieve reliable results. This is done by identifying and classifying certain pixels as “edge pixels” before anything else. Edge detection works best on visuals whose objects have clearly defined outlines, and it is simple to implement for regular use compared to other, more time-consuming techniques.
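As a quick illustration, here is a minimal sketch using OpenCV’s Canny detector, one widely used edge detection algorithm; the file name and threshold values are illustrative assumptions to tune for your own images:

```python
import cv2

# Assumes "photo.jpg" exists alongside the script.
img = cv2.imread("photo.jpg", cv2.IMREAD_GRAYSCALE)
blurred = cv2.GaussianBlur(img, (5, 5), 0)  # smooth noise before detecting edges
edges = cv2.Canny(blurred, 100, 200)        # "edge pixels" are marked 255
cv2.imwrite("edges.png", edges)
```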
Thresholding
Thresholding is arguably the least complex technique for image segmentation. Essentially, it converts the original image into a black-and-white one, resulting in a binary image or binary map. Typically, each pixel is valued at either 0 or 1: 0 is assigned to background pixels below the threshold, and anything above the threshold, the foreground, is valued at 1. This is an optimal technique for images where the background and foreground have considerable contrast and must be separated.
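Here is a minimal thresholding sketch in NumPy; the fixed threshold of 128 is an illustrative assumption (methods such as Otsu’s can pick one automatically):

```python
import numpy as np

# A toy grayscale image: a bright object on a dark background.
img = np.array([[ 10,  20, 200, 210],
                [ 15,  25, 220, 215],
                [ 12,  18, 205, 225],
                [ 11,  22, 210, 230]], dtype=np.uint8)

threshold = 128
binary_map = (img > threshold).astype(np.uint8)  # background -> 0, foreground -> 1
print(binary_map)
```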
Image segmentation with deep learning
It doesn’t end there, however. Just as deep learning takes already high-performing processes and elevates them in accuracy and speed, the same applies to image segmentation: in recent years, deep learning has arguably become the most accurate approach to the task. If you are looking to run image segmentation for machine learning, you don’t need to dive deep into the complex architectures involved; a basic understanding of how they function is enough.

The key elements of an image segmentation architecture are the encoder and the decoder. Features are extracted from the image through convolutional filters and compressed in pooling layers, and the final output is produced as a segmentation mask. This is known as the convolutional encoder-decoder architecture. Another notable model is U-Net, a convolutional neural network that resembles a ‘U’ shape when its architecture is visualized. It is composed of two parts, the downsampling and upsampling stages, otherwise referred to as the contracting and expansive paths respectively. The significance of U-Net is the accuracy and speed it achieves by reusing the feature maps from the contracting path when expanding the compressed representation back into a fully segmented output image. The most prominent use of the U-Net architecture is image segmentation for medical imaging.
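To make the shape of that architecture concrete, here is a minimal U-Net-style sketch in PyTorch; the channel widths and depth are illustrative simplifications, not the original published architecture:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions with ReLU, as in U-Net's basic building block.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )

class MiniUNet(nn.Module):
    def __init__(self, in_ch=3, num_classes=2):
        super().__init__()
        # Contracting path (downsampling): extract features, halve resolution.
        self.enc1 = conv_block(in_ch, 64)
        self.enc2 = conv_block(64, 128)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(128, 256)
        # Expansive path (upsampling): restore resolution while reusing
        # encoder feature maps through skip connections.
        self.up2 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.dec2 = conv_block(256, 128)
        self.up1 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec1 = conv_block(128, 64)
        # 1x1 convolution maps features to per-pixel class scores.
        self.head = nn.Conv2d(64, num_classes, 1)

    def forward(self, x):
        e1 = self.enc1(x)                    # full resolution
        e2 = self.enc2(self.pool(e1))        # 1/2 resolution
        b = self.bottleneck(self.pool(e2))   # 1/4 resolution
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return self.head(d1)                 # [N, num_classes, H, W]

model = MiniUNet()
logits = model(torch.randn(1, 3, 128, 128))  # per-pixel class scores
mask = logits.argmax(dim=1)                  # [1, 128, 128] segmentation map
print(mask.shape)
```

Visualized, the contracting path forms the left arm of the ‘U’, the bottleneck its base, and the expansive path the right arm, with the skip connections bridging the two arms at each resolution.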
Key takeaways
Today, thanks to image segmentation, we can detect the location, class, and exact boundaries of objects within images and videos. The two prominent types of image segmentation are semantic and instance segmentation, the latter of which opens new avenues for accurate, detailed segmentation of variation within images. That is particularly valuable for everyday use cases ranging from satellite imagery to machine vision in agricultural automation. Implementing in AI a process that our eyes carry out effortlessly is revolutionary for technology.
