We often overlook how frequently we rely on technology in everyday life, whether we're unlocking our smartphones with facial recognition or running a reverse image search. At the root of most of these processes is the machine's ability to analyze an image and assign a label to it, for example when distinguishing between different plant species in plant phenotyping. Image classification brings that human capability to the world of tech. Essentially, technology and artificial intelligence have evolved to possess eyes of their own and perceive the world through computer vision. Image classification acts as a foundation for many other vital computer vision tasks that keep advancing as we go. Let's focus on what exactly image classification is in machine learning and expand further from there. We've compiled a guide to image classification that covers the basics, and a little more.
Here is a comprehensive breakdown of what we'll cover today:
- What is image classification?
- Types of image classification
- Image classification vs. object detection
- How image classification works
- Algorithms and models: Supervised and unsupervised classification
- Deep neural networks for image classification
- Data curation in SuperAnnotate
- Applications of image classification
- Key takeaways
What is image classification?
Among computer vision tasks, image classification stands out with its irreplaceable role in modern technology. It involves assigning a label or tag to an entire image based on preexisting training data of already labeled images. While the process may appear simple at first glance, it actually entails pixel-level image analysis to determine the most appropriate label for the overall image. This provides us with valuable data and insights, enabling informed decisions and actionable outcomes.
However, we need to make sure that data labeling is completed accurately in the training phase to avoid discrepancies in the data. In order to facilitate accurate data labeling, publicly available datasets are often used in the model training phase.
Types of image classification
Depending on the problem at hand, there are different types of image classification methodologies to be employed. These are binary, multiclass, multilabel, and hierarchical.
Binary: Binary classification applies either-or logic to label images, sorting unknown data points into one of two categories. Categorizing tumors as benign or malignant, checking whether a product has defects or not, and many other problems that require yes/no answers are solved with binary classification.
Multiclass: While binary classification is used to distinguish between two classes of objects, multiclass classification, as the name suggests, categorizes items into three or more classes. It's useful in many domains, such as NLP (sentiment analysis where more than two emotions are present) and medical diagnosis (classifying diseases into different categories).
Multilabel: Unlike multiclass classification, where each image is assigned to exactly one class, multilabel classification allows the item to be assigned to multiple labels. For example, you may need to classify image colors and there are several colors. A picture of a fruit salad will have red, orange, yellow, purple, and other colors depending on your creativity with fruit salads. As a result, one image will have multiple colors as labels.
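To make the difference between these label types concrete, here is a minimal sketch of how a model's raw output scores could be turned into labels. The scores, the color names, and the 0.5 threshold are made-up assumptions for illustration, not the output of any real model: a multiclass decision keeps the single most probable class, while a multilabel decision keeps every class that clears the threshold on its own.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(score):
    """Squash one raw score into an independent probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-score))

def decode_multiclass(scores, class_names):
    """Exactly one label: the class with the highest probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best]

def decode_multilabel(scores, class_names, threshold=0.5):
    """Zero or more labels: every class whose own probability clears the threshold."""
    return [name for name, s in zip(class_names, scores) if sigmoid(s) >= threshold]

colors = ["red", "orange", "yellow", "purple"]
raw_scores = [2.1, 0.3, -1.5, 1.2]  # hypothetical scores for a fruit-salad photo

print(decode_multiclass(raw_scores, colors))  # one color only
print(decode_multilabel(raw_scores, colors))  # every color that is likely present
```

Binary classification is the special case with a single score and a single threshold.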
Hierarchical: Hierarchical classification is the task of organizing classes into a hierarchical structure based on their similarities, where a higher-level class represents broader categories and a lower-level class is more concrete and specific. Let's get back to our fruits and understand the concept based on a juicy example.
Our first model will recognize apple vs. grape. If it predicts an apple, another model will be called to identify the subtype of apple: Honeycrisp, Red Delicious, or McIntosh Red. The lower-level classes inherit all the attributes of their higher-level class. This simple example is just to give you an idea of hierarchical image classification; in wider real-life cases, a hierarchy provides a flexible and interpretable framework for organizing and representing complex visual concepts and also allows for efficient knowledge transfer between related classes.
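The two-stage apple example can be sketched as a small routing function. The "models" below are hypothetical stand-ins that simply read a hint from a dictionary; real models would be trained classifiers that look at pixels. Only the control flow is the point here: run the top-level model, then call a more specific model if one exists for the predicted class.

```python
def predict_hierarchical(image, root_model, child_models):
    """Run the root model, then keep descending while a child model exists for the label."""
    path = [root_model(image)]
    while path[-1] in child_models:
        path.append(child_models[path[-1]](image))
    return path

# Hypothetical stand-in models (assumptions for illustration only).
def fruit_model(image):
    return image["fruit_hint"]

def apple_model(image):
    return image["variety_hint"]

child_models = {"apple": apple_model}  # no sub-model is registered for "grape"

apple_photo = {"fruit_hint": "apple", "variety_hint": "Honeycrisp"}
grape_photo = {"fruit_hint": "grape"}

print(predict_hierarchical(apple_photo, fruit_model, child_models))
print(predict_hierarchical(grape_photo, fruit_model, child_models))
```

The returned path lists the prediction at every level, which is part of what makes the hierarchical approach easy to interpret.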
Image classification vs. object detection
Image classification, object detection, object localization — all of these may be a tangled mess in your mind, and that's completely fine if you are new to these concepts. In reality, they are essential components of computer vision and image annotation, each with its own distinct nuances. Let's untangle the intricacies right away.
We've already established that image classification refers to assigning a specific label to the entire image. On the other hand, object localization goes beyond classification and focuses on precisely identifying and localizing the main object or regions of interest in an image. By drawing bounding boxes around these objects, object localization provides detailed spatial information, allowing for more specific analysis.
Object detection, by contrast, is the method of locating items within an image and assigning labels to them, as opposed to image classification, which assigns a label to the entire picture. As the name implies, object detection recognizes the target items inside an image, labels them, and specifies their position. One of the most prominent tools for object detection is the “bounding box,” which indicates where a particular object is located in an image and what the label of that object is. Essentially, object detection combines image classification and object localization.
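One way to see the difference is to compare what each task returns. The sketch below uses hypothetical result types and made-up values (the class names, scores, and pixel coordinates are assumptions): classification yields one label for the whole image, while detection yields a list of labeled bounding boxes.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class ClassificationResult:
    label: str    # one label for the entire image
    score: float  # model confidence between 0 and 1

@dataclass
class Detection:
    label: str                       # label of one object
    score: float                     # confidence for that object
    box: Tuple[int, int, int, int]   # bounding box as (x_min, y_min, x_max, y_max) in pixels

def box_area(box):
    """Area of a bounding box in square pixels."""
    x_min, y_min, x_max, y_max = box
    return max(0, x_max - x_min) * max(0, y_max - y_min)

# Image classification: a single answer for the whole picture.
classification = ClassificationResult(label="street scene", score=0.91)

# Object detection: one entry per object, each with its own label and position.
detections = [
    Detection("pedestrian", 0.88, (40, 60, 90, 200)),
    Detection("car", 0.95, (120, 100, 320, 220)),
]

for det in detections:
    print(det.label, det.box, box_area(det.box))
```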
How image classification works
The image we see as a whole is made up of a grid of tiny pixels, from a few hundred in a small thumbnail to millions in a high-resolution photo. Before computer vision can determine and label the image as a whole, it needs to analyze the individual components of the image to make an educated assumption. That is why image classification techniques analyze a given image in the form of pixels, treating the picture as an array of matrices whose size is determined by the image resolution. Based on these pixel values, the image is then assigned to one of a set of categories known as “classes.”
From here, the process differs based on the algorithm, but before examining the various machine learning algorithms, let's take a more general look at how it works. The chosen algorithm first transforms the image into a set of key attributes (features), so that the final classifier does not have to work from raw pixels alone. Those attributes help the classifier determine what the image is about and which class it belongs to.
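As a small illustration of the "array of matrices" idea, the sketch below stores a tiny grayscale image as one matrix and a tiny RGB image as three values per pixel. Plain nested lists are used only to keep the example dependency-free; the pixel values are made up, and in practice a numerical array library would normally be used.

```python
# A 2x3 grayscale image: one matrix with one intensity (0-255) per pixel.
gray = [
    [0, 128, 255],
    [64, 192, 32],
]

# A 2x3 RGB image: every pixel holds three values (red, green, blue),
# which is equivalent to three stacked 2x3 matrices, one per channel.
rgb = [
    [(255, 0, 0), (0, 255, 0), (0, 0, 255)],
    [(255, 255, 0), (0, 0, 0), (255, 255, 255)],
]

def shape(image):
    """Return (height, width, channels) for a nested-list image."""
    height, width = len(image), len(image[0])
    first = image[0][0]
    channels = len(first) if isinstance(first, (tuple, list)) else 1
    return height, width, channels

def split_channels(image):
    """Turn an RGB image into three single-channel matrices."""
    return [[[pixel[c] for pixel in row] for row in image] for c in range(3)]

print(shape(gray))             # resolution decides the matrix size
print(shape(rgb))
print(split_channels(rgb)[0])  # the red channel as its own matrix
```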
Overall, the image classification pipeline looks something like this:
Image pre-processing -> feature extraction -> object classification
At the image pre-processing stage, you apply methods that improve the quality of your image data and prepare it for the subsequent stages. Here are some examples of what the pre-processing stage can include:
- Image resizing: Image resizing changes the image's dimensions (width and height). Images are often resized to a standard size to make them less computationally expensive to process.
- Image cropping: Whenever an image contains irrelevant or unnecessary parts that may affect model performance (such as background or borders), it's better to crop the image and keep only the needed parts.
- Image normalization: Image normalization adjusts pixel values to a standard distribution or range. Common approaches include subtracting the mean and dividing by the standard deviation of the pixel values, rescaling pixel values to a fixed range, and using histogram equalization to adjust image properties (like brightness or contrast) and make the image more suitable for analysis.
- Noise reduction: Digital images often contain noise that can affect model accuracy and performance. Filtering techniques like Gaussian filtering, median filtering, or Wiener filtering improve image quality by smoothing the image while trying to preserve edges and details.
- Data augmentation: Data augmentation is the process of creating new variations of the images by applying transformations such as rotation, zooming, flipping, and changing the brightness and contrast. The goal of data augmentation is to diversify the training dataset and increase its size, which helps to improve the accuracy and robustness of your image classification model.
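Several of the pre-processing steps above can be sketched in a few lines. This is a minimal, dependency-free illustration on a tiny grayscale image stored as nested lists; the function names and the sample image are assumptions made for this example, and a real pipeline would typically rely on an image-processing library.

```python
import statistics

def center_crop(image, size):
    """Keep the central size x size region of the image."""
    top = (len(image) - size) // 2
    left = (len(image[0]) - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]

def rescale(image, max_value=255):
    """Map pixel values from [0, max_value] to the fixed range [0, 1]."""
    return [[p / max_value for p in row] for row in image]

def standardize(image):
    """Subtract the mean and divide by the standard deviation of all pixels."""
    pixels = [p for row in image for p in row]
    mean = statistics.fmean(pixels)
    std = statistics.pstdev(pixels) or 1.0  # avoid dividing by zero on flat images
    return [[(p - mean) / std for p in row] for row in image]

def median_filter(image):
    """3x3 median filter for noise reduction; border pixels are left unchanged."""
    out = [row[:] for row in image]
    for y in range(1, len(image) - 1):
        for x in range(1, len(image[0]) - 1):
            window = [image[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            out[y][x] = statistics.median(window)
    return out

def flip_horizontal(image):
    """A simple augmentation: mirror the image left to right."""
    return [row[::-1] for row in image]

noisy = [
    [10, 10, 10, 10],
    [10, 255, 10, 10],  # 255 is a single bright noise pixel
    [10, 10, 10, 10],
    [10, 10, 10, 10],
]

denoised = median_filter(noisy)            # the outlier is replaced by its neighbors' median
cropped = center_crop(noisy, 2)            # central 2x2 patch
normalized = standardize([[0, 2], [4, 6]]) # zero mean, unit standard deviation
mirrored = flip_horizontal(noisy)          # an extra training example
print(denoised[1][1], cropped, mirrored[1])
```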
Feature extraction is a substantial process in image classification for identifying visual patterns within an image that will be used to distinguish one object from another. The patterns are typically exclusive to the specific class of images which results in distinct class differentiation. Once the computer has learned these important image features and recognizes them in the training data, it can use them to classify new images that it has never seen before.
In the case of classifying dog and cat pictures, there are some patterns that can be used as features to differentiate the two classes, like fur texture and color, ear shape and position, nose/eye shape and color, and body shape and size. This procedure of learning the features from the dataset is called model training, which plays a crucial role in image analysis.
Feature extraction enhances machine learning models' performance by focusing on the most relevant and important aspects of the data. Without it, models would have to analyze entire images, which requires immense computational power and a very long time. Some of the techniques used for feature extraction are edge detection, texture analysis, and deep learning algorithms like CNNs. Let's explain a few of them to deepen our knowledge.
Edge detection refers to spotting boundaries between regions in an image, which are then used to acquire information about objects' shape and structure. There are several edge detection methods, from derivative-based gradient operators to more advanced techniques.
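As one concrete instance of a gradient operator, the sketch below applies the Sobel operator to a tiny image with a single vertical boundary. Sobel is chosen here only as a familiar example, not because the article prescribes it, and the image values are made up. The response is large where the brightness changes and zero in flat regions.

```python
import math

# Sobel kernels: horizontal and vertical brightness gradients.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def apply_kernel(image, kernel, y, x):
    """Weighted sum of the 3x3 neighborhood centered on (y, x)."""
    return sum(
        kernel[dy + 1][dx + 1] * image[y + dy][x + dx]
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
    )

def sobel_magnitude(image):
    """Gradient magnitude per pixel; border pixels are left at 0 for simplicity."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = apply_kernel(image, SOBEL_X, y, x)
            gy = apply_kernel(image, SOBEL_Y, y, x)
            out[y][x] = math.hypot(gx, gy)
    return out

# Dark region on the left, bright region on the right.
image = [[0, 0, 0, 255, 255, 255] for _ in range(4)]

edges = sobel_magnitude(image)
print(edges[1])  # strong response only around the dark-to-bright boundary
```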
Texture analysis is the procedure of finding repeating patterns within an image, which can be used to identify the presence of texture and distinguish between different materials or surfaces of objects. A famous practical application of texture analysis is identifying tumors in medical imaging -- the texture of cancerous tissue may differ from that of healthy tissue, assisting doctors in diagnosing tumor type.
Another notable technique from deep learning is the convolutional neural network (CNN), which is widely used for feature extraction because it allows algorithms to learn directly from raw data. In this case, the network is trained on a large dataset of labeled images and learns the most important patterns for different classes of images. Since convolutional neural networks are a notable topic when it comes to image classification, we'll devote a few more paragraphs to them later in the article.
Algorithms and models
There isn't one straightforward approach to image classification, so we will take a look at the two most notable kinds: supervised and unsupervised classification.
Supervised learning has a self-explanatory name: it is like a teacher guiding a student through a learning process. The algorithm is trained on a labeled image dataset, where the mapping between inputs and correct outputs is already known and the images are assigned to their corresponding classes. The algorithm is the student, learning from the teacher (the labeled dataset) to make predictions on new, unlabeled test data. After the supervision phase is completed, the algorithm refers to what it learned from the training data and looks for similarities between that data and the new input. Since it has already learned from the labeled data, it can apply the patterns it found to predict the classes of new images.
The process doesn't end there. In their turn, supervised algorithms can be divided into single-label classification and multi-label classification. As the name suggests, single-label classification refers to a singular label that is assigned to an image as a result of the classification process. It is by far the most common type of image classification we witness on a daily basis.
While single-label classification assigns the image a single class, multi-label classification can assign an image any number of classes at once. In the field of medicine, for example, a single medical image may show several diseases or anomalies for the same patient.
Some of the famous supervised learning algorithms include k-nearest neighbors, decision trees, support vector machines, random forests, linear and logistic regression, and neural networks.
Let's take a small tour of a few of them.
Logistic regression: Logistic regression is a binary classification algorithm, used in image classification to predict whether an image belongs to a certain category or not. It uses a logistic function to model the relationship between input features and class probabilities. The model assigns a probability value to each input, which is then thresholded to make the final binary classification decision.
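The probability-then-threshold idea can be sketched as follows. This is a minimal illustration trained with plain batch gradient descent on made-up two-number image descriptors; the feature meanings, learning rate, and epoch count are assumptions for the example, not recommended settings.

```python
import math

def sigmoid(z):
    """The logistic function: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Probability that the input belongs to the positive class."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))

def predict(weights, bias, features, threshold=0.5):
    """Threshold the probability to get the final yes/no decision."""
    return int(predict_proba(weights, bias, features) >= threshold)

def train(samples, labels, lr=0.5, epochs=1000):
    """Fit weights and bias with batch gradient descent on the log loss."""
    n = len(samples)
    weights, bias = [0.0] * len(samples[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(samples, labels):
            error = predict_proba(weights, bias, x) - y
            grad_w = [g + error * xi for g, xi in zip(grad_w, x)]
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

# Hypothetical descriptors per image, e.g. (mean brightness, edge density); 1 = "defect".
samples = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.8, 0.9], [0.9, 0.7], [0.7, 0.8]]
labels = [0, 0, 0, 1, 1, 1]

weights, bias = train(samples, labels)
print([predict(weights, bias, x) for x in samples])
print(round(predict_proba(weights, bias, [0.9, 0.9]), 3))
```

Moving the threshold away from 0.5 trades one kind of error for the other, which matters in settings such as medical screening.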
K nearest neighbors: K nearest neighbor classification is considered the simplest lazy algorithm that is famous for being easily understandable and interpretable. When we say lazy, we're not trying to bully the algorithm -- KNN is referred to as a "lazy learner" because it does not train itself when given training data; instead, it memorizes the entire dataset, leading to longer prediction times and increased computational complexity when new data points are encountered.
“I am as good as my nearest K neighbors.”
If you try to guess what KNN does just from its name, you'll most likely find the answer yourself. KNN finds the K nearest neighbors of the new point, checks the most frequent class among those neighbors, and assigns that label to the point. Despite all the nice things we said about KNN, we have to remember its main drawback: the prediction stage usually takes a lot of time, because every new point must be compared against the stored training data.
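The whole algorithm fits in a few lines, which is part of its appeal. Below is a minimal sketch using Euclidean distance and majority voting; the two-number feature vectors and the labels are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label the query with the most frequent class among its k nearest neighbors."""
    # "Lazy" learning: no training step, just a scan over all stored examples.
    by_distance = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical feature vectors extracted from images.
train_points = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [3.0, 3.2], [3.1, 2.9], [2.8, 3.0]]
train_labels = ["cat", "cat", "cat", "dog", "dog", "dog"]

print(knn_predict(train_points, train_labels, [1.1, 1.0]))
print(knn_predict(train_points, train_labels, [2.9, 3.1]))
```

Note that every prediction scans the full training set, which is the source of the slow prediction stage mentioned above.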
Support vector machines: In simple terms, a support vector machine separates classes with a line or boundary (called a hyperplane). It chooses the hyperplane that maximally separates data points of one class from another, i.e., the one that maximizes the distance between the hyperplane and the closest data points of each class.
If we're trying to classify an image as either "cat" or "dog," a support vector machine would come up with a line that separates the two. To do this, the SVM takes the features of each image (like color, texture, and shape) and tries to find the hyperplane that separates the two classes of images with the largest possible margin (as we said, the distance between the hyperplane and the closest data points).
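To illustrate what "largest possible margin" means, the sketch below computes the margin of two hand-picked candidate hyperplanes and keeps the better one. This is not an SVM solver: a real SVM searches over all hyperplanes with an optimization routine, while here the points, labels, and both candidates are made-up assumptions.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def margin(w, b, points, labels):
    """Distance from the hyperplane to the closest point (negative if any point is misclassified)."""
    return min(y * signed_distance(w, b, x) for x, y in zip(points, labels))

def predict(w, b, x):
    """Classify by which side of the hyperplane the point falls on."""
    return 1 if signed_distance(w, b, x) >= 0 else -1

# Hypothetical 2-D image features; -1 = "cat", +1 = "dog".
points = [[1.0, 1.0], [2.0, 1.0], [4.0, 4.0], [5.0, 4.0]]
labels = [-1, -1, 1, 1]

# Two candidate separating lines, each given as (w, b).
candidates = {
    "A": ([1.0, 0.0], -3.0),  # vertical line x1 = 3
    "B": ([1.0, 1.0], -5.5),  # diagonal line x1 + x2 = 5.5
}

for name, (w, b) in candidates.items():
    print(name, round(margin(w, b, points, labels), 3))

best = max(candidates, key=lambda name: margin(*candidates[name], points, labels))
print("preferred:", best)
```

Both lines separate the classes perfectly, but the one that stays farther from the closest points is preferred.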
Decision trees: A decision tree is another easily interpretable technique widely used in image classification. It's like a flowchart that your model creates to make decisions based on the features of the data. Imagine you're trying to guess which fruit someone is thinking of, but you can only ask simple questions about its features. You start with a broad question like "Is it round?", and then narrow it down with more specific questions like "Is it red or green?" or "Is it sweet or sour?" until you've guessed the fruit. Decision trees work the same way: they ask questions about the features of the data until they can make a prediction.
In our own decision tree example, you're given a few questions that lead to a final decision. If SuperAnnotate is the No. 1 annotation platform on G2 and provides comprehensive annotation tooling, advanced data curation, an easy-to-navigate user interface, and personalized, responsive customer support (which it in fact does :), your decision at each node will be "Yes," and the final "request a demo" leaf will indicate a potential deal with SuperAnnotate!
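The fruit-guessing game described earlier can be written down as a tiny hand-built tree. The questions and fruits below are illustrative assumptions; a trained decision tree learns its questions from data (for example, by choosing the splits that best separate the classes) rather than having them written by hand.

```python
# Each inner node asks a yes/no question; each leaf holds an answer.
TREE = {
    "question": "is_round",
    "yes": {
        "question": "is_red",
        "yes": {"answer": "apple"},
        "no": {"answer": "orange"},
    },
    "no": {
        "question": "is_sweet",
        "yes": {"answer": "banana"},
        "no": {"answer": "lemon"},
    },
}

def classify(node, features):
    """Follow the yes/no branches until a leaf with an answer is reached."""
    while "answer" not in node:
        node = node["yes"] if features[node["question"]] else node["no"]
    return node["answer"]

print(classify(TREE, {"is_round": True, "is_red": True, "is_sweet": True}))
print(classify(TREE, {"is_round": False, "is_red": False, "is_sweet": False}))
```

Because every prediction is just a path of answered questions, it is easy to explain why the model chose a given label.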
We can view unsupervised learning as the rebellious teenager of machine learning; unlike supervised learning, it doesn't want anyone telling it what to do, but rather discovers patterns and insights on its own, without a teacher to guide it. Here the algorithm is free to explore and learn without any preconceived notions.
In more machine learning terms, while the main goal of supervised learning is to learn a model of some real-world function from labeled data, the goal of unsupervised learning is to help us explore and understand the nature of the data. With unsupervised algorithms, no pre-existing tags are given to the system, only raw data. The system interprets the data on its own terms, recognizes image patterns, and draws conclusions from the data without human interference.
How does it know what to look for and then properly classify it? Unsupervised classification makes avid use of a concept called clustering: the unsupervised grouping of similar data points into clusters. However, this method will not give you a class automatically: you'll only have the different clusters, and you'll need to decide on a class for each of them in another way. There is a plethora of clustering algorithms, with some of the most notable being K-Means, Mini-Batch K-Means, Mean-Shift, DBSCAN, Agglomerative Clustering, BIRCH, and Gaussian mixture models fitted with Expectation-Maximization (EM).
It is important to note that there isn't a single best choice among these clustering algorithms. Instead, it is best to test several of them until you settle on the one that works best for the specific task at hand.
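Since clustering only produces anonymous groups, something else has to give each cluster a class name. One possible approach, sketched below under the assumption that a handful of images have been labeled by hand, is a majority vote inside each cluster. The cluster ids and labels are made up for illustration.

```python
from collections import Counter

def name_clusters(cluster_ids, known_labels):
    """Give each cluster the most common label among its hand-labeled members.

    known_labels maps a sample index to its label for the small labeled subset.
    """
    votes = {}
    for index, label in known_labels.items():
        votes.setdefault(cluster_ids[index], Counter())[label] += 1
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}

# Hypothetical output of some clustering algorithm: one cluster id per image.
cluster_ids = [0, 0, 1, 1, 0, 1, 1, 0]

# Only four of the eight images were labeled by a person.
known_labels = {0: "cat", 2: "dog", 4: "cat", 5: "dog"}

cluster_names = name_clusters(cluster_ids, known_labels)
predicted = [cluster_names.get(c, "unknown") for c in cluster_ids]
print(cluster_names)
print(predicted)
```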
Let's learn about some of these rebellious models:
K-means: K-means sounds like KNN, but on closer look they're more like two unalike sisters among image classification techniques. While KNN relies on neighbors' "opinions" to make decisions about new data points, K-means is an independent thinker, making decisions based on its own observations and perceptions. It clusters data points together based on the similarity of their characteristics, without any labels to guide it. It works by selecting k initial centroids, assigning each data point to its nearest centroid, recalculating the centroids based on the newly assigned data points, and repeating until the centroids no longer change significantly. The resulting clusters can help identify patterns and insights in the data.
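The assign-and-update loop described above can be sketched as follows. To keep the example deterministic, the initial centroids are passed in explicitly (implementations usually pick them at random), and the sample points and tolerance are assumptions for illustration.

```python
import math

def kmeans(points, initial_centroids, max_iters=100, tol=1e-6):
    """Cluster points by alternating assignment and centroid updates."""
    centroids = [list(c) for c in initial_centroids]
    clusters = []
    for _ in range(max_iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = [
            [sum(dim) / len(cluster) for dim in zip(*cluster)] if cluster else centroid
            for cluster, centroid in zip(clusters, centroids)
        ]
        # Stop once the centroids no longer change significantly.
        shift = max(math.dist(old, new) for old, new in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, clusters

points = [[1.0, 1.0], [1.5, 2.0], [1.0, 0.5], [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]]
centroids, clusters = kmeans(points, initial_centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```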
Gaussian mixture models: While K-means is straightforward and assigns data points to a single cluster based on their distance to cluster centroids, GMMs take a more sophisticated approach. They assume that the data points are drawn from a mixture of Gaussian distributions, allowing GMMs to capture more diverse data patterns, and handle complex and overlapping data distributions. This assumption makes GMMs a popular choice for image classification tasks.
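The key difference from K-means is the soft assignment: instead of one cluster per point, a GMM gives each point a probability of belonging to every component. The sketch below computes those probabilities for a one-dimensional mixture whose parameters are fixed by hand; fitting the parameters (typically with the EM algorithm) is left out, and the weights, means, and standard deviations are made-up assumptions.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def responsibilities(x, components):
    """Soft assignment: probability that x came from each (weight, mean, std) component."""
    weighted = [w * gaussian_pdf(x, m, s) for w, m, s in components]
    total = sum(weighted)
    return [v / total for v in weighted]

# Two equally weighted components centered at 0 and 4 (hand-picked, not fitted).
components = [(0.5, 0.0, 1.0), (0.5, 4.0, 1.0)]

for x in (0.0, 2.0, 4.0):
    print(x, [round(r, 3) for r in responsibilities(x, components)])
```

A point halfway between the two centers is shared equally between them, which is how a GMM represents overlapping clusters.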
Among the wide range of different image classification techniques, we discussed the most important ones, and for the final touch, we're about to learn about convolutional neural networks -- the game-changer for computer vision problems.
Deep neural networks for image classification
Deep learning methods have taken computer vision tasks to an even higher level of accuracy and efficiency, largely thanks to convolutional neural networks (CNNs). Note that CNNs are themselves supervised learning algorithms; due to their significance and current prominence in image classification, we've set aside a separate section to discuss them. CNNs are loosely modeled on the neural networks of the human brain and can complete specific computer processes with minimal human interference. As a type of neural network, convolutional neural networks have scored the best results in diverse image classification tasks. The stack of layers, from the input layer through the hidden inner layers to the output layer, is what makes the network “deep.” In brief, these are the core components of convolutional neural networks:
- Input layer: The first layer of each CNN is the input layer, where images or videos are taken and pre-processed and then passed to the next layers.
- Convolution layer: The next layer applies learnable filters to extract features from images. The output of this layer is a feature map that indicates the presence or absence of particular features in the input image.
- Pooling layer: The extracted features are then passed to the pooling layer, where the large images are shrunk down while making sure the most important information is preserved. The most common pooling operation, max pooling, selects the maximum value of each sub-region of the feature map.
- ReLU layer: The ReLU (Rectified Linear Unit) replaces every negative value in its input with 0. It is usually applied to the feature maps produced by a convolution layer, either before or after pooling. Surprisingly, this simple function introduces the non-linearity that allows your model to account for non-linear relationships and interactions very well, and it helps mitigate the vanishing-gradient problem during training.
- Fully connected layer: The fully connected layer takes the output of the previous layers and produces a final classification. Each neuron in this layer is connected to all neurons in the previous layer.
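The layers listed above can be chained into a toy forward pass. This sketch uses a tiny image, a hand-picked edge filter, and hand-picked fully connected weights, all of which are assumptions for illustration; a real CNN learns these values during training and is built with a deep learning framework. ReLU is applied before pooling here, which gives the same result as applying it after max pooling.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (no padding) and sum the element-wise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(kernel[i][j] * image[y + i][x + j] for i in range(kh) for j in range(kw))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]

def relu(feature_map):
    """Replace every negative value with 0."""
    return [[max(0, v) for v in row] for row in feature_map]

def max_pool(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size sub-region."""
    return [
        [
            max(feature_map[y + i][x + j] for i in range(size) for j in range(size))
            for x in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for y in range(0, len(feature_map) - size + 1, size)
    ]

def fully_connected(inputs, weights, biases):
    """Each output neuron is a weighted sum of all inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b for row, b in zip(weights, biases)]

# Input layer: a 5x5 image with a bright vertical stripe.
image = [[0, 0, 1, 1, 0] for _ in range(5)]
kernel = [[-1, 1], [-1, 1]]  # responds to dark-to-bright vertical edges

feature_map = conv2d(image, kernel)  # 4x4 feature map
activated = relu(feature_map)        # negative responses are zeroed
pooled = max_pool(activated)         # shrunk to 2x2
flat = [v for row in pooled for v in row]
scores = fully_connected(flat, weights=[[1, 0, 1, 0], [0, 1, 0, 1]], biases=[0, 0])

print(feature_map[0], activated[0], pooled, scores)
```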
Probably a lot of what you just read about convolutional neural networks seems confusing, and that's because the world of convolutional neural networks is just as layered and complex as the networks themselves. Fortunately, there are many insightful research papers with detailed technical explanations of CNN concepts in case further learning is needed.
Although the convolutional neural network is the big star in deep learning when it comes to image classification, plain artificial neural networks (ANNs) without convolution layers have also made important contributions in this field. ANNs were created to mimic the behavior of the human brain, using interconnected nodes that communicate with each other. They have been successfully applied to image classification tasks, including well-known examples such as handwritten digit recognition. Despite these early successes, convolutional neural networks have taken over the spotlight in most image classification tasks.
Data curation in SuperAnnotate
We mentioned in our decision tree example that one of the reasons to choose SuperAnnotate as your annotation platform is its comprehensive data curation. Data curation is about "taking care" of your data and making sure it's in good shape and ready for further use. As a matter of fact, you'll rarely have data with no impurities, especially in real-world scenarios. For an image classification problem, blurry, out-of-focus, distorted, or irrelevant/outlier images will disrupt the model training process and hurt performance.
By curating your data, you'll ensure better performance and accuracy, and end up with data that is more relevant and better suited to your image classification task. Note that without good data curation practices, your computer vision models may suffer from poor performance, low accuracy, and bias, leading to suboptimal results and even failure in some cases.
Applications of image classification
Image classification got its hype for a reason - it has become a game-changer in many fields such as medicine, autonomous driving, agriculture, security, retail, and more. Let's see why it became so popular across these fields.
It is no secret that the healthcare industry has been widely implementing computer vision throughout its activities. In one of our case studies, we share how SuperAnnotate helped Orsi, Europe’s leading advocate for robotic and minimally invasive surgeries, achieve 3x faster annotation for their surgical image data. It doesn't stop there: there are several such cases where medical companies streamline their processes by trusting industry-leading annotation companies to automate their data processes.
To be more specific, image classification has proved to be critical in analyzing medical images such as X-rays, CT scans, and MRIs to diagnose diseases. For instance, dermatologists use image classification algorithms to detect and diagnose skin conditions such as melanoma. By analyzing thousands of skin lesion images in the training data, these algorithms learn patterns and features that are specific to the disease. A study published in the European Journal of Cancer found that a deep learning algorithm trained on skin images outperformed most of the 157 dermatologists it was compared against in diagnosing skin cancer.
Autonomous driving is one of the leading users of image classification. The cameras and sensors attached to the cars are able to detect objects on roads, mostly thanks to machine learning algorithms trained on massive datasets of driving scenarios. The classifier helps the car respond to its surroundings by identifying whether an object is a pedestrian, vehicle, road sign, or tree.
Tesla's Autopilot is probably the best-known example, but Tesla is not the only company that utilizes autonomous driving technology. Other car manufacturers like GM, Audi, BMW, and Ford are also making strides in developing autonomous driving technology that enables cars to stay centered in their lanes.
Autonomous driving is also known for being one of the riskiest users of image classification. Why? The cars are supposed to handle complex and diverse environments which may include a wide range of weather conditions, lighting, and other factors which can potentially affect the object's appearance and lead to serious risks. This highlights the importance of utilizing deep learning models that are trained on large and diverse datasets which include a wide variety of driving scenes.
In agriculture, image classification is used to classify images of crops, identify pests and diseases, monitor plant growth, and ease the life of farmers in general. It's similar to having a farmer's sixth sense that can detect changes in the health of crops and soil, helping them make more informed decisions about irrigation, fertilization, and pest control.
FarmSense, a California-based agtech startup, uses novel video and image classification techniques to identify and track insects in real time, dramatically decreasing the immense damage caused to agriculture by insect pests. You may encounter similar cases more often in the future, since agtech is getting closer and closer to integrating machine learning solutions like image classification to cut resource costs.
The adoption of image classification in security gained traction over the past decade as the technology became more sophisticated and accessible. It started with surveillance systems, where it was used to analyze recorded video footage and identify potential security threats after the fact. However, with advancements such as faster processors and improved algorithms, real-time image classification for security purposes became feasible. Let's discuss a few cases where image classification helps achieve a more secure environment.
One way it is utilized is in security surveillance systems. Imagine a bustling airport or a crowded city street – image classification algorithms can automatically analyze the live video feed and identify potential threats or suspicious activities in real time. This helps security personnel to quickly respond and take appropriate action when necessary.
Image classification also assists a lot in facial recognition systems, which are commonly used in security applications. By analyzing facial features and matching them against training data made up of known individuals' photos, these systems can identify and track people of interest, such as wanted criminals or missing people. This technology helps law enforcement agencies in their investigative efforts and enhances public safety.
Additionally, image classification can be employed for object detection in security screening processes. For example, it can be used to automatically identify prohibited items, such as weapons or explosives, in luggage or belongings during airport security checks. By swiftly detecting potential threats, it enhances the effectiveness and efficiency of security protocols.
Key takeaways
We've come a long way from the beginning of the article, so let's recap what we've learned so far. Image classification is a branch of computer vision that deals with categorizing images using a set of predetermined tags on which an algorithm has been trained. We discussed the main image classification types, expanded on supervised and unsupervised learning algorithms, and saw where image classification comes in handy in the real world.
Overall, image classification continues to evolve, and staying up-to-date with the latest advancements and best practices will enable researchers and practitioners to harness the full potential of this powerful technology.