Embeddings in ML: Zero-shot, one-shot annotations, and vector similarity search

It is no secret that machine learning models work well only when the training data under the hood is of good quality. Why is that so? Because the quality of training data directly affects model accuracy and performance. Embeddings play a pivotal role in making this happen. They transform real-life objects and relationships into numerical representations because machines, unlike humans, work only with numbers. A machine can't just see an image, read text, or listen to audio. It processes numbers.

The concept of similarity search is easy to grasp: the goal is to find items similar to a given item under preferred conditions. Embeddings are pivotal for executing similarity searches, for the reason we just gave. Since a machine cannot act directly on a request like 'find images similar to this dog image,' we need to convert the image into a vector and use algorithms to find vectors similar to the one we chose.

In this article, we'll delve into the world of embeddings in machine learning, algorithms, and their role in similarity search. Also, we are very excited to introduce you to SuperAnnotate's new similarity search tool!

Embeddings

An embedding in ML is a low-dimensional vector representation of data that a machine learning model can easily process. Data comes in different forms (words, images, sounds), but ML models prefer their input in a consistent, numeric format. Let's talk about images. With Google Images, you can find visually similar images. How? By converting each image into a numeric embedding that captures the essence of the image and searching for similar images in the numeric space. The distance between two vector embeddings represents their relatedness: the greater the distance, the less related the objects are. Vector embeddings are at the core of NLP, search, and other ML areas (we'll focus more on images).
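To make "distance represents relatedness" concrete, here is a minimal sketch of two common ways to compare embeddings: Euclidean distance and cosine similarity. The four-dimensional vectors and their names (dog, puppy, car) are made up for illustration; real embeddings come from a trained model and usually have hundreds of dimensions.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two vectors (smaller = more related)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (closer to 1 = more related)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for this example only.
dog = [0.9, 0.8, 0.1, 0.0]
puppy = [0.8, 0.9, 0.2, 0.1]
car = [0.0, 0.1, 0.9, 0.8]

# The dog is closer to the puppy than to the car under both measures.
print(euclidean_distance(dog, puppy) < euclidean_distance(dog, car))  # True
print(cosine_similarity(dog, puppy) > cosine_similarity(dog, car))    # True
```

Which measure to use depends on the model that produced the embeddings; cosine similarity ignores vector length, while Euclidean distance does not.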

Image embedding

Here's the thing. Images are complex. For a computer, an image is a grid of colors. An average photo on your phone has millions of pixels, and each pixel has its own color values. Imagine how hard it is to work with images in this form; you can't just ask your computer to compare millions of raw pixels. This is where embeddings come to the rescue. They condense the image into its very essence: a list of numbers that captures the "heart and soul" of the image.

"[...] once you understand this ML multitool (embedding), you'll be able to build everything from search engines to recommendation systems to chatbots and a whole lot more. You don't have to be a data scientist with ML expertise to use them, nor do you need a huge labeled dataset." - Dale Markowitz, Google Cloud.

Note that the numbers generated as the output of the embedding aren't really interpretable. You can't say why the first value is 1.0 or why it is followed by 0.5. However, each number encodes some property of the image. For example, some values might reflect the presence or orientation of edges, color gradients, and so on.

Take a look at this image and interpret it. While our interpretation might be subjective, let's say that the dog is super cute and friendly, the cat is cute but doesn't like to make friends with people, the lizard is friendly but subjectively not cute, and we couldn't find anything friendly or cute about the blobfish on the bottom left. This is our human interpretation of these animals. In the ML world, an ML model takes this information that we intuitively see and converts it into embedding space by converting the images into numerical values.

How to get embeddings

Let’s talk about the algorithms that generate embeddings and discuss the pros and cons of the major ones. Most of them are convolutional neural networks, which we’ll touch on briefly before delving into the models themselves.

Algorithms

Convolutional neural networks (CNN)

Convolutional Neural Networks (CNNs), a cornerstone of deep learning architectures alongside advanced neural networks, have long been celebrated for their prowess in image classification tasks. In the realm of CNNs, embeddings are often crafted for images, effectively converting rich visual information into condensed vector forms. This transformation, driven by the convolutional layers of the network, ensures that semantically similar data points lie closer in the embedded space. Whether it's for similarity searches, recommendation systems, or any task that requires understanding the essence of an image, CNN-driven embeddings are at the heart of it, bridging the gap between high-dimensional data and meaningful, actionable insights.

Central to the CNN architecture are the convolutional layers, which serve as the primary feature detectors. Imagine viewing an image under a magnifying glass, one small section at a time. This is essentially how convolutional layers operate, focusing on small parts of the image to detect local patterns, whether it's the curve of a smile or the sharp edge of a building. As we move deeper into the network, these local patterns combine, allowing the network to recognize more abstract and complex patterns, transforming mere pixels into meaningful visual stories.
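The "magnifying glass" idea can be sketched in a few lines. The example below slides a small kernel over a toy image and sums the element-wise products at each position (what deep learning libraries call convolution). The 4x4 image and the 2x2 vertical-edge kernel are hand-picked assumptions; in a real CNN the kernel values are learned during training.

```python
def conv2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum over the small window the kernel currently covers.
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output

# Toy image: dark left half (0), bright right half (1).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-written kernel that responds to dark-to-bright vertical edges.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

feature_map = conv2d(image, vertical_edge)
print(feature_map)  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The feature map lights up only in the middle column, exactly where the edge is, which is the kind of local pattern the paragraph above describes.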

Next in line are the pooling layers, which, in essence, simplify the image data. By performing operations like max-pooling or average pooling, these layers retain the crux of the image's information while cutting down on computational overhead. You essentially distill the most meaningful details from the photograph and make it a 'sketch,' preserving the essence while shedding some extra layers.
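As a rough illustration of how pooling turns a feature map into a smaller 'sketch,' the snippet below applies 2x2 max pooling and 2x2 average pooling to a toy feature map. The function name pool2d and the input values are assumptions made for this example.

```python
def pool2d(feature_map, size=2, reduce=max):
    """Split the map into non-overlapping size x size windows and reduce each to one value."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(reduce(window))
        pooled.append(row)
    return pooled

def mean(values):
    return sum(values) / len(values)

# Toy 4x4 feature map with made-up activations.
feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 3],
]

print(pool2d(feature_map, reduce=max))   # [[4, 2], [2, 7]]
print(pool2d(feature_map, reduce=mean))  # [[2.5, 1.0], [1.25, 5.25]]
```

Either way, a 4x4 map becomes a 2x2 map: a quarter of the values to process in the next layer, with the strongest (or average) response of each region kept.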

Tying everything together are the fully connected layers. Think of them as the grand consolidators. They take in the plethora of visual information deciphered by the previous layers and produce a concise output. This could be a classification (is the image a cat or a dog?) or, in our context, an image embedding — a dense vector capturing the image's soul.

A feature that truly gives CNNs their versatility is the inclusion of activation functions. By introducing nonlinearity into the mix, functions like ReLU or sigmoid ensure that CNNs aren't just linear transformers but can understand and represent the rich tapestry of patterns present in visual data.
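The fully connected layers and activation functions described above can be sketched together. Below, a tiny dense layer computes a weighted sum per output unit and passes it through ReLU. The weights, biases, and inputs are invented for illustration; a trained network learns them from data.

```python
import math

def relu(x):
    """Keep positive values, replace negative values with 0."""
    return max(0.0, x)

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: weighted sum plus bias per unit, then the activation."""
    return [
        activation(sum(w * x for w, x in zip(unit_weights, inputs)) + b)
        for unit_weights, b in zip(weights, biases)
    ]

# Hypothetical features coming out of the earlier layers.
features = [1.0, -2.0]
weights = [[0.5, 0.5], [1.0, -1.0]]  # one row of weights per output unit
biases = [0.0, 0.0]

print(dense(features, weights, biases, relu))  # [0.0, 3.0]
print(sigmoid(0.0))                            # 0.5
```

Without the activation, stacking several such layers would collapse into a single linear transformation; the nonlinearity is what lets depth add expressive power.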

We will later discuss some CNN techniques that are pivotal for creating image embeddings.

Top 5 pre-trained models for creating image embeddings

Note that all the techniques we'll discuss deserve separate research, so don't expect to learn everything about them from this article. Here, you’ll get an idea of how embeddings come to life and what's the process under the hood.

Here are the five embedding creation techniques we'll cover.

VGG-16: At the heart of VGG-16 lies simplicity and depth. Born from the Visual Geometry Group at Oxford, VGG-16 employs a sequential stack of convolutional layers. With its 16-layer depth, it’s built for clarity, capturing intricate visual hierarchies. While not the lightest model around, its strength is its straightforward and consistent structure.

ResNet50: ResNet-50 is a convolutional neural network that is 50 layers deep. The fundamental breakthrough of the ResNet family was making it possible to train extremely deep neural networks, with more than 150 layers. Its innovation is the "skip connection" or "residual connection." Instead of forcing every signal through every layer in sequence, a skip connection adds a block's input directly to its output, so information can bypass layers and is not degraded over the network's 50 layers. This leapfrogging approach ensures that image features remain crisp and discernible.
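Here is a minimal sketch of a skip connection, assuming vectors as plain lists and a stand-in function in place of the real convolutional layers inside a ResNet block. The block's output is its input plus whatever the layers compute.

```python
def residual_block(x, layers):
    """Return x + layers(x): the input skips over the layers and is added back."""
    transformed = layers(x)
    return [xi + ti for xi, ti in zip(x, transformed)]

# Stand-ins for the learned layers inside a block (hypothetical, for illustration).
def halve(values):
    return [0.5 * v for v in values]

def zeros(values):
    return [0.0 for _ in values]

x = [1.0, 2.0]
print(residual_block(x, halve))  # [1.5, 3.0]
print(residual_block(x, zeros))  # [1.0, 2.0] -> the block falls back to the identity
```

The second call shows why this helps very deep networks: even if a block's layers contribute nothing, the input still passes through unchanged instead of being lost.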

Inceptionv3: Inception v3 is an image recognition model that has been shown to attain greater than 78.1% accuracy on the ImageNet dataset. Its design was intended to allow deeper networks while keeping the number of parameters from growing too large: it has "under 25 million parameters", compared to 60 million for AlexNet. Diversity is at the core of Inceptionv3. Instead of sticking to one filter size, it applies several concurrently, capturing details both minute and grand. Its architecture acknowledges the multi-scale nature of visual information, ensuring its embeddings reflect varied image features.

EfficientNet: EfficientNet is a convolutional neural network architecture and scaling method that uniformly scales the depth, width, and resolution dimensions using a compound coefficient. Unlike the conventional practice of scaling these factors arbitrarily, or of merely stacking more layers, it scales network width, depth, and resolution in tandem. The outcome is a model that, while light, delivers rich embeddings without compromising on speed or efficiency.
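Compound scaling is easier to see with numbers. In the sketch below, a single coefficient phi raises three base constants to the same power, so depth, width, and resolution grow together. The constants 1.2, 1.1, and 1.15 are the values reported in the EfficientNet paper for its baseline network; the function name compound_scale is our own, and the sketch ignores the rounding that real implementations apply to layer and channel counts.

```python
# Base scaling constants as reported in the EfficientNet paper (baseline search).
ALPHA = 1.2   # depth
BETA = 1.1    # width
GAMMA = 1.15  # input resolution

def compound_scale(phi):
    """Return the multipliers applied to depth, width, and resolution for a given phi."""
    return {
        "depth": ALPHA ** phi,
        "width": BETA ** phi,
        "resolution": GAMMA ** phi,
    }

for phi in range(4):
    factors = compound_scale(phi)
    print(phi, {name: round(value, 2) for name, value in factors.items()})

# Compute cost grows roughly with depth * width^2 * resolution^2,
# which is close to 2 for these constants, so each step of phi about doubles it.
flops_growth_per_step = ALPHA * BETA ** 2 * GAMMA ** 2
```

The design choice is that one knob (phi) replaces three separately tuned ones, which is what "in tandem" means in practice.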

ViT (Vision Transformer): In our article about large language models (LLMs), we talk about transformers. Transformers are linguistic heroes in natural language processing, but they’re pivotal in vision, too. The transformer has made its mark in vision with the Vision Transformer (ViT) by segmenting images into patches and processing them in a unique way. Research shows that transformers lagged behind the established ResNet architectures on smaller datasets. But, with a larger dataset in play, the Vision Transformer (ViT) showcased impressive prowess, often reaching or surpassing top benchmarks in image recognition.

While transformers are the hot favorite in the NLP world thanks to their smooth scaling and efficiency, the computer vision arena still sees a lot of love for CNNs. That said, some researchers have been playing mix-and-match, blending CNNs with a sprinkle of self-attention. The ViT authors instead applied a transformer directly to images. As noted above, the results were okay but not exactly headline-worthy on smaller datasets compared to the good old ResNets, while on larger datasets ViT steps up its game, often matching or even outdoing some of the top performers in image recognition.
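The first step ViT takes, segmenting an image into patches, can be sketched as follows. This toy version works on a single-channel 4x4 "image" stored as nested lists and flattens each 2x2 patch into a short vector; the function name and sizes are assumptions for illustration, and the later steps (linear projection, position embeddings, the transformer itself) are left out.

```python
def split_into_patches(image, patch_size):
    """Cut the image into non-overlapping square patches and flatten each one."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch_size + 1, patch_size):
        for left in range(0, width - patch_size + 1, patch_size):
            patch = [image[top + r][left + c]
                     for r in range(patch_size) for c in range(patch_size)]
            patches.append(patch)
    return patches

# Toy image whose pixel values are simply 0..15, so patches are easy to read.
image = [[row * 4 + col for col in range(4)] for row in range(4)]

patches = split_into_patches(image, patch_size=2)
print(patches)
# [[0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 12, 13], [10, 11, 14, 15]]
```

Each flattened patch then plays the role that a word token plays in an NLP transformer.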

Choosing a model

Model choice is pivotal for the later development of embeddings. When you're weighing CNNs against vision transformers, it's a bit of a balancing act. CNNs, being the older players in the game, are neat and efficient, making them a go-to for spaces where resources are thin. Their track record in handling image tasks is commendable, and they consistently deliver accuracy across many vision challenges.

In the other corner, we've got vision transformers. These models bring a fresh lens to the scene, capturing holistic views of images and digging deep into their contexts. While they can lead in specific areas, especially when they've got a lot of input data to play with, they can be resource-hungry. Their larger footprint in terms of size and memory might make them a tad cumbersome in settings where every byte counts. It's all about sizing up your project's needs, available tools, and balancing complexity, performance, and resource availability.

Utilization techniques (similarity search)

KNN (K-Nearest Neighbors): KNN is a straightforward method in similarity search. Given image embeddings, it works by measuring distances between them to find the closest matches. Instead of relying on a single closest match, KNN checks for 'K' closest matches in the dataset. This approach reduces errors from anomalies and provides more reliable results. When using image embeddings, KNN becomes particularly effective for tasks like grouping similar images or recommending images based on similarity. Popular tools, such as Scikit-learn, often have built-in functions to implement KNN efficiently.
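A brute-force version of KNN fits in a few lines of plain Python. The sketch below measures the distance from the query to every stored embedding and returns the K closest names. The image names and two-dimensional vectors are made up; libraries such as Scikit-learn offer optimized implementations of the same idea.

```python
import math

def k_nearest(query, embeddings, k):
    """Return the names of the k embeddings closest to `query` (exact, brute force)."""
    ranked = sorted(embeddings.items(), key=lambda item: math.dist(query, item[1]))
    return [name for name, _ in ranked[:k]]

# Hypothetical image embeddings, invented for this example.
embeddings = {
    "dog_1": [0.9, 0.1],
    "dog_2": [0.8, 0.2],
    "cat_1": [0.2, 0.9],
    "car_1": [0.5, 0.5],
}

query = [0.88, 0.1]  # embedding of a new dog photo (also made up)
print(k_nearest(query, embeddings, k=2))  # ['dog_1', 'dog_2']
```

Because every stored vector is compared with the query, the cost grows linearly with the size of the dataset, which is the limitation ANN methods try to address.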

ANN (Approximate Nearest Neighbors): ANN is essentially a variation of KNN (K-Nearest Neighbors). Instead of searching through all the items in the dataset, which can be computationally expensive and time-consuming for large datasets, ANN aims to streamline the process by examining only a subset of items that are most likely to be close to the query point. This approximation can lead to significant speed improvements without sacrificing too much accuracy, making it particularly suitable for applications with large datasets or where real-time results are essential.

Few-shot learning

Let's chat about this nifty thing called few-shot learning. As an ever-evolving field, machine learning constantly brings forth innovative approaches to solving intricate problems, and few-shot learning stands out among them as a refreshing approach to understanding and acting on limited data. Simply put, it's the art of teaching a machine with minimal data. Instead of feeding our models thousands of examples, few-shot learning says, "Hey, even a handful can do."

Zero-shot learning: In zero-shot learning, a machine is trying to make sense of something it has never seen. For example, consider a recommendation system that has details about rock and pop music but needs to categorize a new jazz track. Using zero-shot learning, it might label it correctly based on related musical characteristics. Embeddings come into play here, linking what's known to what’s new.

One-shot learning: Next up, we have one-shot learning. Imagine teaching a facial recognition system to identify a person from just one photo. Later, when a slightly different picture of the same person pops up – maybe they're wearing glasses this time – the system still recognizes them. That's one-shot learning in action. And embeddings help by capturing the unique facial features from the single photo provided.

Few-shot learning: Finally, there's the broader category: few-shot learning. Let's say you want a machine to differentiate between different dog breeds. Instead of giving it thousands of pictures for each breed, you provide a handful for, say, golden retrievers, poodles, and beagles. By using those limited images, the machine gets a good grasp and is able to identify them accurately. Embeddings are crucial here, too, linking the visual cues and breed characteristics together.
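One common way embeddings enable one-shot and few-shot classification is a nearest-prototype approach: average the handful of example embeddings for each class, then assign a new image to the class whose average is closest. The sketch below assumes the embeddings already exist; the breed vectors are invented, and this is one possible technique rather than the only way few-shot learning is done.

```python
import math

def class_prototypes(support_set):
    """Average the few example embeddings of each class into one prototype vector."""
    prototypes = {}
    for label, vectors in support_set.items():
        dims = len(vectors[0])
        prototypes[label] = [sum(v[d] for v in vectors) / len(vectors)
                             for d in range(dims)]
    return prototypes

def classify(query, prototypes):
    """Return the label whose prototype is closest to the query embedding."""
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))

# Two hypothetical example embeddings per breed (a "2-shot" setup).
support_set = {
    "golden retriever": [[0.9, 0.1, 0.2], [0.8, 0.2, 0.1]],
    "poodle": [[0.1, 0.9, 0.2], [0.2, 0.8, 0.3]],
    "beagle": [[0.2, 0.1, 0.9], [0.1, 0.2, 0.8]],
}

prototypes = class_prototypes(support_set)
print(classify([0.85, 0.15, 0.2], prototypes))  # golden retriever
print(classify([0.15, 0.85, 0.2], prototypes))  # poodle
```

With exactly one example per class, the prototype is just that example, which gives the one-shot case for free.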

Few-shot learning, encompassing zero-shot and one-shot, is groundbreaking. In a world obsessed with big data, it reminds us that you can do more with less.

Similarity search

At its core, similarity search is a method used to locate items in a dataset that are most like a given query item. In practice, it measures how close or distant data points are to each other, typically based on certain features or embeddings. Similarity search is like that feeling when you're trying to remember a song and can only hum a few notes. You might not have the exact details, but you're sure there's a match out there.

This tool is critical in many fields, from image recognition to recommendation systems. By determining how similar two items are, similarity search aids in tasks such as finding duplicate images, recommending similar products, or even grouping data into clusters. Tools and algorithms that support similarity search, like KNN, are often employed to achieve precise and efficient results.
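As a small illustration of the duplicate-finding use case, the sketch below compares every pair of embeddings and reports the pairs whose cosine similarity exceeds a threshold. The file names, vectors, and the 0.99 threshold are assumptions for this example; a suitable threshold depends on the embedding model and the data.

```python
import math
from itertools import combinations

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def near_duplicates(embeddings, threshold=0.99):
    """Return every pair of items whose embeddings are almost identical in direction."""
    pairs = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(embeddings.items(), 2):
        if cosine_similarity(vec_a, vec_b) >= threshold:
            pairs.append((name_a, name_b))
    return pairs

# Hypothetical embeddings: img_b is a slightly altered copy of img_a.
embeddings = {
    "img_a": [1.0, 0.0],
    "img_b": [0.999, 0.01],
    "img_c": [0.0, 1.0],
}

print(near_duplicates(embeddings))  # [('img_a', 'img_b')]
```

Comparing all pairs is fine for small collections; for large ones, the pairwise loop would be replaced by a nearest-neighbor index.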

Why is it game-changing? With similarity search powered by embeddings, the system can swiftly show options that are, well, similar.

Applications

Imagine you have an online store with thousands of products (as we know, e-commerce is a playground for machine learning models). A customer loves a particular pair of shoes but wants something slightly different, maybe a different color. Platforms like this showcase items not just randomly but based on deep semantic similarity. So when you see a product, the embedding model runs a similarity check, displaying items that are similar.

Or, take content recommendations. Say you just read a captivating article about space exploration. Instead of leaving you hanging, similarity search, using embeddings, can suggest other reads that orbit (pun intended) the same topic. It feels like the system understands you, suggesting you content that aligns with your interests.

Beyond shopping, content platforms use machine learning algorithms and word embeddings to suggest related reads or videos after you've consumed content. Whether it's one-shot learning, few-shot learning, or even zero-shot learning, these models, with their prior knowledge, ensure users are always engaged and find value.

To wrap it up, similarity search, with a little help from embeddings, brings a kind of intuition to our digital interactions. It's not about exact matches but finding that "almost perfect" fit.

Similarity search in SuperAnnotate

Now that you know more about embeddings and similarity search, let's get to the heart of the matter. At SuperAnnotate, we recently revolutionized the way our clients interact with their data by providing them with new features. With our new similarity search tool, clients have the freedom to manipulate their data and extract valuable information from it for later use in practical applications.

What do we offer?

1. Image and text embeddings

2. Find Similar

Outliers to correct - Deal with edge cases in data.
Errors to find - Find similar errors and manage them.
Similar data to label - Find the cases in a large dataset where the model is having difficulty, then find similar data and label it.

3. Natural Language Search

Find unlabeled objects - Search datasets for specific unlabeled images, such as "person sitting in a car."
Specific image qualities - Filter images with particular attributes, such as "dusk settings" or "more than two individuals."
Count of unlabeled items - Use the query “Show me images with more than two people” or similar queries that require counting unlabeled items in data.

Why is it so useful?

Let's look at a few use cases showing how this tool helps clients deal with their data.

Outliers to correct

When working with data, outliers can be problematic. They can lead to inaccurate results and complicate training processes. Our tool's similarity search feature is designed to tackle this problem head-on. If an error or anomaly is spotted in an image or text, the tool enables the user to search for similar instances. This helps in identifying other potential outliers. The real advantage for our clients is the ability to detect and correct these outliers quickly, ensuring their data is clean and models are trained more effectively.

Find long-tailed data

Our tool can be useful in identifying cases of long-tail distribution. In the graph below, a handful of books are overrepresented (the first 3-4 bars), while the majority have very few instances. It's essential to focus on these underrepresented examples (books you've never heard of). By using our tool to find similar items in the long tail, especially those the model might not be familiar with, we can prioritize labeling them. This targeted approach ensures the model gets a balanced view of the data during training. In turn, it helps select the right data points to focus on, enhancing the overall training process.

Data under specific conditions

Sometimes, clients need to find data that meets specific conditions. These can be any details: a specific number of items, a color, the position of an object, and so on. All you need to do is type the requirement in the search panel, and the tool will fetch the matching images through natural language search.

For example, suppose you search the unlabeled data for animals and see a drawing of an animal rather than a photo of a real one, which you do not want in your training data. You can just select that image, find similar images of animal drawings, and remove them from the training data.

Zero-shot annotation

Remember the concept of zero-shot learning? We introduce something very similar in our tool: zero-shot annotation. As previously mentioned, zero-shot learning does not involve training the model with predefined classes or pre-labeled examples. Instead, you initiate the process by selecting a particular item or even specific segments within that item. The underlying embedding-based algorithm then scours the dataset, hunting for data points that share inherent similarities with the selected item based on their features.

Zero-shot annotation is a concept for grouping similar items into their respective classes. You're selecting one item (or segments in it), finding similar data points, and bringing them to the same class. And why zero-shot? Because there are no labeled examples or training data involved in the process, but the model finds similar data points based on the underlying patterns of the data. It's like grouping apples from a mixed fruit basket by showing the system a single apple slice. The power here lies in its ability to group and classify data into respective classes based solely on their inherent properties, devoid of explicit prior examples. This paradigm of classifying data without explicit prior examples not only expedites the similarity search process but also allows for more flexible and adaptive data categorization, especially in contexts where labeled data is scarce or expensive to obtain.
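The general idea of "select one item, find similar data points, bring them to the same class" can be sketched as follows. This is not SuperAnnotate's implementation, only an illustration under simple assumptions: embeddings are already computed, similarity is Euclidean distance, and a fixed distance cutoff decides what counts as similar. All names and values are hypothetical.

```python
import math

def annotate_similar(seed_name, embeddings, label, max_distance):
    """Give `label` to the selected item and to every item close enough to it."""
    seed = embeddings[seed_name]
    return {
        name: label
        for name, vector in embeddings.items()
        if math.dist(seed, vector) <= max_distance
    }

# Made-up embeddings for an unlabeled mixed fruit basket.
embeddings = {
    "fruit_1": [0.9, 0.1],    # the apple the user selects
    "fruit_2": [0.85, 0.15],  # looks like an apple too
    "fruit_3": [0.1, 0.9],    # something else entirely
}

annotations = annotate_similar("fruit_1", embeddings, label="apple", max_distance=0.2)
print(annotations)  # {'fruit_1': 'apple', 'fruit_2': 'apple'}
```

No labeled training set is involved: the only human input is the single selected item and the class name to assign.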

Key takeaways

The future of machine learning is way beyond traditional machine learning algorithms. Deep learning, particularly through deep neural networks, is revolutionizing how we interact with data. This article emphasized the crucial role of embeddings in representing complex data types. Techniques like convolutional neural networks excel at generating these vector representations, placing similar data points close together in the same vector space. This not only facilitates efficient similarity search, such as reverse image search, but also aids in the representation of categorical variables, reducing the need for traditional methods like one-hot encoding. By harnessing these vector embeddings, machine learning algorithms can uncover hidden patterns in data, offering meaningful insights and predictions, even on unseen data.

Happy searching! 🧭