It’s hard to imagine modern artificial intelligence (AI) algorithms, especially computer vision algorithms, without datasets. These act as both a source of training data and a means of comparing and measuring algorithm performance. It’s clear that AI systems make decisions based on the training data they receive. Because humans collect and process this data, it often reflects biased approaches, such as historical or social inequalities. It follows that machine learning (ML) algorithms are only as good as the people who develop them. Some may assume that the solution is simply to remove sensitive variables such as gender, race, or any other interfering component; however, practice shows that this alone, especially when building an AI model, is not enough.
That’s why we organized a webinar to discuss bias in ML, focusing on dataset bias, where we shared best practices to measure it and keep it to a minimum.
What is bias?
With so many definitions within and outside ML, it’s safe to say that bias is a complex term. Generally speaking, bias is an inclination or prejudice for or against a specific choice. Similarly, when discussing AI model performance, bias can be seen as a tendency towards a particular prediction. In neural networks, bias is also the constant term added to a perceptron’s weighted sum. And when it comes to model accuracy, we often run into the term bias-variance tradeoff.
- Bias, in this context, is an error stemming from false assumptions in the learning algorithm. When bias is high, the algorithm misses relevant relations between features and target outputs, which is called underfitting.
- Variance is an error from sensitivity to small fluctuations in the training set. A model with high variance fits the random noise in the training data, and that’s overfitting.
Ideally, we want both low bias and low variance, but in practice there is often a tradeoff where we increase one to decrease the other until we reach the desired model accuracy.
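To make the neural-network sense of the word concrete, here is a minimal sketch of a single perceptron with its bias term. The input values and weights are made up for illustration; the point is just that the constant bias shifts the decision threshold of the unit.

```python
import numpy as np

def perceptron(x, weights, bias):
    # Weighted sum of inputs plus the constant bias term,
    # passed through a simple step activation.
    return 1 if np.dot(weights, x) + bias >= 0 else 0

x = np.array([0.5, -1.0])
weights = np.array([1.0, 2.0])

# With bias 0 the weighted sum is 0.5 - 2.0 = -1.5, so the unit stays off.
print(perceptron(x, weights, 0.0))   # 0
# A bias of +2.0 shifts the threshold enough to turn the unit on.
print(perceptron(x, weights, 2.0))   # 1
```

This "bias" is just a learned parameter and is unrelated to dataset bias, which is the focus of the rest of this post.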
Dataset bias in a nutshell
Dataset bias is an error in the ML dataset where certain elements are weighted and represented more heavily than others. Depending on where and how these errors occur, bias can be classified differently.
Dataset sample bias
Also known as selection bias, sample bias occurs when a dataset does not represent the environment in which the model is going to operate.
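A quick first check for sample bias is to compare the class distribution of your dataset against what the model will actually see in production. The labels below are hypothetical, purely for illustration:

```python
from collections import Counter

def class_distribution(labels):
    # Fraction of samples per class — a cheap first signal of sample bias.
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}

# Hypothetical example: a driving dataset collected almost only in daylight.
train_labels = ["day"] * 950 + ["night"] * 50
print(class_distribution(train_labels))  # {'day': 0.95, 'night': 0.05}
# If the model will run around the clock, night scenes are badly underrepresented.
```

In practice you would run the same check over every attribute that matters for deployment (lighting, geography, demographics), not just a single label.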
Human sampling bias
This type depends more on the people who work with the dataset than on the data itself: even given a clean and comprehensive dataset with varied data points, we can still deliberately or unintentionally introduce bias when sampling it.
How do we minimize bias?
Once we know why and how bias occurs, it’s much easier to control the situation and find ways to keep it low.
Annotation instructions are a primary source of bias. If there’s a lot of room for interpretation and the details are unclear, then no matter how professional your annotation team is, they will interpret the data points differently. As a result, you might end up with a dataset that contradicts itself or is skewed in a direction you don’t want. That’s why you should design an instructions document that describes the desired outcome, provides annotation examples, and covers the edge cases along with the most common ambiguities.
Create a workflow that fits your project, yet doesn’t leave the responsibility on annotators alone. At SuperAnnotate, we work with a multi-level QA system that ensures annotations get approved or rejected at different levels. For this, we’ve built a communication-friendly environment where team members with different roles can share detailed comments and approvals within the platform.
Something we do at SuperAnnotate, and suggest you implement during your next annotation project, is data versioning. It means creating snapshots of your dataset at different points in time, which allows you to benchmark models, roll back, and revert errors. It also allows you to review the dataset once it’s finished. We dive deeper into the benefits of data versioning during the webinar, showing how to filter, run queries, view specific classes, and more. This lets us further investigate datasets, minimize bias, and improve the annotation process overall.
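The core idea of data versioning can be sketched in a few lines. This is not SuperAnnotate’s implementation, just a minimal illustration: each snapshot serializes the annotations deterministically and is keyed by a content hash, so earlier versions stay addressable.

```python
import hashlib
import json

def snapshot(annotations, store):
    # Serialize deterministically, then key the snapshot by its content hash.
    blob = json.dumps(annotations, sort_keys=True).encode()
    version = hashlib.sha256(blob).hexdigest()[:12]
    store[version] = blob
    return version

store = {}
v1 = snapshot({"img_001.jpg": "cat"}, store)
v2 = snapshot({"img_001.jpg": "dog"}, store)  # label corrected later

# Both versions remain available, so we can benchmark a model
# against either snapshot or revert to the earlier one.
print(v1 != v2)  # True
print(json.loads(store[v1]))  # {'img_001.jpg': 'cat'}
```

Real versioning tools add metadata, diffs, and storage deduplication on top, but the benchmark-and-revert workflow rests on this same snapshot idea.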
Consensus is a common tactic to reduce bias. Typically, it means multiple people working on the same annotations, not consecutively but independently. It does minimize bias, but it also increases the amount of work. We can implement consensus as a workflow and draw conclusions from the majority decision, or apply stricter criteria and only accept the outputs where everyone agrees.
As with any other bias prevention method, consensus has its limitations. It fits classification tasks well, where the output is clearly defined and we can check that it matches expectations. It is harder to implement for tasks such as segmentation or object detection, because there are many more factors involved (location, class, shape), and it’s difficult to design an unbiased agreement metric. For these cases, multi-level QA usually performs better.
Wrapping up dataset bias
Bias is always present, both in our minds and in the datasets we produce. It can, however, be detected and reduced through solid practices and workflows. During this webinar, we shared our thoughts and illustrated how to handle bias. It’s especially easy to do with SuperAnnotate because our platform was built around the annotator’s experience, with business efficiency and bias reduction in mind.
We are happy to invite you to our next webinar series about Automated CV pipelines, where we’ll reveal ways to automate the data annotation process of any computer vision project. Register here and we’ll see you on March 16.