Object detection is an extremely common use case of deep learning in computer vision. Its main task is to put accurate bounding boxes around all the objects of interest in an image and correctly label such objects. While this sounds simple enough as a high-level explanation, when we get to actually implementing an object detector it is not immediately obvious how we should go about evaluating our model’s performance. After all, the concept of correctness isn’t as clear-cut here as in the case of, say, classification. This is where Intersection over Union and mean Average Precision (mAP) come in. These concepts may sound intimidating (especially since mean and average are the same thing, which can cause confusion), but they are quite simple when broken down. This blog post will focus on mean Average Precision, addressing the following:

  • What mean Average Precision is and how it’s used in object detection
  • Precision and Recall: The metrics mAP is based on
  • How to calculate mAP based on precision and recall
  • What mAP score can be considered good: Evaluating object detectors
  • Why mAP can’t be used directly as a loss function

Mean Average Precision: What it is and how it works

We’re going to delve into the nitty-gritty in a moment, but first, let’s develop an intuition for how we would evaluate an object detector. To do this, we must understand what exactly we want our model to do and what we want it not to do. Well, for starters, we would like our model to be able to detect all the objects of interest present in an image, or at least as many as possible. We would also want the bounding boxes drawn by the model to be as accurate as possible, enveloping the object fully but not including much else. We’d like our model to assign class labels to each object correctly (i.e. both detect the cat and say that it’s a cat). Finally, we wouldn’t want our model to “see” objects where there are none.

Now that we have a pretty good idea of what we want our model to do, let’s see how mAP works and how it helps us evaluate models based on the above-mentioned criteria.

Precision and recall

To understand mean Average Precision, one first needs to comprehend the concepts of precision and recall, which in turn requires an understanding of true positives, false positives, and false negatives (the concept of true negatives doesn’t really have a meaningful usage in object detection). In the object detection case, these can be interpreted as follows:

  • True positive: the model has identified the correct type of object in the correct location
  • False positive: the model has identified an object that isn’t there or has assigned a wrong label
  • False negative: the model hasn’t identified an object that it should have
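To make these three outcomes concrete, here is a minimal sketch of how predictions for a single image might be counted. It assumes boxes are given as `(x1, y1, x2, y2)` tuples and that "correct location" means an IoU of at least 0.5 with an unmatched ground-truth box of the same label. The function names and the greedy matching order are illustrative assumptions, not the procedure of any particular benchmark (real evaluators usually process predictions in order of confidence).

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_outcomes(predictions, ground_truths, iou_threshold=0.5):
    """Return (true positives, false positives, false negatives) for one image.

    Both arguments are lists of (label, box) pairs.
    """
    matched = set()  # indices of ground-truth boxes already claimed
    tp = fp = 0
    for label, box in predictions:
        best_iou, best_idx = 0.0, None
        for idx, (gt_label, gt_box) in enumerate(ground_truths):
            if idx in matched or gt_label != label:
                continue
            overlap = iou(box, gt_box)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)  # right label, right place
            tp += 1
        else:
            fp += 1  # nothing there, wrong label, or a duplicate detection
    fn = len(ground_truths) - len(matched)  # objects the model missed
    return tp, fp, fn


ground_truths = [("cat", (0, 0, 10, 10)), ("dog", (20, 20, 30, 30))]
predictions = [
    ("cat", (1, 1, 10, 10)),    # overlaps the cat well: true positive
    ("dog", (50, 50, 60, 60)),  # no dog there: false positive
    ("cat", (0, 0, 10, 10)),    # the cat is already matched: false positive
]
outcomes = count_outcomes(predictions, ground_truths)
print(outcomes)  # (1, 2, 1)
```

The dog in the ground truth is never matched, so it shows up as the single false negative.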

Once our model makes predictions, we go over each of the predictions and assign it one of the above-mentioned labels. Based on this, we can calculate precision and recall. Below are the formulas for both, followed by an explanation.

Precision = True positives / (True positives + False positives)

Recall = True positives / (True positives + False negatives)

While the formulas may at first seem random, they’re quite intuitive. Simply put, precision and recall answer the following questions:

  • Precision: out of all the predictions the model has made for a given class, how many are correct (expressed as a fraction)?
  • Recall: out of all the examples of a given class in the data, how many has the model found (again expressed as a fraction)?

These two metrics, thus, help us form a fairly coherent picture of the model’s performance. In fact, they aren’t bound to object detection or even computer vision and are used in a wide variety of scenarios.
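As a quick illustration, the two formulas translate almost directly into code. This is a minimal sketch with made-up counts; returning 0.0 when the denominator is zero is a convention assumed here, not something the formulas prescribe.

```python
def precision(tp, fp):
    """Fraction of the model's predictions that are correct."""
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0


def recall(tp, fn):
    """Fraction of the real objects that the model found."""
    return tp / (tp + fn) if (tp + fn) > 0 else 0.0


# Hypothetical counts for one class: 8 correct detections,
# 2 spurious detections, and 4 objects that were missed.
p = precision(tp=8, fp=2)
r = recall(tp=8, fn=4)
print(round(p, 3), round(r, 3))  # 0.8 0.667
```

With these counts the model is right 80% of the time when it predicts the class, but it only finds two thirds of the objects that are actually there.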

Calculating mAP from precision and recall

Once we have precision and recall, calculating mean Average Precision is pretty straightforward. If one recalls (pun intended) the concept of Intersection over Union, they will remember that the minimum IoU threshold for a prediction to count can be adjusted (it can be any number between 0 and 1, where 0 means no overlap at all, while 1 means full overlap without anything extra). Object detectors also attach a confidence score to every prediction. Taking this into account, when calculating mAP, we fix an IoU threshold, sweep the confidence cutoff from high to low, calculate precision and recall at each cutoff, and plot them on a chart, very creatively called the precision-recall curve. Some benchmarks, such as COCO, repeat this for several IoU thresholds and average the results.

[Figure: precision-recall curves for an object detector that detects bowls, coffee mugs, and soda cans]
Image source: Paper on sparse distance learning for object recognition combining RGB and depth information

This is the precision-recall curve for an object detector that detects bowls, coffee mugs, and soda cans. To calculate the Average Precision for each class, all we need to do is calculate the area under its respective curve (e.g., the purple one for the coffee mug). Then, to calculate the mean Average Precision, we just calculate the mean of these areas. Intuitively, this answers the following question:

“If I take into account the potential trade-off between precision and recall, how good is my model on average (on all classes, not just one)?” This explains why mean Average Precision is so commonly used as a metric in object detection: it is a robust and relatively comprehensive way to measure the elusive concept of correctness or goodness in object detection.
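The area-under-the-curve idea can be sketched in a few lines. The example below assumes that each detection of a class comes with a confidence score and has already been marked as a true or false positive at some fixed IoU threshold; the curve is traced by walking down the detections in order of confidence. It uses a plain rectangle sum for the area, whereas benchmarks such as Pascal VOC and COCO use interpolated variants, so treat it as an illustration rather than a reference implementation. All names and numbers are made up.

```python
def average_precision(scores, is_true_positive, num_ground_truths):
    """Area under the precision-recall curve for one class (rectangle sum)."""
    if num_ground_truths == 0:
        return 0.0
    # Walk through detections from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    area = 0.0
    prev_recall = 0.0
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recall = tp / num_ground_truths
        precision = tp / (tp + fp)
        # Each step to the right on the recall axis adds a rectangle.
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area


def mean_average_precision(ap_per_class):
    """Mean of the per-class Average Precision values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


ap_per_class = {
    # 4 bowls in the data; the second most confident detection is wrong.
    "bowl": average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, True], 4),
    # 2 mugs in the data; both detections are correct.
    "coffee mug": average_precision([0.9, 0.5], [True, True], 2),
}
map_score = mean_average_precision(ap_per_class)
print(round(ap_per_class["bowl"], 4))  # 0.6042
print(round(map_score, 4))             # 0.8021
```

Note how the single false positive near the top of the bowl ranking drags that class’s AP down, and how the unfound fourth bowl caps its recall at 0.75.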

Evaluating object detectors: So, what’s a good mAP score?

We’ve reached a point in the article where the reader may be thinking, “I get how mAP works and why we use it, but how do I actually interpret an mAP score?” Well, one way to look at it is that the closer your model’s mAP score is to 1 (or 100), the better. However, since an mAP score of 1 would mean that your model has perfect precision and recall across all classes with all IoU thresholds considered, it isn’t feasible in practice. A more realistic way to evaluate a model, especially if you’re using an open-source dataset, would be to take a look at the state of the art. There are a bunch of different object detection challenges, the most famous one being the COCO challenge. At the time of writing, the state of the art there has an mAP of 63.3, with only a handful of papers having achieved an mAP of over 60. This gives us some much-needed perspective on how good we can actually expect a model to be. However, these numbers, of course, depend highly on the dataset in question, the number of classes, and a myriad of other factors.

Side note: On the interchangeability of Average Precision and mean Average Precision

This blog post talks about Average Precision and mean Average Precision as two separate concepts, which is how they’re mostly talked about in the community and in the literature. However, the two terms are sometimes used interchangeably to refer to the idea behind mean Average Precision. For instance, the very same COCO challenge uses the term Average Precision when it is, in fact, talking about mean Average Precision. While this isn’t very common, it’s nevertheless useful to be aware of it to avoid confusion.

Bonus: Why mAP can’t be used directly as a loss function

This blog post talked at length about how mAP is a good way to evaluate the performance of object detectors. This might give the reader the wrong idea that mAP can be used directly in the training of neural networks. After all, what are loss functions if not a signal that tells the model how well it’s doing? However, mAP cannot be used directly as a loss function; it is instead a tool for humans to evaluate the performance of networks. The reason is quite simple: the way we calculate mAP isn’t differentiable. This is a deal-breaker since the optimization of neural networks relies heavily on differentiation.
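A small numerical experiment can make this tangible. The sketch below uses a simplified, rank-based Average Precision for a single class (an assumption made for brevity; the function name and the scores are made up). Nudging a confidence score slightly leaves the metric exactly unchanged, so the gradient is zero, while a change large enough to reorder the detections makes the metric jump.

```python
def rank_based_ap(scores, is_true_positive, num_ground_truths):
    """Simplified AP: average of the precision at each true positive's rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if is_true_positive[i]:
            tp += 1
            total += tp / rank  # precision at this rank
    return total / num_ground_truths


is_tp = [True, False, True]

base = rank_based_ap([0.9, 0.8, 0.7], is_tp, num_ground_truths=2)
nudged = rank_based_ap([0.9, 0.8, 0.7001], is_tp, num_ground_truths=2)
swapped = rank_based_ap([0.9, 0.8, 0.81], is_tp, num_ground_truths=2)

# Finite-difference estimate of the slope with respect to the third score.
slope = (nudged - base) / 0.0001

print(round(base, 4), round(nudged, 4), slope)  # 0.8333 0.8333 0.0
print(swapped)                                  # 1.0
```

The metric only depends on the ordering of the detections, so it is flat almost everywhere and changes in steps. A gradient-based optimizer gets no useful signal from such a function, which is why detectors are trained with differentiable surrogate losses and only evaluated with mAP.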

To recap

As we discussed, mean Average Precision is an evaluation metric often used in object detection because it provides a meaningful estimate of how well the neural network in question is doing. To compute mAP, we calculate precision and recall for each class at different confidence cutoffs (for a given IoU threshold), plot them against each other to get the precision-recall curve, and take the mean of the areas under the per-class curves. While an mAP of 1 (or 100) is theoretically possible, no model has been able to achieve a score even close to it yet, at least not on major, publicly available datasets.

Author: Davit Grigoryan, Independent Machine Learning Engineer
