In recent years, we have witnessed major advances in AI, which have brought computer vision (CV) into more and more real-life applications. Hype around autonomous vehicles aside, the autopilot system is indeed one of the major accomplishments in ML and a preview of what is to come. Self-driving cars are expected to become commonplace within the next 10 years. Chris Gerdes, a professor of Mechanical Engineering at Stanford University and co-director of the Center for Automotive Research at Stanford, is confident that “we can soon give cars the skills of the very best human drivers, and maybe even better than that.” Yet the growing demand also brings new challenges for CV in autonomous vehicles. In this blog post, we will cover the following:

  • Gathering the training data
  • Data labeling
  • Object detection for autonomous vehicles
  • Semantic segmentation and instance segmentation
  • Multi-camera vision and depth estimation
  • Key takeaways

Gathering the training data

Cars without a human operator require rigorous pattern recognition and a great deal of computing power to drive independently. One of the main challenges for AI-powered self-driving cars is acquiring training datasets. An AI solution is only as good as the data it is trained on, so quality datasets and pixel-perfect labeling are of immense value to the model.

One of the better ways to collect data for computer vision in autonomous vehicles is to drive around and capture footage, either through semi-autonomous driving or in a simulated environment such as a computer game engine. The model has to go through many iterations over camera-generated images before it can detect objects reliably. Keep in mind that the training process mostly requires images of the objects your CV model must recognize: things that may appear on the road, street signs, road lanes, humans, buildings, other cars, etc.

Each of these elements is labeled with its own annotation type: polylines for lane detection, 3D point annotation for LiDAR data, and so forth. This variety points to the complexity and the vast amounts of data needed to train a model.

Data labeling

Data labeling requires heavy manual labor. For datasets as massive as those used for self-driving cars, labeling depends heavily on human effort to identify unlabeled elements in raw images. At the same time, the labeled data has to be accurate for an ML project to succeed, and maintaining high precision on large-scale projects is especially challenging. A larger workforce brings greater responsibility for keeping communication open and setting up an effective feedback system, so that annotation teams and the members within them work in cohesion. To that end, we recommend writing an annotation guideline that maps out the annotation process and gives concise instructions to prevent mistakes and imbalance.

Cohesion in the annotation process, however, should not be confused with uniformity in the data. A CV model has to make accurate predictions and estimations based on what it “sees” on the road and beyond, which comes down to the need for diverse input data when training a model.

There are several ways to approach data labeling: in-house, outsourcing, or crowdsourcing. Whichever you choose, make sure to set up a robust management process so that your annotation pipeline can scale.

Object detection for autonomous vehicles

Self-driving cars use CV to detect objects. Object detection takes two steps: image classification and object localization.

Image classification is done by training a convolutional neural network (CNN) to recognize and classify objects. The problem with a CNN used this way is that it is not the best solution for images with multiple objects, as the model is likely to miss some of them. This is where sliding windows come into play.

As the window slides over the image, each crop is run through the CNN, which checks whether it resembles any object the model is trained to recognize. If objects are considerably larger or smaller than the window, the model will not detect them. To cover that case, you can slide windows of several sizes over the image or apply the You Only Look Once (YOLO) algorithm. With YOLO, the image is split into a grid of cells and run through the CNN only once. YOLO then makes its predictions based on the probability of each grid cell containing an object, so there is no need for several run-throughs.
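To get a feel for why a single pass is attractive, here is a minimal Python sketch. It counts how many crops a sliding-window detector would have to classify, and it shows how a YOLO-style grid can assign an object to the cell that contains its center. The image size, window sizes, stride, 13x13 grid, and function names are all illustrative assumptions, not values taken from any particular model.

```python
def sliding_windows(image_w, image_h, window, stride):
    """Yield (x, y, w, h) square crops that fit inside the image."""
    for y in range(0, image_h - window + 1, stride):
        for x in range(0, image_w - window + 1, stride):
            yield (x, y, window, window)


def count_classifier_runs(image_w, image_h, window_sizes, stride):
    """Number of crops a sliding-window detector must send through the CNN."""
    return sum(
        sum(1 for _ in sliding_windows(image_w, image_h, w, stride))
        for w in window_sizes
    )


def grid_cell_for(cx, cy, image_w, image_h, grid):
    """Return (row, col) of the grid cell containing the point (cx, cy)."""
    col = min(int(cx / image_w * grid), grid - 1)
    row = min(int(cy / image_h * grid), grid - 1)
    return row, col


# Assumed 416x416 image, three window sizes, stride of 32 pixels.
runs = count_classifier_runs(416, 416, [64, 128, 256], 32)
print(runs)  # 280 separate classifier runs

# With a grid, the image goes through the network once; an object centered
# at (208, 100) simply belongs to one cell of the assumed 13x13 grid.
print(grid_cell_for(208, 100, 416, 416, 13))  # (3, 6)
```

Even in this small setup, the sliding-window approach needs hundreds of classifier runs per image, which is the cost the single-pass grid approach avoids.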

Now, to settle on a single bounding box for each object in an image, we use the non-max suppression (NMS) algorithm. NMS selects the best bounding box for an object based on the highest objectness score and on the overlap, or intersection over union (IoU, calculated by dividing the area of overlap by the area of union), between bounding boxes, while discarding the rest. The objectness score is the probability that an object is present in the bounding box. The selection process is repeated until no further boxes can be removed. In short, NMS keeps the highest-scoring box for each object and suppresses the lower-scoring boxes that overlap it.
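The IoU calculation itself is short enough to write out. The sketch below assumes boxes are given as (x1, y1, x2, y2) corner coordinates in pixels; that box format and the function name are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    # Corners of the overlapping rectangle, if there is one.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Clamp at zero so boxes that do not overlap give an empty intersection.
    intersection = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


# Two 10x10 boxes that share half their width: overlap 50, union 150.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```

An IoU of 1.0 means the boxes coincide, and 0.0 means they do not overlap at all.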

Let’s use the example below to illustrate how NMS works. Suppose you want your model to detect a car and a truck in an image. Here is how you should proceed with box selection using NMS.

  • Pick the box with the highest objectness score
  • Compare the IoU of the selected box with the other boxes
  • Suppress any bounding box whose IoU with the selected box is over 50%
  • Move to the box with the next-highest objectness score
  • Repeat the comparison and suppression steps until no boxes are left to process

The end result, in our case, will be the green boxes with the highest objectness score, which is what you want your CV model to identify.
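The box selection steps above can be sketched in a few lines of Python. This is a simplified, class-agnostic version: the box coordinates and scores are made up for the car and truck example, the (x1, y1, x2, y2) box format is an assumption, and production detectors typically run NMS per class with optimized library routines.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Keep the best box per object; detections is a list of (box, score)."""
    # Sort so the box with the highest objectness score comes first.
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)          # pick the highest-scoring box
        kept.append(best)
        # Suppress every box that overlaps the picked box too much.
        remaining = [
            d for d in remaining if iou(best[0], d[0]) <= iou_threshold
        ]
    return kept


# Hypothetical detections: two boxes on the car, two on the truck.
detections = [
    ((10, 10, 110, 60), 0.90),    # car, best box
    ((12, 12, 112, 62), 0.75),    # car, overlapping duplicate
    ((200, 20, 360, 120), 0.85),  # truck, best box
    ((205, 25, 365, 125), 0.60),  # truck, overlapping duplicate
]
kept = nms(detections)
print([score for _, score in kept])  # [0.9, 0.85]
```

Only one box per vehicle survives, since each duplicate overlaps its higher-scoring neighbor by well over 50%.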

Semantic segmentation and semantic instance segmentation

Although they sound similar, semantic segmentation and semantic instance segmentation pose different challenges for autonomous vehicles. The two are often confused: semantic segmentation assigns a class label to every pixel in an image (truck, van), while semantic instance segmentation also tells apart the individual objects of the same class (car1, car2, car3).
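One way to see the difference is to look at what each output might contain for the same tiny image. In the hedged sketch below, the semantic mask stores a class name per pixel and the instance mask stores an object ID per pixel; the 3x6 image, the use of 0 for "no object", and the helper name are assumptions made for illustration.

```python
# Semantic segmentation: every pixel gets a class, nothing more.
semantic_mask = [
    ["road", "car",  "car",   "road",  "car",  "car"],
    ["road", "car",  "car",   "road",  "car",  "car"],
    ["road", "road", "truck", "truck", "road", "road"],
]

# Instance segmentation: pixels of the same class are split per object.
# 0 means the pixel does not belong to any countable object.
instance_mask = [
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 3, 3, 0, 0],
]


def instances_per_class(semantic, instance):
    """Count distinct object IDs for each class."""
    ids_by_class = {}
    for class_row, id_row in zip(semantic, instance):
        for class_name, object_id in zip(class_row, id_row):
            if object_id != 0:
                ids_by_class.setdefault(class_name, set()).add(object_id)
    return {name: len(ids) for name, ids in ids_by_class.items()}


print(instances_per_class(semantic_mask, instance_mask))
# {'car': 2, 'truck': 1}
```

The semantic mask alone cannot say how many cars are in the scene; the instance mask is what separates car1 from car2.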

The main problems with the two are performance and confusion. Performance can suffer because of sensor limitations. Confusion, on the other hand, can be triggered by a handful of external factors, including lighting and shadows, weather conditions, and so on.

Performance and confusion are essential factors to consider when dealing with bigger datasets, as the neural network will be more prone to generalizing its results. To that end, the variety of the dataset and the number of training iterations are of the utmost significance to your CV project.

Multi-camera vision and depth estimation

Vehicle safety is one of the key metrics for self-driving cars, and it cannot be ensured without proper depth estimation. Knowing the distance between the camera lenses and the exact location of the object in each image helps build a safe system and is a pivotal step towards building a stereo vision system.
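For an idealized stereo setup, the relation between these quantities is compact. The sketch below assumes two identical, rectified pinhole cameras mounted in parallel, in which case depth equals focal length times baseline divided by disparity, where disparity is the horizontal pixel shift of the object between the left and right images. The numbers and the function name are illustrative assumptions.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth in meters for a rectified, parallel stereo pair.

    focal_px:     focal length expressed in pixels
    baseline_m:   distance between the two camera lenses, in meters
    disparity_px: horizontal shift of the object between the two images
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px


# Assumed rig: 1000 px focal length, lenses 0.5 m apart.
# An object shifted by 25 px between the two images is 20 m away.
print(depth_from_disparity(1000, 0.5, 25))  # 20.0
```

Real camera rigs need calibration and rectification before this formula applies, so treat it as the idea behind stereo depth rather than a complete pipeline.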

Perspective distortion

A sufficiently wide distance between the camera lenses supports effective depth estimation. It can, however, also introduce perspective distortion, which you want to avoid in order to make accurate calculations.

Non-parallel representation

Differences in pixel accuracy can affect how the machine calculates distance. This is especially the case with self-driving cars, as their cameras might not deliver images with the same pixel accuracy. Even the slightest difference in pixels can noticeably affect the model's calculations.
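To illustrate how much a small pixel difference can matter, here is a rough Python sketch that measures how far a depth estimate moves when the disparity between the two images is off by one pixel. It assumes the idealized stereo relation (depth = focal length × baseline / disparity) for rectified, parallel cameras; the camera parameters and function name are illustrative assumptions.

```python
def depth_shift_for_pixel_error(focal_px, baseline_m, disparity_px,
                                error_px=1.0):
    """Change in estimated depth (meters) if disparity is too small by error_px."""
    if disparity_px <= error_px:
        raise ValueError("disparity must be larger than the pixel error")
    true_depth = focal_px * baseline_m / disparity_px
    wrong_depth = focal_px * baseline_m / (disparity_px - error_px)
    return wrong_depth - true_depth


# Assumed rig: 1000 px focal length, lenses 0.5 m apart.
near = depth_shift_for_pixel_error(1000, 0.5, 25)  # object about 20 m away
far = depth_shift_for_pixel_error(1000, 0.5, 5)    # object about 100 m away

print(round(near, 2))  # 0.83
print(round(far, 2))   # 25.0
```

Under these assumptions, the same one-pixel error costs less than a meter for a nearby object but about 25 meters for a distant one, which is why small pixel mismatches between cameras deserve attention.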

Key takeaways

AI in self-driving cars is a vast field full of discoveries and rapidly advancing technology. Yet autonomous driving would be impossible without state-of-the-art datasets and robust CV, which in turn call for a consistently expanding workforce and bring their own challenges for your model to overcome.

The main challenges we identified in training a CV model for self-driving cars are dataset gathering, data labeling, object detection, semantic segmentation and semantic instance segmentation, object tracking for the control system and 3D scene analysis, and multi-camera vision and depth estimation. Which one do you think poses the greatest challenge?

SuperAnnotate is helping companies build the next generation of CV products with its end-to-end platform and integrated marketplace of managed annotation service teams. It provides comprehensive annotation tooling, robust collaboration and quality management systems, no-code neural network training and automation, as well as a data review and curation system to successfully develop and scale CV projects.
