Automating OCR annotations through SuperAnnotate

Webinar series: Automating computer vision pipelines: Epoch #2 automation suite for OCR

Despite the massive leap we’ve taken towards digitization, a thick pile of documents towering right next to our laptops, waiting for data entry (sound familiar?), is still a plague no one is prepared for. It is something we can’t afford, given today’s fast-paced business dynamics. And why bother, when there are ready-made solutions like optical character recognition (OCR) to do the job for us?

Much like the end user can automate data entry with OCR software, SuperAnnotate can automate the annotation process needed to build a top-notch OCR model. To avoid confusion, let’s take it one step at a time:

1) To build an OCR model capable of extracting and processing text from digital images or handwritten text, you first need data, and loads of it.

2) The data has to be annotated and fed into the model for training/testing to best serve your purposes.

Throughout model development, annotating enormous amounts of textual data can be mind-numbing and, as the cherry on top, unreasonably time-consuming. This is where SuperAnnotate lends a hand.

In SuperAnnotate’s webinar series on automated computer vision pipelines, the company’s leading AI solution experts covered various methods to automate the most tedious pieces of the annotation process for OCR. Even with those methods built into the pipeline, you’ll still need special tooling and fine-tuning to make the pipeline run in a loop. In this article, we take insights from the webinar hosts and our industry experts and turn them into a reusable automation guide for you.

Note: If you did not get a chance to join our live sessions, you can still request access to the recordings on our website.

Traditional use cases and the evolution of OCR

OCR has come a long way, from merely matching text on an image with a digital database of text to drawing insights from recognized trends and patterns. Some of the traditional and more familiar use cases of OCR cover the following:

Scanning a document and converting paper-based data into a format editable by a word processor
Indexing print material
Mobile deposit
Other use cases in insurance, banking, finance, retail, and healthcare

There’s a shift happening right now: we are no longer interested solely in extracting information, but also in gaining insights by going a step further with natural language processing (NLP). The insight extraction mentioned above is largely attributed to NLP, as it allows us to process and understand the text that OCR produces. NLP, however, deserves a spotlight of its own.

[Figure: optical character recognition prediction]

And it all begins with…

We often consider annotation and modeling paramount to computer vision. In OCR, however, pre-processing carries the day, and for good reason: without proper pre-processing in place, you’re creating room for error in your model’s future predictions. For example, if the areas of interest in an image are not defined precisely at the pre-processing stage (and those areas include the blank space, not just the text itself), your model won’t deliver the expected outcomes when deployed.

That being said, we want the pre-processed data to be of superior quality, with inch-perfect annotations and all possible edge cases accounted for. We also want to remove any characteristics from the data that might hinder the system, for which we can perform the following tasks:

Binarization: This is the process of converting color (RGB) or grayscale images into black-and-white, so that every pixel is either black or white.

Skew correction: Skewed images are not rare, whether they come from scanners or from cameras. Data pre-processing therefore has to involve detecting tilted content and correcting it so that the text appears consistently horizontal and legible to the model.

Noise removal: To get the highest possible source image quality, the image has to be stripped of noise and distortions.

Format conversion: This relates to the file format, for example, converting PNG to JPG.
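Two of the tasks above, noise removal and binarization, can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production pipeline. It assumes a grayscale image stored as a list of rows of 0-255 integers, uses a 3x3 median filter for noise removal, and picks the binarization threshold with Otsu’s method (one common choice, not something the webinar prescribes). All function names are hypothetical.

```python
from statistics import median_low


def median_filter(gray, radius=1):
    """Replace each pixel with the median of its neighborhood (3x3 by default)."""
    height, width = len(gray), len(gray[0])
    filtered = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the neighborhood, clipped at the image borders.
            window = [
                gray[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            row.append(median_low(window))
        filtered.append(row)
    return filtered


def otsu_threshold(gray):
    """Return the gray level that best separates dark and light pixels (Otsu)."""
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += hist[level]
        sum_dark += level * hist[level]
        if weight_dark == 0:
            continue  # no pixels at or below this level yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        # Between-class variance: larger means a cleaner split.
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(gray, threshold=None):
    """Map pixels above the threshold to white (255) and the rest to black (0)."""
    if threshold is None:
        threshold = otsu_threshold(gray)
    return [[255 if value > threshold else 0 for value in row] for row in gray]
```

A typical call order would be `binarize(median_filter(gray))`, so that isolated specks are smoothed away before the threshold is chosen.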

A new generation of models based on neural networks eliminates the need for data pre-processing. However, these models are also very expensive and hard to deploy.

OCR operating principle

OCR in most ML systems mimics how our brains work. We use our brains to understand and process what we read through our eyes, and OCR works in much the same way.

Annotation/localization

First, we tell our system where the items are. In OCR, this is done in the annotation step, more specifically through named entity annotation. We can also call this localization.

Named entity annotation is when we isolate text components, localizing exactly where they are. This answers the question of where, and we can do it at the line, word, or character level.
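To make the different levels concrete, here is a small sketch of how word-level boxes could be rolled up into line-level boxes. It is an illustration under assumptions, not how any particular tool does it: the `Box` class, the `level` field, and the 50% vertical-overlap rule are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float
    level: str = "word"  # hypothetical label: "line", "word", or "character"


def vertical_overlap(a, b):
    """Fraction of the shorter box's height that the two boxes share."""
    shared = min(a.y2, b.y2) - max(a.y1, b.y1)
    shorter = min(a.y2 - a.y1, b.y2 - b.y1)
    return max(0.0, shared) / shorter if shorter > 0 else 0.0


def group_words_into_lines(words, min_overlap=0.5):
    """Merge word boxes that sit on the same text line into one line box each."""
    lines = []
    for word in sorted(words, key=lambda b: (b.y1, b.x1)):
        for line in lines:
            # A word joins a line if it overlaps vertically with its last word.
            if vertical_overlap(line[-1], word) >= min_overlap:
                line.append(word)
                break
        else:
            lines.append([word])
    return [
        Box(
            min(b.x1 for b in line),
            min(b.y1 for b in line),
            max(b.x2 for b in line),
            max(b.y2 for b in line),
            level="line",
        )
        for line in lines
    ]


words = [Box(10, 10, 50, 30), Box(60, 12, 100, 32), Box(10, 50, 80, 70)]
line_boxes = group_words_into_lines(words)  # two line-level boxes
```

Going the other way, from lines down to words or characters, needs the image content itself, which is why the level is usually decided when the annotation project is set up.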

Classification

Next, we have to determine what, which is done in the classification step. In this step, we take the annotations of words or groups of words from localization and extract the underlying text.

Once the text has been localized with a bounding box, the document is cropped to that area. The cropped region is analyzed for contours; in other words, the dark pixels are identified as features and the light pixels as background. The contours are then fed into a pattern recognition or feature detection algorithm to determine the alphabetic letter(s) or numeric digit(s). For successful classification, the localization must be done properly, because it determines the cropped regions of interest from which the contours are extracted. Proper pre-processing also allows the model to better detect contours in the region of interest.
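The dependence of classification on localization can be shown with a toy example. The sketch below crops a region from a grayscale image (a list of rows of 0-255 integers) and marks dark pixels as features. A plain dark-pixel mask stands in for real contour extraction here, and the function names and the threshold of 128 are assumptions made for illustration.

```python
def crop(gray, box):
    """Cut the region (x1, y1, x2, y2) out of the image; x2 and y2 are exclusive."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in gray[y1:y2]]


def feature_mask(region, threshold=128):
    """Mark dark pixels as features (1) and light pixels as background (0)."""
    return [[1 if value < threshold else 0 for value in row] for row in region]


def feature_count(mask):
    """Number of feature pixels available to the recognition step."""
    return sum(sum(row) for row in mask)


W, K = 255, 0  # white background, black ink
page = [
    [W, W, W, W, W, W],
    [W, K, W, W, W, W],
    [W, K, W, W, W, W],
    [W, K, K, K, W, W],
    [W, W, W, W, W, W],
]  # a tiny letter "L"

good_box = (1, 1, 4, 4)  # covers the whole glyph
bad_box = (2, 1, 5, 4)   # shifted one pixel right, misses the vertical stroke

good_mask = feature_mask(crop(page, good_box))
bad_mask = feature_mask(crop(page, bad_box))
```

With the well-placed box the mask keeps all five ink pixels of the glyph, while the shifted box keeps only two, so whatever recognizer runs next no longer sees an "L".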

Both steps can be automated to speed up the process and make your OCR system more efficient.

Automating OCR through SuperAnnotate

Having quality annotations is crucial, and this is where you can benefit from having a platform like SuperAnnotate. At SuperAnnotate, we utilize our Vector Editor for OCR use cases:

1) First, we place bounding boxes for the named entity recognition.

2) Then, we extract the text using free text attributes for classification.

So, now that we know OCR from a theoretical perspective, the question remains: how do we automate this?

Placing a bounding box may not seem time-consuming, but typing out the text (when you have to transcribe the body of the text) takes quite a bit of time. Besides, manual typing has its own challenges, as there’s a chance of mistyping and other errors.

To minimize the error rate and streamline the process, we’ve prepared a tutorial on how to use an OCR model to automate text annotations. We take bounding box annotations labeled on SuperAnnotate’s platform, run text predictions on them, and merge the predictions into the annotations. The predictions come from an EasyOCR model. You can access the tutorial here.
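The general shape of that workflow might look like the sketch below. It is not the tutorial’s code. The annotation layout (an `instances` list whose `bbox` entries carry `points` with `x1`, `y1`, `x2`, `y2`) and the `text` field are simplified assumptions for illustration rather than SuperAnnotate’s export schema, and both function names are hypothetical. The EasyOCR calls used are `easyocr.Reader` and `Reader.readtext` with `detail=0`, which returns only the recognized strings.

```python
def merge_predictions(annotation, recognize):
    """Return a copy of the annotation with predicted text added to each bbox.

    `recognize` is any function that maps an integer (x1, y1, x2, y2) box to a string.
    """
    merged = {**annotation, "instances": []}
    for instance in annotation["instances"]:
        updated = dict(instance)
        if instance.get("type") == "bbox":
            p = instance["points"]
            box = (
                int(min(p["x1"], p["x2"])),
                int(min(p["y1"], p["y2"])),
                int(max(p["x1"], p["x2"])),
                int(max(p["y1"], p["y2"])),
            )
            # "text" is an assumed field name for the free-text attribute.
            updated["text"] = recognize(box)
        merged["instances"].append(updated)
    return merged


def make_easyocr_recognizer(image, languages=("en",)):
    """Build a recognize(box) function backed by EasyOCR.

    `image` is assumed to be a NumPy array (H x W or H x W x 3).
    """
    import easyocr  # imported lazily so the merge logic runs without it

    reader = easyocr.Reader(list(languages))

    def recognize(box):
        x1, y1, x2, y2 = box
        # Run OCR on the cropped box only; detail=0 returns a list of strings.
        texts = reader.readtext(image[y1:y2, x1:x2], detail=0)
        return " ".join(texts)

    return recognize
```

Assuming `image` holds the page as a NumPy array and `annotation` is the loaded annotation, the call would be `merge_predictions(annotation, make_easyocr_recognizer(image))`. Keeping the recognizer behind a plain function also makes it easy to swap in a different OCR model, or a stub for testing, and leaves annotators to review and correct the predicted text instead of typing it from scratch.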

But why SuperAnnotate vs. other platforms?

There are three significant areas where SuperAnnotate stands out among its competitors:

DataOps platform

With our advanced DataOps platform, you can automate labeling, get unparalleled control of the annotation workflow, quickly spot quality issues in your data, and seamlessly integrate into your existing pipeline. Here is a rundown of our core advantages:

ML data pipeline and annotation platform
Robust QA, project management, data management
Deep integration and customization via SDK
AI automation, 10x faster annotations

DataOps services

We’ve screened over 350 annotation service companies worldwide and selected the top 35 to be in our marketplace. Our hand-picked teams of professionals are all trained on our software, so they can flawlessly leverage the platform features. SuperAnnotate’s marketplace gives you the following:

Comparison between the best annotation teams for any project.
All teams leverage SuperAnnotate to deliver maximum speed and the highest quality.
Fully managed marketplace (no crowdsourcing).

MLOps success program

We bring excellent DevOps and machine learning expertise to serve as an extension of your existing data engineering unit. We’re not only supporting your efforts in labeling datasets but also setting up and improving data pipelines. Our dedicated account teams include ML engineers, annotation experts, DevOps engineers, and project managers. From day one, we provide white-glove service, where we institute our best practices to help you scale your data practice. So, by choosing SuperAnnotate’s platform, you receive:

Best-in-class customer success and PM
ML, annotation, DevOps, pipeline support
Software and services all in one place

Wrap up

We hope this article provides you with hands-on tips and knowledge to automate your OCR project and accelerate your annotation pipeline without compromising on quality. With a real-life use case of production ML, we showcased how such tooling can boost the scalability of annotation processes, making transitions between dataset creation and model training smooth. By now, you should have an understanding of how SuperAnnotate can support your OCR automation efforts.