Improve OCR Performance and Accuracy Using Image Dropout

Forms dropout is a vital tool for document processing applications. In most cases, dropout involves removing pre-printed content from a scanned image, such as graphics, logos, picture areas, or form field lines, leaving behind only the data that was added to the form. With the entered text isolated from the document template, an optical character recognition (OCR) engine can read it much more accurately. In cases where filled data and pre-printed template content overlap, the dropout engine reconstructs the text as accurately as possible.

Although most OCR engines can process an entire document at once, it’s often easier to break the image up into a set of regions and recognize only the areas containing form text, because the desired data may be only a small fraction of all the text within the image. Dropping out all elements of the document template speeds up the recognition process and reduces the likelihood of errors. De-skewing and pre-processing the image to remove noise also improve OCR performance.
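The region-based approach described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the image is a list of pixel rows (1 for ink, 0 for background), and the field names and boxes are hypothetical values that would normally come from the form template. It does not use FormSuite’s API.

```python
from typing import Dict, List, Tuple

Image = List[List[int]]          # rows of pixels; 1 = ink, 0 = background
Box = Tuple[int, int, int, int]  # (left, top, width, height) in pixels


def crop(image: Image, box: Box) -> Image:
    """Return the rectangular sub-image described by box."""
    left, top, width, height = box
    return [row[left:left + width] for row in image[top:top + height]]


def extract_fields(image: Image, fields: Dict[str, Box]) -> Dict[str, Image]:
    """Crop every named field so that OCR only sees the filled-in areas."""
    return {name: crop(image, box) for name, box in fields.items()}


# A tiny 4x6 "page" with ink in two separate areas.
page = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 0],
]

# Hypothetical field layout for this page.
field_boxes = {"name": (0, 0, 2, 2), "date": (4, 2, 2, 2)}
regions = extract_fields(page, field_boxes)
```

Each cropped region could then be passed to the OCR engine on its own, so the engine never spends time on the rest of the page.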

4 Easy Steps for Using Form Dropout in Document Processing

Accusoft’s FormSuite for Structured Forms provides a variety of OCR tools that allow applications to read and convert machine-printed characters into editable text. In addition to recognizing multiple languages, it can also use intelligent character recognition (ICR) to read hand-printed form text and optical mark recognition (OMR) to read fillable marks like multiple choice bubbles or check boxes.

By utilizing image dropout to isolate text data, FormSuite can greatly improve OCR performance, reading forms at a higher rate and with a higher degree of accuracy. Applying image dropout is a straightforward process that can be completed in the four steps below.

1. Identify Form Templates

Forms must first be properly identified to distinguish which elements need to be dropped out of the image. In most cases, organizations with automated forms processing workflows utilize a variety of form templates. Even if they only use one specialized form type, they may have multiple variations of that form. It’s also possible to process just a portion of a document.

FormSuite’s identification capabilities allow it to quickly scan and identify documents within an application provided a matching template exists. If the image is unknown, it will attempt to match the form to a template in the form set. Identification thresholds set the minimum confidence required for a document to be considered a match. Higher values are especially helpful for an organization that uses many similar forms since the likelihood of an incorrect match is higher. Raising FormSuite’s Identification Quality setting will produce more accurate matches, but also requires more time and processing power since the recognition engine is working harder and comparing more individual document objects to known form templates. Lower settings may be more efficient when processing documents with minimal form content. FormSuite can also rotate unknown images up to 270 degrees prior to checking for a potential match.
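To make the idea of an identification threshold concrete, here is a minimal sketch in Python. It is not FormSuite’s identification algorithm or API. The scoring function (the share of a template’s ink that also appears in the scan), the function names, and the sample forms are all assumptions chosen for illustration, and a real engine compares far richer document features.

```python
from typing import Dict, List, Optional, Set, Tuple

Image = List[List[int]]   # rows of pixels; 1 = ink, 0 = background
Pixel = Tuple[int, int]   # (row, col)


def ink_pixels(image: Image) -> Set[Pixel]:
    """Collect the coordinates of every ink pixel."""
    return {(r, c) for r, row in enumerate(image) for c, v in enumerate(row) if v}


def template_coverage(template: Image, scan: Image) -> float:
    """Toy confidence score: share of the template's ink also found in the scan."""
    template_ink = ink_pixels(template)
    if not template_ink:
        return 0.0
    return len(template_ink & ink_pixels(scan)) / len(template_ink)


def identify(scan: Image, templates: Dict[str, Image],
             min_confidence: float) -> Optional[str]:
    """Return the best-matching template name, or None if below the threshold."""
    if not templates:
        return None
    scores = {name: template_coverage(t, scan) for name, t in templates.items()}
    best = max(scores, key=lambda name: scores[name])
    return best if scores[best] >= min_confidence else None


# Two hypothetical templates: one with a rule along the top, one along the left.
form_set = {
    "form_a": [[1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0]],
    "form_b": [[1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]],
}
filled_scan = [[1, 1, 1, 1], [0, 1, 1, 0], [0, 0, 0, 0]]   # form_a plus entered text
unknown_scan = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 0]]  # no template content

match = identify(filled_scan, form_set, min_confidence=0.8)
no_match = identify(unknown_scan, form_set, min_confidence=0.8)
```

Raising min_confidence in this sketch makes a wrong match less likely, at the cost of rejecting more borderline scans. That is the same trade-off the threshold setting controls.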

2. Register the Image

After the correct form is identified, the document needs to be mapped to the template image on record. This process is called registration and involves transforming two or more sets of image data into a single coordinate system. Accurate image registration ensures that the document image aligns properly with the template. Many dropout problems can be traced back to poor registration. FormSuite allows users to adjust the value of allowable mis-registration to help improve performance.

The transformation process is resolution independent and requires all parts of the document to be properly aligned. It’s important to minimize false positives, where filled-in text is mistaken for template content; accidentally removing text data from the document can render sections of OCR data unreadable.
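As a rough illustration of registration, the Python sketch below searches for the offset that best lines a scan up with its template. It is a deliberately simplified stand-in and not FormSuite’s method: it only handles whole-pixel shifts (no rotation or scaling), and the names register, shift_into_template_frame, and max_shift are hypothetical. Here max_shift plays the role of the allowable mis-registration.

```python
from typing import List, Set, Tuple

Image = List[List[int]]   # rows of pixels; 1 = ink, 0 = background
Pixel = Tuple[int, int]   # (row, col)


def ink_pixels(image: Image) -> Set[Pixel]:
    """Collect the coordinates of every ink pixel."""
    return {(r, c) for r, row in enumerate(image) for c, v in enumerate(row) if v}


def register(template: Image, scan: Image, max_shift: int) -> Tuple[int, int]:
    """Find the (row, col) shift of the scan that overlaps the template best."""
    template_ink = ink_pixels(template)
    scan_ink = ink_pixels(scan)
    candidates = [(dr, dc)
                  for dr in range(-max_shift, max_shift + 1)
                  for dc in range(-max_shift, max_shift + 1)]
    # Try small shifts first so that ties resolve to the smallest correction.
    candidates.sort(key=lambda o: abs(o[0]) + abs(o[1]))
    best_offset, best_overlap = (0, 0), -1
    for dr, dc in candidates:
        overlap = sum((r + dr, c + dc) in scan_ink for r, c in template_ink)
        if overlap > best_overlap:
            best_offset, best_overlap = (dr, dc), overlap
    return best_offset


def shift_into_template_frame(scan: Image, offset: Tuple[int, int]) -> Image:
    """Move the scan's ink back by the offset so it shares the template's coordinates."""
    dr, dc = offset
    rows, cols = len(scan), len(scan[0])
    out = [[0] * cols for _ in range(rows)]
    for r, c in ink_pixels(scan):
        nr, nc = r - dr, c - dc
        if 0 <= nr < rows and 0 <= nc < cols:
            out[nr][nc] = 1
    return out


template = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
# The same form scanned one pixel down and one pixel right, with one filled-in pixel.
scan = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0, 0],
]

offset = register(template, scan, max_shift=2)
aligned = shift_into_template_frame(scan, offset)
```

Once the scan shares the template’s coordinate system, template pixels can be removed without touching text that merely sits near them.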

3. Drop Out Template Content

When utilizing image dropout techniques, it’s important to identify cases where portions of the entered text have been removed along with the template content. This can manifest as missing strokes in letters and may introduce errors into the recognized OCR data.

Consider the following example:

When dropping out the template form, graphic lines that bisect descender characters can cause issues. Without intelligent stroke reconstruction, the text below the dropout area may be affected because the OCR process may not correctly associate all of a character’s image data. In a case like this, recognition errors may include ‘g’ read as ‘a’, ‘j’ read as ‘i’, and ‘y’ read as ‘v’.

In cases like this one, advanced intelligent reconstruction techniques need to be deployed to correct the text for accurate OCR reading. FormSuite for Structured Forms reconstructs missing data and characters damaged during the form dropout process with the FormFix SDK, which is built into its core functionality.
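The effect of stroke reconstruction can be shown with a small Python sketch. This is not the FormFix algorithm. It assumes a registered binary image and uses one simple heuristic that is purely an assumption for illustration: a removed template pixel is restored when entered ink sits directly above and below the template line in the same column, which suggests a stroke such as a descender crossed the line.

```python
from typing import List

Image = List[List[int]]   # rows of pixels; 1 = ink, 0 = background


def crosses_line(kept: Image, template: Image, r: int, c: int) -> bool:
    """True if entered ink lies just above and just below the template run at (r, c)."""
    rows = len(kept)
    up = r - 1
    while up >= 0 and template[up][c]:
        up -= 1
    down = r + 1
    while down < rows and template[down][c]:
        down += 1
    return up >= 0 and down < rows and bool(kept[up][c]) and bool(kept[down][c])


def dropout(scan: Image, template: Image, reconstruct: bool = True) -> Image:
    """Remove template ink from the scan, optionally repairing crossed strokes."""
    rows, cols = len(scan), len(scan[0])
    # Entered data only: ink in the scan that is not part of the template.
    kept = [[1 if scan[r][c] and not template[r][c] else 0 for c in range(cols)]
            for r in range(rows)]
    if not reconstruct:
        return kept
    result = [row[:] for row in kept]
    for r in range(rows):
        for c in range(cols):
            if template[r][c] and scan[r][c] and crosses_line(kept, template, r, c):
                result[r][c] = 1   # restore the part of the stroke hidden by the line
    return result


form_line = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],   # pre-printed field line
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
filled = [
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],   # a vertical stroke crosses the field line
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]

broken = dropout(filled, form_line, reconstruct=False)
repaired = dropout(filled, form_line)
```

In broken, the stroke is cut in two at the field line, which is the kind of damage that makes a descender look like a different letter. In repaired, the stroke is continuous again while the rest of the line is still gone.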

4. Perform a Secondary Alignment

After the form image is clipped out, it’s a good idea to perform a secondary alignment at the field level. It can be difficult to align text properly when the entire image is still visible. By aligning smaller field areas after form dropout, the OCR engine will be able to recognize text more accurately, especially if there is significant image distortion in the filled image as compared to the form template image.

Boost OCR Performance with FormSuite for Structured Forms

By applying intelligent image dropout, document recognition can be both faster and more accurate. FormSuite for Structured Forms is a comprehensive solution for integrating powerful forms processing into your application. With robust dropout and imaging capabilities, FormSuite excels at preparing documents and images of all formats for OCR reading. Learn how you can improve OCR performance for forms processing by downloading a fully featured trial of FormSuite for Structured Forms.