Improve OCR Performance and Accuracy Using Image Dropout
Jeffrey Hodges, Accusoft Senior Software Engineer
Image dropout grew out of the need to isolate the text on a form for more accurate data extraction. Documents with graphics, logos, and picture areas make it difficult for many OCR programs to locate the data. The text can sit so close to these elements that normal page segmentation cannot separate and recognize it accurately.
Most OCR programs, such as OCR Xpress, process the whole document or just a set of regions on the image. Only by dropping out the template image can some text data even be identified for recognition. All these processes perform best when images are de-skewed and pre-processed to remove noise.
Processing just the text data is more accurate and faster than processing the whole document and trying to detect the text associated with the form. The desired data may be only a very small fraction of all the text within the image, so running recognition on just that data is much faster.
There are three steps to isolate just the text data for recognition: identifying the form template that matches the document, registering the document to that template, and intelligently dropping out the template from the document to leave only the textual data. Recognition can then produce more accurate output text.
Identifying Form Templates
Form image dropout requires that the form first be properly identified. Usually, the user will have multiple forms to process rather than just one. Even when there's just a single form, there could be multiple variations. It's also possible to process just a portion of the document. When there are multiple possible form templates, identifying the "best" match requires a heuristic algorithm that generates a match score. The algorithm should be able to calculate scores that can distinguish between two similar form templates. When two image templates are suitably different, we can define a new image template from one of these.
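As a rough illustration of choosing the "best" match, the sketch below picks a template from a set of match scores and declines to answer when the top two candidates are too close to tell apart. The function name pick_template, the score range of 0.0 to 1.0, and both thresholds are assumptions made for this example, not part of any Accusoft API.

```python
from typing import Dict, Optional


def pick_template(scores: Dict[str, float],
                  min_score: float = 0.6,
                  min_margin: float = 0.05) -> Optional[str]:
    """Return the best-scoring template name, or None if the match
    is too weak or too close to the runner-up to be trusted.

    scores maps a (hypothetical) template name to a match score in [0, 1].
    """
    if not scores:
        return None
    # Highest score first.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_name, best_score = ranked[0]
    if best_score < min_score:
        return None  # nothing matches well enough
    if len(ranked) > 1 and best_score - ranked[1][1] < min_margin:
        return None  # two similar templates, ambiguous result
    return best_name


clear = pick_template({"invoice_v1": 0.91, "claim_form": 0.42})
ambiguous = pick_template({"invoice_v1": 0.91, "invoice_v2": 0.90})
```

Returning None for an ambiguous result lets the caller route the document to a slower check or to manual review instead of dropping out the wrong template.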
The algorithm needs to be very fast in order to calculate and compare scores for all the form templates, so it can't perform any text recognition during this phase. It should also be independent of scale; the same form should be identifiable regardless of resolution (i.e., scanned at 300, 400, or any other dpi).
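One simple way to get a fast, scale-independent score is to compare coarse ink-density signatures instead of pixels or recognized text. The sketch below is an assumption-laden illustration of that idea (the grid size, the scoring formula, and all function names are invented here), not the algorithm used by any particular product.

```python
from typing import List, Sequence

Image = Sequence[Sequence[int]]  # binary image: 1 = ink, 0 = background


def grid_signature(img: Image, cells: int = 4) -> List[float]:
    """Fraction of ink pixels in each cell of a cells x cells grid.

    Because each value is a ratio, the signature does not depend on
    how many pixels the scan used to cover the page.
    """
    h, w = len(img), len(img[0])
    sig = []
    for gy in range(cells):
        y0, y1 = gy * h // cells, (gy + 1) * h // cells
        for gx in range(cells):
            x0, x1 = gx * w // cells, (gx + 1) * w // cells
            ink = sum(img[y][x] for y in range(y0, y1) for x in range(x0, x1))
            area = max(1, (y1 - y0) * (x1 - x0))
            sig.append(ink / area)
    return sig


def match_score(a: Sequence[float], b: Sequence[float]) -> float:
    """1.0 for identical signatures, lower as the ink layout diverges."""
    return 1.0 - sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def upscale(img: Image, k: int) -> List[List[int]]:
    """Nearest-neighbour upscale by k, standing in for a higher-dpi scan."""
    return [[px for px in row for _ in range(k)] for row in img for _ in range(k)]


# A toy 8x8 "form" with a rule along the top and down the left edge.
form = [[1 if y == 0 or x == 0 else 0 for x in range(8)] for y in range(8)]
# A different toy form with a single rule across the middle.
other = [[1 if y == 4 else 0 for x in range(8)] for y in range(8)]

same_form_score = match_score(grid_signature(form), grid_signature(upscale(form, 2)))
other_form_score = match_score(grid_signature(form), grid_signature(other))
```

No text recognition happens at this stage; the signature costs one pass over the pixels, which keeps comparison against many templates cheap.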
Image registration maps the template image to the document. The transformation is resolution independent, and all parts of the document must be properly aligned. It's important to avoid false positives, where document text is mistaken for part of the image template; accidentally removing text data from the document can render sections of your OCR data unreadable. Perspective also affects image registration. When taking pictures with a camera, for instance, non-affine transformations are often required to correct for perspective warping.
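For flatbed scans, registration can often be approximated by a similarity transform (uniform scale, rotation, and translation), which two matched anchor points are enough to determine. The sketch below assumes the anchor points have already been located by some other means; the function name and the sample coordinates are hypothetical. It does not handle perspective warping, which needs a projective transform estimated from at least four point pairs.

```python
from typing import Callable, Tuple

Point = Tuple[float, float]


def similarity_from_anchors(t1: Point, t2: Point,
                            d1: Point, d2: Point) -> Callable[[Point], Point]:
    """Build a map from template coordinates to document coordinates.

    t1, t2 are two anchor points on the template; d1, d2 are the same
    anchors found on the scanned document. Treating points as complex
    numbers, z -> a*z + b encodes scale and rotation (a) and
    translation (b).
    """
    p1, p2 = complex(*t1), complex(*t2)
    q1, q2 = complex(*d1), complex(*d2)
    if p1 == p2:
        raise ValueError("template anchors must be distinct")
    a = (q2 - q1) / (p2 - p1)
    b = q1 - a * p1

    def to_document(pt: Point) -> Point:
        z = a * complex(*pt) + b
        return (z.real, z.imag)

    return to_document


# Template drawn at one resolution; document scanned at twice that
# resolution and shifted by (10, 20) pixels.
to_doc = similarity_from_anchors((100, 100), (500, 100),
                                 (210, 220), (1010, 220))
field_corner = to_doc((300, 400))
```

Because the scale factor is recovered from the anchors themselves, the same template works for documents scanned at any dpi.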
When employing image dropout techniques, it's important to identify cases where portions of the text have been incorrectly removed along with the image template. This can manifest as missing strokes in letters and may induce errors in the recognized OCR data. Advanced intelligent reconstruction techniques exist to correct for these types of errors, for example in software such as FormFix and ScanFix Xpress.
When dropping out the template form, graphic lines that bisect descender characters can cause issues. Without intelligent stroke reconstruction, the data below the dropout area may be affected, because the OCR process may not correctly associate all the text image data together. In such cases, the recognition errors may include 'g' read as 'a', 'j' read as 'i', and 'y' read as 'v'.
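To make the descender problem concrete, the sketch below removes a registered template from a binary document but keeps any template pixel that has text ink directly above and below it, on the assumption that a stroke passes through the line at that column. This is a deliberately simple stand-in for stroke reconstruction, assuming thin horizontal rules and an already registered template; it is not how FormFix or ScanFix Xpress implement it.

```python
from typing import List

Bitmap = List[List[int]]  # 1 = ink, 0 = background


def dropout(doc: Bitmap, template: Bitmap) -> Bitmap:
    """Remove template ink from doc, preserving pixels where a text
    stroke appears to cross the template line vertically."""
    h, w = len(doc), len(doc[0])

    def text_ink(y: int, x: int) -> bool:
        # Ink that belongs to the document but not to the template.
        return 0 <= y < h and doc[y][x] == 1 and template[y][x] == 0

    out = [row[:] for row in doc]
    for y in range(h):
        for x in range(w):
            if template[y][x] != 1:
                continue
            # Find the vertical extent of the template run in this column.
            top = y
            while top > 0 and template[top - 1][x] == 1:
                top -= 1
            bottom = y
            while bottom < h - 1 and template[bottom + 1][x] == 1:
                bottom += 1
            # Keep the pixel only if text ink touches the run on both sides.
            crossed = text_ink(top - 1, x) and text_ink(bottom + 1, x)
            out[y][x] = 1 if (crossed and doc[y][x] == 1) else 0
    return out


# A horizontal form rule on row 2 ...
template = [[1 if y == 2 else 0 for x in range(5)] for y in range(5)]
# ... and a vertical stroke in column 2, like the tail of a 'y' or 'g'.
stroke = [[1 if x == 2 else 0 for x in range(5)] for y in range(5)]
doc = [[max(template[y][x], stroke[y][x]) for x in range(5)] for y in range(5)]

cleaned = dropout(doc, template)
```

A naive dropout that clears every template pixel would cut the stroke in two at row 2, which is the kind of break that turns a 'y' into a 'v' during recognition.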
By applying intelligent image dropout, document recognition can be both faster and more accurate. ScanFix Xpress and FormFix Xpress provide industry-leading technology for solving these problems and more. FormSuite includes these and other components to provide a complete form processing solution.
Jeffrey Hodges is a Senior Software Engineer in the SDK division of Accusoft Corporation. He is an expert in document recognition technologies with over 20 years of experience developing innovative solutions.