How Using Image Cleanup Can Improve OCR Accuracy
Jeffrey Hodges, Accusoft Senior Software Engineer
Many factors affect the quality of OCR results from document images. One of the most important is starting with the best quality image possible, because OCR accuracy correlates directly with the quality of the scanned document. Although OCR is usually performed on a black-and-white (binary) image, it is better to scan the document at a bit depth of 8 bits or higher. The greater bit depth is useful for many of the image processing steps needed to clean up scanner artifacts, which include light and dark specks, skew, warp, and border artifacts.
Eliminate Border Artifacts
Scanned images always have some artifacts that degrade the quality of the document. Pages are almost never exactly aligned within the scanner, and one effect of this is a border line added to the image. This border lies outside the original page being scanned but is included in the scanned document. The same thing happens when the page is smaller than the scanner surface. Because these border effects are not part of the original page, the document should be clipped to remove them. Otherwise, these regions may yield erroneous data during OCR, increasing recognition errors.
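As an illustration of the clipping step, here is a minimal sketch in plain Python. It assumes the border shows up as mostly dark rows and columns at the edges of an 8-bit grayscale scan. The function name crop_dark_border and both threshold parameters are hypothetical choices made for this example, not part of any particular product.

```python
def crop_dark_border(image, dark_below=128, border_fraction=0.5):
    """Trim edge rows and columns that are mostly dark.

    image: list of equal-length rows of 8-bit grayscale values (0 = black).
    A row or column is treated as border when more than `border_fraction`
    of its pixels are darker than `dark_below`. Both defaults are
    illustrative assumptions, not tuned constants.
    """
    def mostly_dark(pixels):
        dark = sum(1 for p in pixels if p < dark_below)
        return dark > border_fraction * len(pixels)

    # Peel border rows off the top and bottom.
    top, bottom = 0, len(image)
    while top < bottom and mostly_dark(image[top]):
        top += 1
    while bottom > top and mostly_dark(image[bottom - 1]):
        bottom -= 1
    rows = image[top:bottom]

    # Peel border columns off the left and right of the remaining rows.
    left = 0
    right = len(rows[0]) if rows else 0
    while left < right and mostly_dark([r[left] for r in rows]):
        left += 1
    while right > left and mostly_dark([r[right - 1] for r in rows]):
        right -= 1

    return [r[left:right] for r in rows]


W, B = 255, 0
scan = [
    [B, B, B, B, B, B],  # dark line along the top edge
    [B, W, W, W, W, W],  # dark line down the left edge
    [B, W, B, W, W, W],  # the single B inside the page is "text"
    [B, W, W, W, W, W],
]
cropped = crop_dark_border(scan)
```

The edge lines are removed while the dark pixel inside the page survives, because only rows and columns that are mostly dark count as border. Real scans would need a more tolerant test, since borders are rarely perfectly straight or uniformly dark.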
Correct Skew
Skew is a very common effect that occurs when scanning documents. It almost always needs to be accounted for when performing OCR, otherwise the text can be
Correct Perspective Warp
When images are captured with a camera or phone rather than a flatbed scanner, more distortion occurs. The camera takes in the whole page at once, and there is always some distortion that varies with the viewing angle. Perspective warp correction is required to compensate for this non-linear transformation across the image.
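One common way to model this kind of correction is a projective transform (homography) computed from the four corners of the photographed page. The sketch below is plain Python and rests on assumptions: the corner coordinates are made up, the helper names solve_linear, homography, and warp_point are hypothetical, and finding the corners in the first place is a separate problem not shown here.

```python
def solve_linear(a, b):
    """Solve a*x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate corner points")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography(src, dst):
    """Return coefficients h0..h7 that map each src corner onto its dst corner."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h0*x + h1*y + h2) / (h6*x + h7*y + 1), rearranged to be linear in h
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        # v = (h3*x + h4*y + h5) / (h6*x + h7*y + 1)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    return solve_linear(a, b)


def warp_point(h, x, y):
    """Apply the projective transform to one point."""
    w = h[6] * x + h[7] * y + 1.0
    return ((h[0] * x + h[1] * y + h[2]) / w,
            (h[3] * x + h[4] * y + h[5]) / w)


# Hypothetical corners of a page as seen by a camera (a skewed quadrilateral),
# listed clockwise from top-left, and the upright rectangle they should become.
src = [(12, 8), (95, 15), (102, 130), (5, 120)]
dst = [(0, 0), (85, 0), (85, 110), (0, 110)]
h = homography(src, dst)
```

The division by w is what makes the mapping non-linear: points farther from the camera are stretched more than points close to it. To correct a whole image, one would typically compute the inverse mapping (from the rectangle back to the photo) and sample a source pixel for every output pixel.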
Remove Specks
The most common type of noise is extra specks within the document. These specks can be either light or dark and are most likely to occur when the document is scanned in black and white. Speck removal is the elimination of these small stray marks from the image without removing important pixels. Overaggressive speck removal hurts text recognition accuracy by removing legitimate objects such as periods, the dot above the letter i, or other small marks, while under-removal leaves noise that may be incorrectly recognized as text.
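The size trade-off described above can be sketched with a simple connected-component filter. This is a minimal plain-Python illustration that handles only dark specks on a binary image (1 = black, 0 = white); the function name remove_specks and the max_speck_size parameter are hypothetical, and light specks could be handled the same way on an inverted image.

```python
from collections import deque


def remove_specks(image, max_speck_size=1):
    """Erase dark connected components of at most `max_speck_size` pixels.

    image: list of rows of 0 (white) / 1 (black). Returns a new image.
    Pixels are grouped with 8-connectivity (diagonal neighbors count).
    """
    height = len(image)
    width = len(image[0]) if image else 0
    result = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]

    for y0 in range(height):
        for x0 in range(width):
            if image[y0][x0] != 1 or seen[y0][x0]:
                continue
            # Breadth-first search collects one connected component.
            component, queue = [], deque([(y0, x0)])
            seen[y0][x0] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            if len(component) <= max_speck_size:
                for y, x in component:
                    result[y][x] = 0
    return result


page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],  # two-pixel dot of a letter "i"
    [0, 0, 0, 0, 0, 1],  # one-pixel stray speck
    [0, 1, 1, 0, 0, 0],  # stem of the "i" (six pixels)
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
]
gentle = remove_specks(page, max_speck_size=1)
aggressive = remove_specks(page, max_speck_size=2)
```

With max_speck_size=1 only the stray pixel disappears. Raising it to 2 also erases the dot of the "i", which is exactly the kind of overaggressive removal that damages recognition.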
Binarize the Image
Most OCR is performed on binary images, which allows faster analysis when transforming the scanned document into text data. Scanning the document at a higher bit depth lets advanced image processing improve its quality first. Following this, binarization (the process of intelligent color detection and reduction of the bit depth to 1 bit per pixel) converts the document to a black-and-white image suitable for OCR processing. Choosing the correct binarization algorithm can also smooth the background and flatten color regions.
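To make the bit-depth reduction concrete, the following sketch uses Otsu's method, a widely used global thresholding technique, in plain Python. It is one possible algorithm chosen for illustration and is not claimed to be what any specific product uses; the function names otsu_threshold and binarize and the sample pixel values are assumptions.

```python
def otsu_threshold(pixels):
    """Return the 8-bit threshold t that maximizes between-class variance.

    Pixels with value <= t form the dark class, the rest the light class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * count for i, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue  # no dark pixels yet
        w_light = total - w_dark
        if w_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def binarize(image):
    """Map an 8-bit grayscale image to 1 bit: 1 = dark (ink), 0 = light (paper)."""
    t = otsu_threshold([p for row in image for p in row])
    return [[1 if p <= t else 0 for p in row] for row in image]


# Hypothetical grayscale patch: paper around 200-220, ink around 30-40.
gray = [
    [210, 205, 35, 200],
    [215, 30, 40, 220],
    [208, 212, 38, 206],
]
bw = binarize(gray)
```

A single global threshold works when ink and paper form two clear clusters, as in this patch. Documents with uneven lighting or colored backgrounds generally need an adaptive (locally computed) threshold instead, which is one reason the choice of algorithm matters.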
Accusoft's ScanFix Xpress SDK provides advanced document image processing to automatically clean up and improve document quality. Automatic image cleanup processes within ScanFix Xpress yield improved accuracy of subsequent OCR processing. These cleanup processes also improve forms processing and intelligent character recognition (ICR).
Jeffrey Hodges is a Senior Software Engineer in the SDK division of Accusoft Corporation. He is an expert in document recognition technologies with over 20 years of experience developing innovative solutions.