Image-Only PDF in FormAssist
FormAssist, a part of FormSuite, makes it easy for you to automatically extract your data from your forms. At Accusoft, we are currently working on a new update that will support extracting image data directly from compatible PDF documents.
We want our users to be able to process their scanned documents with FormAssist and FormSuite as easily as possible. However, PDF documents produced by scanners vary by vendor, so during development we looked for the simplest set of traits shared by the scanned documents in our test set.
We found that scanned documents had one image per page and no visible text. So to support these documents, we defined what makes up an image-only PDF document. These qualifications include:
- Each page must have exactly one image
- Each page must have no visible text
The page in this image has exactly one image, but it also has a visible text layer, so it is not image-only.
Common scanners, including models from Brother, Canon, Epson, Ricoh, and Xerox, produce PDFs with these traits when scanning documents. Some scanners can produce searchable documents by running the image through an internal OCR engine, then adding a hidden text layer over the characters detected in the image. For our purposes, we chose to ignore invisible text layers, like those in searchable documents, when determining whether a document is image-only.
With these rules in place, we implemented image-only support by extending part of the ImagXpress API used within FormAssist. Additionally, this check is made on a page-by-page basis. A particular page can be loaded from a document individually, and that page will be accepted as image-only (if it meets the criteria) even if the other pages of the document do not.
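The per-page rules above can be sketched in a few lines of Python. This is only an illustration of the criteria, not the ImagXpress implementation: the `Page` and `TextRun` types and the `is_image_only` function are hypothetical stand-ins, and using PDF text rendering mode 3 (text that is neither filled nor stroked) to identify hidden text is an assumption about how invisible text might be recognized.

```python
from dataclasses import dataclass, field
from typing import List

# In the PDF specification, text rendering mode 3 draws text that is
# neither filled nor stroked, which is how hidden OCR layers are commonly stored.
# Treating only mode 3 as "invisible" is a simplifying assumption for this sketch.
INVISIBLE_RENDER_MODE = 3


@dataclass
class TextRun:
    """Hypothetical stand-in for a run of text found on a page."""
    content: str
    render_mode: int = 0  # 0 = ordinary filled (visible) text

    @property
    def is_visible(self) -> bool:
        return self.render_mode != INVISIBLE_RENDER_MODE


@dataclass
class Page:
    """Hypothetical stand-in for one parsed PDF page."""
    image_count: int
    text_runs: List[TextRun] = field(default_factory=list)


def is_image_only(page: Page) -> bool:
    """Apply the two rules: exactly one image, and no visible text."""
    has_visible_text = any(run.is_visible for run in page.text_runs)
    return page.image_count == 1 and not has_visible_text


# The check is made page by page, so one document can mix results.
scanned = Page(image_count=1)
searchable = Page(image_count=1, text_runs=[TextRun("Invoice", render_mode=3)])
typed = Page(image_count=1, text_runs=[TextRun("Invoice")])

results = [is_image_only(page) for page in (scanned, searchable, typed)]
print(results)  # [True, True, False]
```

The searchable page passes because its only text is hidden, while the third page fails because its text is visible, even though both contain a single image.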
This update will allow you to load a compatible PDF in FormAssist using the same methods as the other image files you were already using. The contents of a compatible PDF are then decoded and extracted to a Device Independent Bitmap (DIB), which can be processed like a DIB from any other image type supported by ImagXpress and FormSuite, such as TIFF or JPEG. All of this will let you use scanned PDF documents immediately within your processing workflow. No document conversion or additional processing is required.
This new feature is being added as part of the ImagXpress API, so your own projects can take advantage of the new image-only PDF handling by including ImagXpress. All you need is a set of your own image-only PDFs, with no visible text or extra images.
Brian Boudreaux, Software Engineer I
Brian Boudreaux started at Accusoft in the support department in June 2017. He is a 2016 alumnus of the University of Central Florida’s Computer Science program. He currently works as a software engineer in Accusoft’s SDK group, contributing to products like FormSuite, ImagXpress, and ImageGear. In his time outside of work, Brian can often be found playing Magic, practicing Shorin-ryu karate, or trying to learn a new language (again).