Answer
To do this:
Use the ImGearPDFPage.GetContent() function to get the elements stored in the PDF page.
Loop through these elements and check whether each one is of the type ImGearPDEImage.
Convert each image to an ImGearPage and find its bit depth.
Use the highest bit depth detected among the images as the bit depth when rasterizing the page.
The code below demonstrates how to detect the bit depth of each page in a PDF document, perform OCR, and save the output with compression.
private static void Recognize(ImGearRecognition engine, string sourceFile, ImGearPDFDocument doc)
{
    using (ImGearPDFDocument outDoc = new ImGearPDFDocument())
    {
        // Import pages
        foreach (ImGearPDFPage pdfPage in doc.Pages)
        {
            // Scan the top-level page content for images and track the highest bit depth
            int highestBitDepth = 0;
            ImGearPDEContent pdeContent = pdfPage.GetContent();
            int contentLength = pdeContent.ElementCount;
            for (int i = 0; i < contentLength; i++)
            {
                ImGearPDEElement el = pdeContent.GetElement(i);
                if (el is ImGearPDEImage)
                {
                    // Create an ImGearPage from the embedded image and find its bit depth
                    int bitDepth = (el as ImGearPDEImage).ToImGearPage().DIB.BitDepth;
                    if (bitDepth > highestBitDepth)
                    {
                        highestBitDepth = bitDepth;
                    }
                }
            }
            if (highestBitDepth == 0)
            {
                // No images were found on the page, or they are embedded deeper in containers,
                // so fall back to a default bit depth of 24 to be safe
                highestBitDepth = 24;
            }
            // Rasterize at the detected bit depth and 200 x 200 DPI
            ImGearRasterPage rasterPage = pdfPage.Rasterize(highestBitDepth, 200, 200);
            using (ImGearRecPage recogPage = engine.ImportPage(rasterPage))
            {
                recogPage.Image.Preprocess();
                recogPage.Recognize();
                ImGearRecPDFOutputOptions options = new ImGearRecPDFOutputOptions()
                {
                    VisibleImage = true,
                    VisibleText = false,
                    OptimizeForPdfa = true,
                    ImageCompression = ImGearCompressions.AUTO,
                    UseUnicodeText = false
                };
                recogPage.CreatePDFPage(outDoc, options);
            }
        }
        outDoc.SaveCompressed(sourceFile + ".result.pdf");
    }
}
For the compression type, I recommend AUTO, which selects the compression type based on the image’s bit depth. The compression types that AUTO uses for each bit depth are:
1 Bit Per Pixel – ImGearCompressions.CCITT_G4
8 Bits Per Pixel – ImGearCompressions.DEFLATE
24 Bits Per Pixel – ImGearCompressions.JPEG
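If you would rather choose the compression explicitly than rely on AUTO, the mapping listed above can be mirrored in a small helper. This is only a sketch: the class and method names (CompressionChoice, ForBitDepth) are hypothetical, the ImageGear.Formats namespace is an assumption to check against your ImageGear version, and any bit depth not listed above falls back to AUTO.

```csharp
using ImageGear.Formats; // assumed namespace of ImGearCompressions

internal static class CompressionChoice
{
    // Hypothetical helper: mirrors the AUTO mapping listed above.
    public static ImGearCompressions ForBitDepth(int bitDepth)
    {
        switch (bitDepth)
        {
            case 1:
                return ImGearCompressions.CCITT_G4;
            case 8:
                return ImGearCompressions.DEFLATE;
            case 24:
                return ImGearCompressions.JPEG;
            default:
                // Bit depths not covered by the list are left to AUTO.
                return ImGearCompressions.AUTO;
        }
    }
}
```

In the Recognize method, this would replace the AUTO setting with ImageCompression = CompressionChoice.ForBitDepth(highestBitDepth).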
Disclaimer: This may not work for all PDF documents because of how some PDFs are structured. If you’re unfamiliar with how PDF content is structured, our documentation has an explanation. The Recognize implementation above only checks the first layer of the page content, so it will not detect images that are embedded inside containers.
However, this should work for documents created by scanners, as the scanned image should be embedded in the first PDF layer. If you have more complex documents, you could write a recursive function that goes through the layers of the PDF to find the images.
If no images are detected in the first layer, the Recognize method above sets the bit depth to 24, just to be on the safe side.
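The recursive search suggested above could be sketched as follows. This is a minimal sketch and not a tested implementation. The class and method names (PdfBitDepthHelper, FindHighestBitDepth) are hypothetical, and it assumes that ImGearPDEContainer and ImGearPDEForm each expose a GetContent() method returning an ImGearPDEContent; please confirm this against the API reference for your ImageGear version. Resource cleanup is omitted, as in the listing above.

```csharp
using ImageGear.Formats.PDF;

internal static class PdfBitDepthHelper
{
    // Returns the highest bit depth among the images in 'content',
    // descending into nested content; returns 0 if no image is found.
    public static int FindHighestBitDepth(ImGearPDEContent content)
    {
        int highest = 0;
        int count = content.ElementCount;
        for (int i = 0; i < count; i++)
        {
            ImGearPDEElement el = content.GetElement(i);
            int depth = 0;

            if (el is ImGearPDEImage image)
            {
                // Same check as in the Recognize method
                depth = image.ToImGearPage().DIB.BitDepth;
            }
            else if (el is ImGearPDEContainer container)
            {
                // Assumption: containers expose their nested content via GetContent()
                depth = FindHighestBitDepth(container.GetContent());
            }
            else if (el is ImGearPDEForm form)
            {
                // Assumption: form XObjects expose their nested content via GetContent()
                depth = FindHighestBitDepth(form.GetContent());
            }

            if (depth > highest)
            {
                highest = depth;
            }
        }
        return highest;
    }
}
```

In the Recognize method, the for loop could then be replaced with int highestBitDepth = PdfBitDepthHelper.FindHighestBitDepth(pdfPage.GetContent()); while keeping the fallback to 24 when the result is 0.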