Technical FAQs

Question

What are the differences between the compressions used in TIFF files?

Answer

The Tagged Image File Format (TIFF) is widely popular, and is particularly used in document imaging. It can support a number of compression types:

  • PackBits – Created by Apple, this lossless compression type uses run-length encoding (RLE). Baseline TIFF readers must support this compression. Use this compression for higher compatibility with various applications.
  • CCITT (Huffman encoding) – Used particularly for encoding bitonal (or bi-level) images. “Group 3” and “Group 4” are particularly known for their use in fax transmission of images. Using this compression type will help keep file sizes small.
  • LZW – A lossless compression type that supports multiple bit depths. Because it’s lossless, it produces files that are generally larger than those produced by lossy compressions such as JPEG. Use this compression if you want to retain the exact visual quality of the image without data loss or artifacts.
  • JPEG – A very popular lossy compression that is used for color and grayscale images and can produce high compression ratios. JPEG allows a good amount of control over how the image in question should be compressed. Use this compression for general color or grayscale images.
  • Deflate – A lossless compression that uses Huffman and LZ77 techniques and also supports different bit depths.
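
To make the run-length idea behind PackBits concrete, here is a minimal decoder sketch in plain C#. It follows the commonly documented PackBits rules (a signed header byte selects either a literal run or a repeated byte) and is not taken from ImageGear; the class and method names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class PackBitsSketch
{
    // Decodes a PackBits-compressed byte sequence into raw bytes.
    public static byte[] Decode(byte[] input)
    {
        var output = new List<byte>();
        int i = 0;
        while (i < input.Length)
        {
            sbyte header = unchecked((sbyte)input[i++]);
            if (header >= 0)
            {
                // Literal run: copy the next (header + 1) bytes unchanged.
                int count = header + 1;
                if (i + count > input.Length)
                    throw new ArgumentException("Truncated literal run.", nameof(input));
                for (int k = 0; k < count; k++)
                    output.Add(input[i++]);
            }
            else if (header != -128)
            {
                // Repeat run: emit the next byte (1 - header) times.
                if (i >= input.Length)
                    throw new ArgumentException("Truncated repeat run.", nameof(input));
                int count = 1 - header;
                byte value = input[i++];
                for (int k = 0; k < count; k++)
                    output.Add(value);
            }
            // A header of -128 is a no-op and is skipped.
        }
        return output.ToArray();
    }
}
```

For instance, the input bytes FE AA 02 80 00 2A describe "repeat AA three times", followed by three literal bytes.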

 

The industry-wide push to digitize documents and minimize the use of physical paperwork has made PDF one of the most ubiquitous file formats in use today. Business and government organizations use PDFs for a variety of document needs because they can be viewed by so many different applications. When it comes to archiving information, however, PDFs have a few limitations that make them unsuitable for long-term storage. That’s why many organizations require such files to be converted into the more specialized PDF/A format.  Learn how easy it is to convert PDF to PDF/A with ImageGear.

What Is PDF/A?

Originally developed for archival purposes, the PDF/A format is utilized for long-term preservation that ensures future readability. It has become the standard format for the archiving of digital documents and files under the ISO 19005-1:2005 specification. Government organizations are increasingly utilizing PDF/A to digitize existing archival material as well as new documents.

The distinctive feature of PDF/A format is its universality. Although PDFs are well entrenched as the de facto standard for digital documents, there are many different ways of assembling a PDF. This results in different viewing experiences and sometimes makes it impossible for certain PDF readers to even open or render a file. Because PDF/A documents need to be accessible in the indeterminate future, there are strict requirements in place to ensure that they will always be readable.

PDF vs PDF/A

While PDF and PDF/A are based upon the same underlying framework, the key difference has to do with the information used to render the document. A standard PDF has many different elements that make up its intended visual appearance. This includes text, images, and other embedded elements. Depending upon the application and method used to create the file, the information needed to render those elements may be more or less accessible for a viewing application.

When a PDF viewer cannot access the necessary data to render elements correctly, the document may not display correctly. Common problems include switched fonts (because the original font information isn’t available), missing images, and misplaced layers.

A PDF/A file is designed to avoid this problem by including everything necessary to display the document accurately. Fonts and images are embedded into the file so that they will be available to any viewer on any device. In effect, a PDF/A doesn’t rely on any external dependencies and leaves nothing to chance when it comes to rendering. The document will look exactly the same no matter what computer or viewing application is used to open it. This level of accuracy and authenticity is important for archival storage, which is why more organizations are turning to PDF/A for long-term file preservation.

How to Convert PDF to PDF/A

ImageGear supports a broad range of PDF functionality, which includes converting PDF format to a compliant PDF/A format. It can also evaluate the contents of a PDF file to verify whether or not it was created in compliance with the established standards for PDF/A format. This is an important feature because it will impact what method is used to ultimately convert a PDF file into a PDF/A file.

Verifying PDF/A Compliance

By analyzing the PDF preflight profile, ImageGear can detect elements of the file to produce a verifier report. The report is generated using the ImGearPDFPreflight.VerifyCompliance method. 

It’s important to remember that this feature does NOT change the PDF document itself. The report also will not verify annotations that have not been applied to the final document itself. Once the report is generated, a status code will be provided for each incompliant element flagged during the analysis. 

These codes can have two values:

  • Fixable: Indicates an incompliance that can be fixed automatically during the PDF/A conversion process.
  • Unfixable: Indicates a more substantial incompliance that will need to be addressed manually before the document is converted into PDF/A.

Converting PDF to PDF/A

After running the verification, it’s time to actually convert the PDF to PDF/A. The ImGearPDFPreflight.Convert method will automatically perform the conversion provided there are no unfixable incompliances. This process will change the PDF document into a PDF/A file and automatically address any incompliances flagged as “Fixable” during the verification process.

While it is not necessary to verify a PDF before attempting conversion, doing so is highly recommended. Without verification, a document that contains unfixable incompliances will fail to convert and return an INCOMPLIANT_DOCUMENT code. The output report’s Records property will provide a detailed report of incompliant elements. Since any “Fixable” incompliances would have been addressed during conversion, the document’s remaining issues will need to be handled manually.

This method is best used when manual changes need to be made to the PDF file prior to conversion. One of the most common changes, for example, is making the PDF searchable. Once the alterations are complete, the new file can be saved using the ImGearPDFDocument.Save method.
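
The verify-then-convert decision described in this section can be sketched as a small triage step. The enum and class below are hypothetical stand-ins written for illustration only; they are not ImageGear types, and the real report exposes its results through the Records property mentioned above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the two status values; not an ImageGear type.
public enum RecordStatus { Fixable, Unfixable }

public static class PreflightTriage
{
    // Returns true when every flagged record can be fixed automatically,
    // meaning automatic conversion can be expected to go through.
    // An empty report (nothing flagged) also returns true.
    public static bool CanAutoConvert(IEnumerable<RecordStatus> statuses)
    {
        return statuses.All(s => s == RecordStatus.Fixable);
    }
}
```

If the check returns false, the flagged elements would need manual changes before conversion is attempted.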

Other ImageGear PDF to PDF/A Conversion Methods

Raster to PDF/A

ImageGear can save any PDF file produced directly from a raster file as a PDF/A during the initial conversion. A series of automatic fixes is performed during this process to ensure compliance.

  • Uncalibrated color spaces are replaced with either an RGB or CMYK color profile. This could change the file size.
  • Any LZW and JPEG 2000 streams are recompressed since PDF/A standards prohibit LZW and JPEG 2000 compression.
  • All document header and metadata values are automatically filled in to comply with PDF/A requirements.

Quick PDF to PDF/A Conversion

For quick conversions in workflows that don’t require displaying or working with a file in any way, the ImGearFileFormats.SaveDocument method is another useful option. This process loads the original file, converts it, and saves the new version all at once. It’s important to set the PreflightOptions property in the save options. Otherwise, the new document will not be saved as a PDF/A compliant file.

Take Control of PDF/A Conversion with ImageGear

Accusoft’s versatile ImageGear SDK provides enterprise-grade document and image processing functions for .NET applications. With support for multiple file formats, ImageGear allows developers to easily convert, compress, and optimize documents for easier viewing and storage.

ImageGear takes your application’s PDF capabilities to a whole new level, delivering annotation, compliant PDF to PDF/A conversion, and other manipulation tools to meet your workflow needs. Learn more about how ImageGear can save you time and resources on development by accessing our detailed developer resources.

Although PDFs are one of the most common document types in use today, not every PDF file is identical. A document with multiple layers, annotations, or editable form fields can create significant challenges for an application, especially when it comes to viewing, printing, and OCR reading. One of the most effective ways of dealing with these PDFs is to use powerful digital tools that “flatten” the document to remove unseen or unnecessary information to reduce the overall complexity of the file.

What Is PDF Flattening?

Flattening can refer to a number of different processes, but in principle, they all accomplish the same goal of merging distinct elements of the document. A few examples of flattening include:

  • Making interactive form elements non-fillable and static.
  • Burning annotations into the document to make them native text.
  • Combining multiple layers of text or images into a single layer, eliminating any non-visible elements.

3 Reasons to Flatten PDFs

There are numerous reasons why an end user may wish to flatten a PDF document, but they usually fall under one of three broad categories.

1. Better Security

Forms often contain valuable information, especially when it comes to financial, insurance, or government forms. If a PDF with editable forms were to fall into the wrong hands, someone could easily alter the information contained in the form to commit fraud or falsify data. By flattening the forms, the entries become a static element of the document and cannot be altered any further. By building applications with the ability to flatten PDF forms, developers can help organizations protect themselves and their customers from the threat of falsified forms.

2. Faster Viewing

Speed is often crucial when it comes to viewing or processing documents. The more information is contained in a PDF, the longer it takes an application to render and view it. While this is sometimes a byproduct of file size, complex or poorly-designed forms can also make a PDF less responsive. Flattening a multi-layered PDF into a single, flattened layer eliminates hidden elements and makes the document much easier to read. This can also apply to forms, which often contain substantial annotation information. Eliminating forms simplifies the document, allowing it to render more quickly.

3. Easier Printing 

Many PDFs contain hidden data that is not visible on a viewing screen, but turns up on the page when the document is printed. Buttons and dropdown fields, for instance, can make a printed document look cluttered and confusing. When form fields are flattened, hidden annotation data is removed, eliminating any unpleasant surprises when the document hits the printer tray. For PDFs with multiple layers and hidden elements, flattening ensures that only the visible portions of the document will appear on the printed version.

How to Flatten a PDF Form Field Using ImageGear

With ImageGear, converting interactive form fields into static page content is a simple process that can be accomplished programmatically before documents are read by an OCR or ICR engine. It can also remove XFA form data, which often creates challenges for forms processing software.

ImageGear provides two options for flattening form fields. Although nearly identical in name, they perform somewhat different functions and should be used in different instances.

  • FlattenFormField: Flattens specified fields into the page.
  • FlattenFormFields: Flattens every field contained in the PDF into the page.

During the flattening process, a boolean can be used to indicate which fields should appear during printing, which is useful for hiding interactive elements that have no use on a printed page (such as buttons). Each field contains annotation information that determines how it should be represented on the page. Fields typically feature one of three flags to dictate their representation:

  • HIDDEN: Any field with this flag will not appear on the page after flattening.
  • NOVIEW: This field will only be visible on the page if “forPrinter” is specified during the flattening process.
  • PRINT: These fields will appear on the page whether or not “forPrinter” is specified. If a field does not have the PRINT flag, it will only appear when “forPrinter” is not specified.

Dealing with XFA Forms

Although officially deprecated by international open PDF standards, Adobe’s proprietary XFA forms are still found in many PDF documents. Opening and editing a PDF that contains XFA data often creates exceptions that make it difficult to manage when it comes to extracting forms information. ImageGear’s FlattenFormFields function will remove any XFA data from a document during the flattening process.

How to Flatten a PDF for OCR Processing with ImageGear

While flattening forms is an effective way of simplifying a document, it doesn’t change the file format itself. The document itself is still a PDF. So while ImageGear’s form flattening features are an effective solution for managing PDFs securely, another approach is often needed for OCR image processing.

Consider, for instance, an insurance solution that needs to be able to extract data from a wide variety of forms. Some of these documents are interactive PDFs with editable forms, some are static PDFs, and still others are scanned images of a document. Rather than devising multiple strategies for dealing with each document type, the solution can streamline the process by simply rasterizing every PDF it receives into an image file, which effectively flattens any form elements it contains.

Once the PDF is flattened into an image, it can easily be run through an OCR engine to match it to the correct form template and then send it to the appropriate database or extract specific form information. This process ensures that all documents coming through the solution can be handled the same way, which makes for a more streamlined and efficient workflow.

Expand Your Application’s PDF Capabilities with ImageGear

Flattening PDFs is just one of many features developers can incorporate into their applications with Accusoft’s ImageGear SDK. Other core functionality includes the ability to annotate, compress, split, and merge PDF files, as well as convert multiple file types to or from PDF format. ImageGear also provides a broad range of PDF security features like access controls, encryption settings, and digital signatures. Get a hands-on trial of ImageGear today for a closer look at what this powerful SDK can do for your application.


Portable Document Format (PDF) files have become the ubiquitous way to store documents for sharing with a broad audience. While popular, PDF documents have several drawbacks. One large drawback is the fact that PDF documents are intended to be immutable. In other words, PDF documents lack the internal information necessary to reorganize their contents, unlike, for instance, a word processor document. Here’s how to resize PDF files with ImageGear.

There are, however, ways to reclaim portions of PDF documents for use in new or updated PDF documents. One of the most common is to reuse pages from existing PDF documents. This does lead to one particularly vexing issue: reusing pages that were created for different media sizes.

When a PDF document has pages that are all the same size, PDF viewers can scale and scroll the document consistently, and the document appears aesthetically pleasing. The document is also likely to print well. The problem is that PDF pages from image scanners, PDF pages produced by word processors, and PDF pages generated from images are likely to be produced with different media sizes.

This is where ImageGear .NET can help. ImageGear .NET can resize PDF pages using the following code.

First, we need to determine what size we want the pages to be. We will use this size to define the “MediaBox”, which specifies the outermost boundaries of the page. For this example, we will use 8.5 inches by 11 inches, which is letter size.


ImGearPDFDocument igPDFDocument;
using (Stream stream = new FileStream(@"PDFDocumentIn.pdf", FileMode.Open, FileAccess.Read))
    igPDFDocument = (ImGearPDFDocument)ImGearFileFormats.LoadDocument(stream, 0, -1);

double newMediaBoxWidth = 8.5 * 72.0; // Convert inches to points (72 points per inch)
double newMediaBoxHeight = 11.0 * 72.0;

 

Next, we iterate through the pages in the PDF document, and resize each page. To resize a page, we first need to determine how much to scale and translate the page.


foreach (ImGearPDFPage page in igPDFDocument.Pages)
{
    using (ImGearPDFBasDict pageDict = page.GetDictionary())
    {
        // Get the existing MediaBox to determine how much to scale and translate the page.
        // Note: the casts below assume the MediaBox entries are stored as integers.
        ImGearPDFAtom mediaBoxKey = new ImGearPDFAtom("MediaBox");
        ImGearPDFBasArray mediaBox = (ImGearPDFBasArray)pageDict.Get(mediaBoxKey);
        double mediaBoxLowerLeftX = ((ImGearPDFBasInt)mediaBox.Get(0)).Value;
        double mediaBoxLowerLeftY = ((ImGearPDFBasInt)mediaBox.Get(1)).Value;
        double mediaBoxUpperRightX = ((ImGearPDFBasInt)mediaBox.Get(2)).Value;
        double mediaBoxUpperRightY = ((ImGearPDFBasInt)mediaBox.Get(3)).Value;
        // Calculate how much to scale each axis to fill the page
        double scaleX = newMediaBoxWidth / (mediaBoxUpperRightX - mediaBoxLowerLeftX);
        double scaleY = newMediaBoxHeight / (mediaBoxUpperRightY - mediaBoxLowerLeftY);
        // Use the smaller factor so the whole page fits without distortion
        double scale = Math.Min(scaleX, scaleY);
        // Determine how much to shift the content to center the page.
        // A point x maps to scale * x + translateX, so the old lower-left corner
        // is scaled as well and must be subtracted after scaling.
        double translateX = (newMediaBoxWidth - (mediaBoxUpperRightX - mediaBoxLowerLeftX) * scale) / 2.0 - mediaBoxLowerLeftX * scale;
        double translateY = (newMediaBoxHeight - (mediaBoxUpperRightY - mediaBoxLowerLeftY) * scale) / 2.0 - mediaBoxLowerLeftY * scale;
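
As a standalone check of the scale and centering arithmetic, the sketch below repeats the calculation with plain doubles and no ImageGear types. It assumes the source MediaBox has its lower-left corner at (0, 0), which is the common case; the class and method names are invented for this example.

```csharp
using System;

public static class PageFit
{
    // Returns the uniform scale and the translation that fit a source page
    // of the given size inside a target page, centered on both axes.
    // Assumes both pages have their lower-left corner at (0, 0).
    public static (double Scale, double TranslateX, double TranslateY) Fit(
        double sourceWidth, double sourceHeight,
        double targetWidth, double targetHeight)
    {
        // The smaller factor keeps the whole page inside the target.
        double scale = Math.Min(targetWidth / sourceWidth, targetHeight / sourceHeight);
        // Split the leftover space evenly between the two sides.
        double translateX = (targetWidth - sourceWidth * scale) / 2.0;
        double translateY = (targetHeight - sourceHeight * scale) / 2.0;
        return (scale, translateX, translateY);
    }
}
```

Fitting an A4 page (595 × 842 points) onto letter (612 × 792 points) gives a scale of roughly 0.94 limited by the height, with a margin of about 26 points on the left and right and none on the top and bottom.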
 

 

Next, create an Affine matrix to scale and translate the page.


        // Create an Affine matrix to scale and translate the page
        ImGearPDFFixedMatrix scaleMatrix = new ImGearPDFFixedMatrix
        {
            A = ImGearPDF.DoubleToFixed(scale),
            D = ImGearPDF.DoubleToFixed(scale),
            H = ImGearPDF.DoubleToFixed(translateX),
            V = ImGearPDF.DoubleToFixed(translateY)
        };

 

Using the Affine matrix, transform the contents of the page. Since we will be transforming individual elements on the page, we need to keep track of each transformed element (using transformedIDs) so we don’t transform any element more than once.


        try
        {
            // Transform all the elements on each page. Keep track of transformed elements
            // so that no element is transformed more than once.
            using (ImGearPDEContent content = page.GetContent())
            {
                List<int> transformedIDs = new List<int>();
                TransformContent(content, scaleMatrix, transformedIDs);
                page.SetContent();
            }
        }
        finally
        {
            page.ReleaseContent();
        }

 

Now that the page has been transformed, set the new MediaBox.


        using (ImGearPDFBasArray newMediaBox = new ImGearPDFBasArray((ImGearPDFDocument)page.Document, false, 4))
        {
            // Update the MediaBox
            newMediaBox.PutFixed(0, false, ImGearPDF.DoubleToFixed(0.0));
            newMediaBox.PutFixed(1, false, ImGearPDF.DoubleToFixed(0.0));
            newMediaBox.PutFixed(2, false, ImGearPDF.DoubleToFixed(newMediaBoxWidth));
            newMediaBox.PutFixed(3, false, ImGearPDF.DoubleToFixed(newMediaBoxHeight));
            pageDict.Put(mediaBoxKey, newMediaBox);
            // Remove any existing CropBox
            ImGearPDFAtom cropBoxKey = new ImGearPDFAtom("CropBox");
            if (pageDict.Known(cropBoxKey))
                pageDict.Remove(cropBoxKey);
        }
    } // end of the pageDict using block
} // end of the foreach loop over pages

 

Now we need the function TransformContent() to transform the content of a PDF page. This will take each element on a page and individually transform it to its new location on the page.


private void TransformContent(ImGearPDEContent content, ImGearPDFFixedMatrix scaleMatrix, List<int> transformedIDs)
{
   // If there is a matrix in the content attributes, transform it.
   ImGearPDEContentAttrs contentAttributes = content.GetAttributes();
   contentAttributes.Matrix = Concat(contentAttributes.Matrix, scaleMatrix);
   // Transform each element in the content
   for (int i = content.ElementCount - 1; i >= 0; i--)
   	using (ImGearPDEElement pdeElement = content.GetElement(i))
   	     TransformElement(pdeElement, scaleMatrix, transformedIDs);
}

 

Now we need the function TransformElement() to transform individual elements on a PDF page. Note that some elements contain elements and even content. These elements and content will be transformed recursively (hence the need for the transformedIDs list).


private void TransformElement(ImGearPDEElement pdeElement, ImGearPDFFixedMatrix scaleMatrix, List<int> transformedIDs)
{
    if (!transformedIDs.Contains(pdeElement.UniqueId))
    {
        transformedIDs.Add(pdeElement.UniqueId);
        switch (pdeElement.Type)
        {
            case ImGearPDEType.CONTAINER:
                ImGearPDEContainer pdeContainer = (ImGearPDEContainer)pdeElement;
                using (ImGearPDEContent moreContent = pdeContainer.GetContent())
                    TransformContent(moreContent, scaleMatrix, transformedIDs);
                break;
            case ImGearPDEType.CLIP:
                ImGearPDEClip pdeClip = (ImGearPDEClip)pdeElement;
                for (int i = pdeClip.ElementCount - 1; i >= 0; --i)
                    using (ImGearPDEElement anotherElement = pdeClip.GetElement(i))
                        TransformElement(anotherElement, scaleMatrix, transformedIDs);
                break;
            case ImGearPDEType.GROUP:
                ImGearPDEGroup pdeGroup = (ImGearPDEGroup)pdeElement;
                using (ImGearPDEContent moreContent = pdeGroup.GetContent())
                    TransformContent(moreContent, scaleMatrix, transformedIDs);
                break;
            case ImGearPDEType.TEXT:
                ImGearPDEText pdeText = (ImGearPDEText)pdeElement;
                for (int i = 0; i < pdeText.RunsCount; ++i)
                    pdeText.RunSetMatrix(i, Concat(pdeText.GetMatrix(ImGearPDETextFlags.RUN, i), scaleMatrix));
                break;
            case ImGearPDEType.FORM:
                ImGearPDEForm pdeForm = (ImGearPDEForm)pdeElement;
                pdeForm.SetMatrix(Concat(pdeForm.GetMatrix(), scaleMatrix));
                using (ImGearPDEContent moreContent = pdeForm.GetContent())
                    TransformContent(moreContent, scaleMatrix, transformedIDs);
                break;
            default:
                pdeElement.SetMatrix(Concat(pdeElement.GetMatrix(), scaleMatrix));
                break;
        }
        // Elements other than clips may carry their own clip, which must be transformed too
        if (pdeElement.Type != ImGearPDEType.CLIP)
            using (ImGearPDEElement pdeClip = pdeElement.GetClip())
                if (pdeClip != null && pdeClip.Type == ImGearPDEType.CLIP)
                    TransformElement(pdeClip, scaleMatrix, transformedIDs);
    }
}

 

The last part we need is a function to concatenate (multiply) two Affine matrices together.


private ImGearPDFFixedMatrix Concat(ImGearPDFFixedMatrix matrix1, ImGearPDFFixedMatrix matrix2)
{
    // Multiply two Affine transformation matrices together to produce one matrix
    // that will perform the same transformation as the two matrices performed in series
    double matrix1A = ImGearPDF.FixedToDouble(matrix1.A);
    double matrix1B = ImGearPDF.FixedToDouble(matrix1.B);
    double matrix1C = ImGearPDF.FixedToDouble(matrix1.C);
    double matrix1D = ImGearPDF.FixedToDouble(matrix1.D);
    double matrix1H = ImGearPDF.FixedToDouble(matrix1.H);
    double matrix1V = ImGearPDF.FixedToDouble(matrix1.V);
    double matrix2A = ImGearPDF.FixedToDouble(matrix2.A);
    double matrix2B = ImGearPDF.FixedToDouble(matrix2.B);
    double matrix2C = ImGearPDF.FixedToDouble(matrix2.C);
    double matrix2D = ImGearPDF.FixedToDouble(matrix2.D);
    double matrix2H = ImGearPDF.FixedToDouble(matrix2.H);
    double matrix2V = ImGearPDF.FixedToDouble(matrix2.V);
    ImGearPDFFixedMatrix result = new ImGearPDFFixedMatrix
    {
        A = ImGearPDF.DoubleToFixed(matrix1A * matrix2A + matrix1B * matrix2C),
        B = ImGearPDF.DoubleToFixed(matrix1A * matrix2B + matrix1B * matrix2D),
        C = ImGearPDF.DoubleToFixed(matrix1C * matrix2A + matrix1D * matrix2C),
        D = ImGearPDF.DoubleToFixed(matrix1C * matrix2B + matrix1D * matrix2D),
        H = ImGearPDF.DoubleToFixed(matrix1H * matrix2A + matrix1V * matrix2C + matrix2H),
        V = ImGearPDF.DoubleToFixed(matrix1H * matrix2B + matrix1V * matrix2D + matrix2V)
     };
     return result;
}
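
To see why this multiplication order applies the first matrix and then the second, here is the same formula with plain doubles. The Affine struct is a hypothetical stand-in for illustration, not an ImageGear type, and it assumes the row-vector convention that PDF uses for its transformation matrices.

```csharp
using System;

// Plain-double stand-in for a PDF affine matrix [A B C D H V]; not an ImageGear type.
public readonly struct Affine
{
    public readonly double A, B, C, D, H, V;

    public Affine(double a, double b, double c, double d, double h, double v)
    {
        A = a; B = b; C = c; D = d; H = h; V = v;
    }

    // Returns the matrix that applies m1 first and then m2.
    public static Affine Concat(Affine m1, Affine m2) => new Affine(
        m1.A * m2.A + m1.B * m2.C,
        m1.A * m2.B + m1.B * m2.D,
        m1.C * m2.A + m1.D * m2.C,
        m1.C * m2.B + m1.D * m2.D,
        m1.H * m2.A + m1.V * m2.C + m2.H,
        m1.H * m2.B + m1.V * m2.D + m2.V);

    // Maps the point (x, y) through this matrix.
    public (double X, double Y) Apply(double x, double y) =>
        (x * A + y * C + H, x * B + y * D + V);
}
```

Mapping a point through the concatenated matrix should give the same result as mapping it through the first matrix and then through the second, which is a quick way to sanity-check the formula.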

 

After modifying the PDF document, save it.


using (Stream stream = new FileStream(@"PDFDocumentOut.pdf", FileMode.Create, FileAccess.Write))
    igPDFDocument.Save(stream, ImGearSavingFormats.PDF, 0, 0, -1, ImGearSavingModes.OVERWRITE);

 

Using this code, you should be able to resize any PDF document page. 

To learn more about ImageGear and all of its capabilities, check out the ImageGear .NET product page and dive into the developer resources section.


Imaging software is one of the core foundations of Accusoft. However, there can be some complex concepts and terms related to imaging that lead to bad practices or misleading expectations. This guide was created to get you up and running with some common knowledge, as well as give you some terms to research and things to keep in mind when you begin using our products. 

Two Formats for Representing Image Data

There are two different ways to represent image data on a screen. These are Raster and Vector format, and each type has its advantages and disadvantages that should be understood before working with them. It is important to note that they are each independent of each other and, although you can convert from one format to another, you cannot make any assumptions about a raster image based solely upon its vector counterpart, and vice versa. With that being said, we’ll be focusing on Raster format for this post.

Raster Format

Raster, in its simplest form, is nothing more than an array of pixels organized into a grid, or matrix, where each cell contains a value representing information, with color data being the most common. These grids, along with the data within each ‘cell’ (called a pixel) come together to form the images we see on our screens. Some of the most common file types for Raster Data are JPEG, JPEG2K, Exif, TIFF, GIF, BMP, and PNG.

To get an image to display digitally, it must go through a process called sampling and quantization. Sampling takes the continuous image and breaks it into a matrix. Quantization then takes this grid and places a ‘quantity’, or numeric value, into each of the cells that represents the color to be displayed there.
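
As a rough illustration of the quantization step, the sketch below maps a continuous intensity to one of the integer levels available at a given bit depth. It assumes intensities are normalized to the range 0 to 1 and uses simple uniform rounding; real imaging pipelines may quantize differently, and the names are invented for this example.

```csharp
using System;

public static class Quantizer
{
    // Maps a continuous intensity in [0, 1] to an integer level in [0, 2^bits - 1].
    // Values outside [0, 1] are clamped to the nearest end of the range.
    public static int Quantize(double intensity, int bits)
    {
        if (bits < 1 || bits > 16)
            throw new ArgumentOutOfRangeException(nameof(bits));
        int maxLevel = (1 << bits) - 1;
        double clamped = Math.Max(0.0, Math.Min(1.0, intensity));
        return (int)Math.Round(clamped * maxLevel, MidpointRounding.AwayFromZero);
    }
}
```

With 8 bits, an intensity of 1.0 lands on level 255; with 1 bit, every value collapses to either 0 or 1.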

Through this process, the computer is then able to interpret the data and display it accurately on a screen.

The Pixel

To understand raster data, it’s fundamental to understand the pixel, similar to how biologists must understand the cell to understand whole organisms. A pixel (short for picture element) is an individual cell of a matrix that is mapped to a segment of the screen. This matrix represents our raster image.

Viewed up close, a raster image is made up of many individual squares, each a single color. Each square is a pixel, holding a numerical value that represents a color. Each pixel has an address, which is its location in the matrix, and you can use that address to read or even modify the data stored there.


The grid starts with the top-left-most pixel as the origin (0, 0), and the bottom-right-most pixel is at (width-1, height-1) in a zero-indexed array. Different file formats will store this data slightly differently, and usually include some metadata and header information along with the pixel data for context.
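
A minimal sketch of pixel addressing follows. It assumes the pixels are stored row by row in a single array with no padding between rows, which is only one possible layout; as noted above, actual file formats vary. The names are invented for this example.

```csharp
using System;

public static class PixelGrid
{
    // Converts an (x, y) pixel address into an index in a row-major array
    // whose origin (0, 0) is the top-left pixel.
    public static int IndexOf(int x, int y, int width, int height)
    {
        if (x < 0 || x >= width || y < 0 || y >= height)
            throw new ArgumentOutOfRangeException();
        return y * width + x;
    }
}
```

In a 4 × 3 image, the origin maps to index 0 and the bottom-right pixel at (3, 2) maps to index 11, the last element of the array.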

Color Depth

Now, you may be wondering, “How much data can be stored within a single pixel?” It depends upon its color-depth or bit-depth, which is measured in bpp, or bits per pixel. The common values for bit depth, and the number of colors each can represent, are listed below:

  • 1 bpp = 2^1 = 2 colors (Black and White)
  • 2 bpp = 2^2 = 4 colors
  • 4 bpp = 2^4 = 16 colors
  • 8 bpp = 2^8 = 256 colors (Grayscale OR Limited Color)
  • 16 bpp = 2^16 = 65,536 colors (R/G/B Color)
  • 24 bpp = 2^24 = 16,777,216 colors (R/G/B Color)

…and so on!
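
The pattern in the list is simply two raised to the number of bits, which can be computed with a bit shift. This is a small illustrative helper with an invented name, not part of any SDK.

```csharp
using System;

public static class ColorDepth
{
    // Returns the number of distinct values a pixel can hold at the given bit depth.
    public static long ColorCount(int bitsPerPixel)
    {
        if (bitsPerPixel < 1 || bitsPerPixel > 62)
            throw new ArgumentOutOfRangeException(nameof(bitsPerPixel));
        return 1L << bitsPerPixel;
    }
}
```

Calling ColorCount(24) returns 16,777,216, matching the last entry in the list.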

Bits per pixel refers to the amount of data stored within each pixel. This is generally used when referencing color data. By far the most common values are 1 (for black and white) and 24 (for color images). With 1-bit images, you can represent two unique colors, with the pixel switching between black and white by ‘flipping the bit’. Most forms used in the business world use this bit depth. This will result in a smaller file size, but you will not be able to represent any complex color data with such a limited bpp.

To represent color, the most common format for digital images is 24 bpp, usually associated with the RGB (Red/Green/Blue) color space, also known as the Additive Color Model. This allocates 1 byte to each color (or channel) and thus allows for ~16 million unique colors. It creates all of these colors by combining red, green, and blue in various ratios. For printed images it is common to see 32 bpp used to represent color in the form of CMYK (Cyan/Magenta/Yellow/Key (Black)). This model is subtractive: instead of adding colors together to create its ~4 billion unique values, it subtracts colors from other colors. K, or black, corresponds to the subtraction of all of these colors, which is why it is used for printing. These two color models are opposites of each other.

32 bpp can also be used to represent an RGB image that has an 8-bit alpha channel appended to it, referred to as RGBA. This alpha channel stores the ‘transparency’ or ‘opacity’ of the image.
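
To illustrate how four 8-bit channels add up to 32 bpp, the sketch below packs red, green, blue, and alpha into one 32-bit value. The 0xRRGGBBAA channel order is an assumption chosen for the example; real formats and APIs order the channels in different ways, and the names here are invented.

```csharp
using System;

public static class Rgba
{
    // Packs four 8-bit channels into one 32-bit value laid out as 0xRRGGBBAA.
    public static uint Pack(byte r, byte g, byte b, byte a) =>
        ((uint)r << 24) | ((uint)g << 16) | ((uint)b << 8) | (uint)a;

    // Extracts the alpha (opacity) channel from a packed value.
    public static byte Alpha(uint packed) => (byte)(packed & 0xFFu);
}
```

A fully red pixel at roughly half opacity, Pack(255, 0, 0, 128), becomes 0xFF000080.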

Learn more about color depth, resolution, vector formats, and more in the rest of my article here.

Question

I am trying to perform OCR on a PDF created from a scanned document. I need to rasterize the PDF page before importing the page into the recognition engine. When rasterizing the PDF page I want to set the bit depth of the generated page to be equal to the bit depth of the embedded image so I may use better compression methods for 1-bit and 8-bit images.

ImGearPDFPage.DIB.BitDepth will always return 24 for the bit depth of a PDF. Is there a way to detect the bit depth based on the PDF’s embedded content?

Answer

To do this:

  1. Use the ImGearPDFPage.GetContent() function to get the elements stored in the PDF page.
  2. Then loop through these elements and check if they are of the type ImGearPDEImage.
  3. Convert the image to an ImGearPage and find its bit depth.
  4. Use the highest bit depth detected from the images as the bit depth when rasterizing the page.

The code below demonstrates how to detect the bit depth of each page in a PDF document, perform OCR, and save the output while using compression.

private static void Recognize(ImGearRecognition engine, string sourceFile, ImGearPDFDocument doc)
    {
        using (ImGearPDFDocument outDoc = new ImGearPDFDocument())
        {
            // Import pages
            foreach (ImGearPDFPage pdfPage in doc.Pages)
            {
                int highestBitDepth = 0;
                ImGearPDEContent pdeContent = pdfPage.GetContent();
                int contentLength = pdeContent.ElementCount;
                for (int i = 0; i < contentLength; i++)
                {
                    ImGearPDEElement el = pdeContent.GetElement(i);
                    if (el is ImGearPDEImage image)
                    {
                        // Create an ImGearPage from the embedded image and read its bit depth
                        int bitDepth = image.ToImGearPage().DIB.BitDepth;
                        if (bitDepth > highestBitDepth)
                        {
                            highestBitDepth = bitDepth;
                        }
                    }
                }
                if(highestBitDepth == 0)
                {
                    // No images found at the top level (or they are nested inside containers): fall back to 24 bpp to be safe
                    highestBitDepth = 24;
                }
                ImGearRasterPage rasterPage = pdfPage.Rasterize(highestBitDepth, 200, 200);
                using (ImGearRecPage recogPage = engine.ImportPage(rasterPage))
                {
                    recogPage.Image.Preprocess();
                    recogPage.Recognize();
                    ImGearRecPDFOutputOptions options = new ImGearRecPDFOutputOptions()
                    {
                        VisibleImage = true,
                        VisibleText = false,
                        OptimizeForPdfa = true,
                        ImageCompression = ImGearCompressions.AUTO,
                        UseUnicodeText = false
                    };
                    recogPage.CreatePDFPage(outDoc, options);
                }
            }
            outDoc.SaveCompressed(sourceFile + ".result.pdf");
        }
    }

For the compression type, I would recommend setting it to AUTO. AUTO will set the compression type depending on the image’s bit depth. The compression types that AUTO uses for each bit depth are: 

  • 1 Bit Per Pixel – ImGearCompressions.CCITT_G4
  • 8 Bits Per Pixel – ImGearCompressions.DEFLATE
  • 24 Bits Per Pixel – ImGearCompressions.JPEG
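The table above can be written out as a small lookup, which makes it easier to see why rasterizing at the embedded image's bit depth matters. This is only a sketch of the mapping as listed here, not ImageGear's implementation: the helper name is hypothetical, and it returns the ImGearCompressions member names as strings so it compiles without the ImageGear assemblies.

```csharp
using System;

public static class AutoCompressionSketch
{
    // Mirrors the AUTO table above. Returns the ImGearCompressions member name
    // as a string; this helper is illustrative and not part of ImageGear.
    public static string CompressionNameFor(int bitDepth)
    {
        switch (bitDepth)
        {
            case 1:  return "CCITT_G4";
            case 8:  return "DEFLATE";
            case 24: return "JPEG";
            default:
                // Only 1, 8, and 24 bpp are listed above; other depths are not covered here.
                throw new ArgumentOutOfRangeException(nameof(bitDepth), "Bit depth not listed for AUTO.");
        }
    }

    public static void Main()
    {
        foreach (int depth in new[] { 1, 8, 24 })
        {
            Console.WriteLine($"{depth} bpp -> ImGearCompressions.{CompressionNameFor(depth)}");
        }
    }
}
```

A bitonal scan rasterized at 24 bpp would fall into the JPEG row, while the same page rasterized at 1 bpp would be stored with CCITT Group 4.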

Disclaimer: This may not work for all PDF documents because of how some PDFs are structured. If you’re unfamiliar with how PDF content is structured, we have an explanation in our documentation. The implementation above only checks the first layer of the PDF, so it will not detect images that are embedded inside containers.

However, this should work for documents created by scanners, as the scanned image should be embedded in the first PDF layer. If you have more complex documents, you could write a recursive function that goes through the layers of the PDF to find the images.

The above code will set the bit depth to 24 if it wasn’t able to detect any images in the first layer, just to be on the safe side.
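For more complex documents, the recursive search mentioned earlier might look like the sketch below. The Element, ImageElement, and ContainerElement classes are hypothetical stand-ins, not ImageGear types; with ImageGear you would substitute the actual PDE element and container classes and whatever call exposes a container's nested content, so check the documentation before adapting it.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for PDF content elements; these are NOT ImageGear types.
public abstract class Element { }

public sealed class ImageElement : Element
{
    public int BitDepth { get; }
    public ImageElement(int bitDepth) { BitDepth = bitDepth; }
}

public sealed class ContainerElement : Element
{
    public List<Element> Children { get; } = new List<Element>();
}

public static class BitDepthSearch
{
    // Returns the highest bit depth of any image at any nesting level, or 0 if none is found.
    public static int HighestBitDepth(IEnumerable<Element> elements)
    {
        int highest = 0;
        foreach (Element el in elements)
        {
            if (el is ImageElement image)
            {
                highest = Math.Max(highest, image.BitDepth);
            }
            else if (el is ContainerElement container)
            {
                // Recurse into nested content instead of stopping at the first layer.
                highest = Math.Max(highest, HighestBitDepth(container.Children));
            }
        }
        return highest;
    }

    public static void Main()
    {
        // A 1 bpp image at the top level and an 8 bpp image two containers deep.
        var inner = new ContainerElement();
        inner.Children.Add(new ImageElement(8));
        var outer = new ContainerElement();
        outer.Children.Add(inner);
        var page = new List<Element> { new ImageElement(1), outer };

        int depth = HighestBitDepth(page);
        Console.WriteLine(depth == 0 ? 24 : depth); // 8
    }
}
```

The same 24 bpp fallback still applies when the search returns 0, that is, when no image is found at any level.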

convert pdf

PDFs are everywhere. Vice calls them “the world’s most important file format,” and that’s not far off the mark. The sheer number of documents converted to, from, and often back to PDFs is astounding. The hard truth? They’re also frustrating to work with. Start a Google search with the word “convert” and three of the top five results involve PDFs. 

While this portable document format lives up to its name by making it easy for users to attach and send documents across their organizations, PDFs often run into problems when it comes to conversion, collaboration, and communication. While many tools offer piecemeal PDF functionality, they lack a complete set of critical capabilities, in turn forcing software engineers to use multiple software solutions for seemingly simple tasks.

ImageGear offers a different take on the standard software development kit (SDK) designed to help developers maximize their PDF potential. Here’s how it works. 


The Value of PDF Conversion

While PDF conversion is one of the top sought-after functionalities, there’s another area that’s often overlooked: modifying the characteristics of PDFs on-screen. With companies now handling PDFs from multiple sources that may include everything from computer-generated form data to handwritten information and images, it’s no surprise that staff encounter a wide variety of viewing issues.

ImageGear PDF helps solve these problems by allowing users to call the shots on PDF content at scale with features such as:

  • Conversion
  • Metadata Management
  • Content and Font Editing
  • Text Extraction
  • PDF Watermarking
  • Container, Dictionary, and Layer Creation
  • 3D Asset Modification

ImageGear PDF also helps improve document processing with document cleanup and advanced optical character recognition (OCR). With the ability to encrypt and decrypt entire images (or part of an image), automatic ImageClean correction of white text blocks, borders, and inverted images, plus intelligent re-sizing, any PDF can be cleaned and made more readable for the user. 

OCR support for almost any document type is also a benefit. This includes documents produced on typewriters and on dot-matrix, ink-jet, and laser printers, as well as photocopied, scanned, and faxed documents. ImageGear PDF helps users control and customize multiple PDF variables, making it a fully functional PDF conversion solution for your application.


PDF Pain Points

One of the biggest PDF frustrations? The inability to break apart and combine PDF documents. Let’s imagine you have a massive legal PDF or in-depth medical file. In these circumstances, professionals only need a portion of the PDF, but without the right tools they’re stuck sending entire files when all they need is a single page. In other cases, employees might have a host of related PDFs that are part of the same project, but can’t be easily combined to save space and time.

ImageGear PDF has you covered with the ability to easily delete or insert PDF pages, render pages in a single PDF, split a PDF, merge two or more PDFs into a single file, or even merge specific pages from two or more PDFs into a single PDF. This not only makes a massive difference in time spent working with PDF documents, it helps reduce unnecessary storage and transmission of multiple files. 


Convert PDF: Multiple File Formats for Conversion

Conversion is critical for PDF success. Instead of creating complexity by forcing end-users to stick with original file formats, implementing an SDK with cutting-edge conversion empowers corporate consistency and saves on storage space. ImageGear PDF supports a host of common file formats for conversion including Microsoft Office, JPEG 2000, CAD, and SVG.

Of course, no feature-forward PDF framework is complete without robust annotation, redaction, and commenting capabilities. These features make it easy for other users to see exactly what’s been changed, when, and why, along with providing a critical, auditable paper trail to meet evolving compliance and regulatory standards.


PDF Functionality for Your Application

Best of all, ImageGear isn’t designed to replace your current software, but to integrate alongside existing workflows. Rather than adding another application to already-overloaded IT arsenals, straightforward SDK integration means everything happens within your own application, making it easy for everyone to find exactly what they’re looking for within familiar territory. Need help jumpstarting your SDK deployment? Check out our full list of ImageGear .NET samples for ASP.NET, CAD, OCR support, and more.

PDFs remain eternally popular and continually frustrating. Solve for document viewing, split and merge, and conversion issues and streamline employee efforts with ImageGear.

document conversion

Not all file formats are created equal. Some — like the .docx files produced by the ever-popular Microsoft Word — are ideal for creating and editing text-based documents, while others offer the high resolution necessary for medical images or the security required for legal case files.

Challenges emerge, however, when businesses need the same information, but require a different file format. Recreating the document or image from scratch is a waste of time and resources, while leveraging free online programs to make the switch introduces potential security risks. As noted by 9to5Mac, 23 file conversion apps for iOS were recently found to completely lack encryption, putting both information and organizations at risk. Companies need to simplify the switch with robust document conversion solutions capable of delivering both speed and security.


Scale of the Switch

A quick Google search for the phrase “convert to PDF” turns up more than 3 billion search results. It makes sense. PDF documents can be easily password protected and converted to read-only, making them ideal for companies that need to share data but don’t want it modified.

As noted above, Office files such as .docx remain common for business use along with other Office staples such as .xls and .ppt, but businesses are regularly tasked with converting other file types — often sent by customers or suppliers — into Microsoft-friendly formats.

The result is a landscape full of “free” tools that are long on document conversion promises but short on details about what’s supported, how conversion takes place, and who has access to your data. Given the scale of document conversion requests, the use of free tools can bridge functional gaps, even as they create more distance between documents and key defensive measures. 

Application switching is also a challenge. Since most free tools convert only a subset of file types, users may need to navigate multiple apps and conversion steps for a single file. As noted by Forbes, this continual app switching can waste up to 32 days’ worth of productivity per year.


Speaking the Same Language

Accusoft’s ImageGear SDK solves the conversion challenge by putting more than 100 file types under one digital roof. Some of the most popular conversion processes include:

  • Microsoft Office – ImageGear offers support for Word, Excel, PowerPoint, JPG, and more with enhanced rendering for near-native Office support.
  • CAD – Convert AutoCAD files such as DWG, DXF, or DGN to PDF, JPEG, and SVG. CAD conversion supports both 2D and 3D images along with changes in light source, layers, and perspective.
  • Adobe/PDF – As noted above, “convert to PDF” is one of the web’s most popular searches. Easily convert to and from EPS, PDF, or PDF/A with ImageGear’s comprehensive PDF API.
  • Raster Images – Edit, compress, and annotate dozens of raster files including TIFF, JPEG, PNG, PSD, RAW, and PDF.
  • Medical Images – Part of the ImageGear Collection, ImageGear Medical preserves medical image consistency and quality with conversion to and from DICOM, JPEG 2000, and other popular file types. ImageGear Medical also includes full DICOM metadata support.
  • Vector Images – Dozens of vector images including SVG, EPS, PDF, and DXF can be easily converted with ImageGear.

Find the full list of supported file types here.


Security by Design

Data security matters. From legal firms to financial institutions, the reputational risks and regulatory penalties facing companies that don’t secure data by default are on the rise. The ability to quickly and seamlessly convert files from editable to read-only formats both enhances document security and improves overall defense. 

The easiest way to achieve this goal? Integrated, in-app file conversion. 

By removing the external risk of third-party apps and leveraging advanced SDKs that integrate into your own secure software, organizations can both protect the process of document conversion and deploy the annotations, permissions, and redactions necessary to keep documents safe. Simplify the switch. Deliver in-app, secure document conversion on-demand with ImageGear.