Technical FAQs

Question

I am trying to perform OCR on a PDF created from a scanned document. I need to rasterize the PDF page before importing the page into the recognition engine. When rasterizing the PDF page I want to set the bit depth of the generated page to be equal to the bit depth of the embedded image so I may use better compression methods for 1-bit and 8-bit images.

ImGearPDFPage.DIB.BitDepth will always return 24 for the bit depth of a PDF. Is there a way to detect the bit depth based on the PDF’s embedded content?

Answer

To do this:

  1. Use the ImGearPDFPage.GetContent() function to get the elements stored in the PDF page.
  2. Then loop through these elements and check if they are of the type ImGearPDEImage.
  3. Convert the image to an ImGearPage and find its bit depth.
  4. Use the highest bit depth detected from the images as the bit depth when rasterizing the page.

The code below demonstrates how to detect the bit depth of each page in a PDF document, perform OCR, and save the output using compression.

private static void Recognize(ImGearRecognition engine, string sourceFile, ImGearPDFDocument doc)
{
    using (ImGearPDFDocument outDoc = new ImGearPDFDocument())
    {
        // Import pages
        foreach (ImGearPDFPage pdfPage in doc.Pages)
        {
            // Find the highest bit depth among the images at the top level of the page content.
            int highestBitDepth = 0;
            ImGearPDEContent pdeContent = pdfPage.GetContent();
            int contentLength = pdeContent.ElementCount;
            for (int i = 0; i < contentLength; i++)
            {
                ImGearPDEElement el = pdeContent.GetElement(i);
                if (el is ImGearPDEImage image)
                {
                    // Create an ImGearPage from the embedded image and read its bit depth.
                    int bitDepth = image.ToImGearPage().DIB.BitDepth;
                    if (bitDepth > highestBitDepth)
                    {
                        highestBitDepth = bitDepth;
                    }
                }
            }
            if (highestBitDepth == 0)
            {
                // No images were found at the top level (there are none, or they are
                // nested inside containers), so fall back to 24 bits to be safe.
                highestBitDepth = 24;
            }

            ImGearRasterPage rasterPage = pdfPage.Rasterize(highestBitDepth, 200, 200);
            using (ImGearRecPage recogPage = engine.ImportPage(rasterPage))
            {
                recogPage.Image.Preprocess();
                recogPage.Recognize();
                ImGearRecPDFOutputOptions options = new ImGearRecPDFOutputOptions()
                {
                    VisibleImage = true,
                    VisibleText = false,
                    OptimizeForPdfa = true,
                    ImageCompression = ImGearCompressions.AUTO,
                    UseUnicodeText = false
                };
                recogPage.CreatePDFPage(outDoc, options);
            }
        }
        outDoc.SaveCompressed(sourceFile + ".result.pdf");
    }
}

For the compression type, I would recommend setting it to AUTO. AUTO will set the compression type depending on the image’s bit depth. The compression types that AUTO uses for each bit depth are: 

  • 1 Bit Per Pixel – ImGearCompressions.CCITT_G4
  • 8 Bits Per Pixel – ImGearCompressions.DEFLATE
  • 24 Bits Per Pixel – ImGearCompressions.JPEG
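
The mapping above amounts to a small lookup table. The Python sketch below only illustrates that selection logic and is not ImageGear code: the function name `pick_compression` is hypothetical, and raising an error for bit depths the list does not mention is an assumption made for this example.

```python
# Compression names exactly as the list above gives them for AUTO.
AUTO_COMPRESSION_BY_BIT_DEPTH = {
    1: "CCITT_G4",
    8: "DEFLATE",
    24: "JPEG",
}


def pick_compression(bit_depth: int) -> str:
    """Return the compression listed for the given bit depth.

    Hypothetical helper: bit depths that the list does not mention are
    rejected here rather than guessed.
    """
    if bit_depth not in AUTO_COMPRESSION_BY_BIT_DEPTH:
        raise ValueError(f"no compression listed for {bit_depth}-bit images")
    return AUTO_COMPRESSION_BY_BIT_DEPTH[bit_depth]
```

For example, `pick_compression(1)` returns `"CCITT_G4"`, which is why rasterizing a black-and-white scan at 1 bit rather than 24 bits pays off in file size.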

Disclaimer: This may not work for all PDF documents because of how some PDFs are structured. If you’re unfamiliar with how PDF content is structured, we have an explanation in our documentation. The implementation above only checks the top layer of the page content, so it will not detect images that are embedded inside containers.

However, this should work for documents created by scanners, as the scanned image should be embedded in the first PDF layer. If you have more complex documents, you could write a recursive function that goes through the layers of the PDF to find the images.
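
As a rough illustration of that recursive approach, here is a language-neutral sketch in Python. The `Image` and `Container` classes are hypothetical stand-ins for page content elements and are not ImageGear types; the fallback of 24 mirrors the default used in the code above.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Image:
    """Hypothetical stand-in for an embedded image element."""
    bit_depth: int


@dataclass
class Container:
    """Hypothetical stand-in for an element that holds nested content."""
    children: List["Element"] = field(default_factory=list)


Element = Union[Image, Container]


def highest_bit_depth(elements: List[Element], default: int = 24) -> int:
    """Return the largest image bit depth found at any nesting level."""

    def walk(items: List[Element]) -> int:
        best = 0
        for el in items:
            if isinstance(el, Image):
                best = max(best, el.bit_depth)
            elif isinstance(el, Container):
                # Descend into the container instead of skipping it.
                best = max(best, walk(el.children))
        return best

    found = walk(elements)
    # No image anywhere: fall back to the safe default.
    return found if found else default
```

With a page such as `[Container([Image(1), Container([Image(8)])])]`, the function reports 8 even though neither image sits at the top level.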

The above code will set the bit depth to 24 if it wasn’t able to detect any images in the first layer, just to be on the safe side.

On July 12, 2022, Accusoft announced the latest update to PrizmDoc, its industry-leading document processing integration. The PrizmDoc 13.21 update improves existing features and adds key functionality related to format support, redaction capabilities, content conversion, and more, allowing developers to offer enhanced functionality within their applications. 

One of the main improvements in this release is to PrizmDoc’s Content Conversion Service (CCS). PrizmDoc now provides the ability to convert PDF documents to MS Word (DOCX) documents, making shared collaboration easier than ever before.

Other features and updates in this release include: 

  • High-Efficiency Image File Format (HEIF, HEIC) support for viewing, redaction, and conversion to JPG/JPEG, PDF, PNG, SVG and TIFF. 
  • PrizmDoc Viewer Markup Burner API now provides the ability to burn in redaction reason text for transparent (draft mode) redactions and provides the ability to remove PDF AcroForm fields. 
  • Improved performance of the PAS GET MarkupLayers API when using AWS S3 storage, which significantly reduces network traffic between PAS and S3.

PrizmDoc provides customizable document processing to help developers deliver in-browser document creation, editing, and collaboration functionality that enhances their software applications.

For more information about PrizmDoc or to download a free trial, please visit our website.

About Accusoft: 

Founded in 1991, Accusoft is a software development company specializing in document processing, conversion, and automation solutions. From out-of-the-box and configurable applications to APIs built for developers, Accusoft software enables users to solve their most complex workflow challenges and gain insights from content in any format, on any device. Backed by 40 patents, the company’s flagship products, including OnTask, PrizmDoc™ Viewer, and ImageGear, are designed to improve productivity, provide actionable data, and deliver results that matter. The Accusoft team is dedicated to continuous innovation through customer-centric product development, new version releases, and a passion for understanding industry trends that drive consumer demand. Visit us at www.accusoft.com.

Implementing any technology solution within an established organization can be a monumental challenge for a developer. Doing so for a legal firm that has a strong culture and longstanding processes can be even more difficult. That’s why LegalTech developers need to take a few key factors into consideration as they work on applications for the legal industry.

Build vs. Buy

One of the first questions any firm needs to ask is whether it wants to build a specialized solution or turn to an existing LegalTech application. In many cases, this comes down to a question of resources. For larger “big law” firms or legal departments within an enterprise business, internal developers may be available to build a customized application that caters to specific organizational needs. 

If the resources and development skills are on hand, building a dedicated solution can be an effective strategy. Developers can focus narrowly on the established processes used at the firm and design technology that targets clear pain points more effectively than an “off-the-shelf” product.

More importantly, as Kelly Wehbi, Head of Product for Gravity Stack, points out, building doesn’t necessarily mean starting from nothing:

“I think a lot about how to leverage the platforms we have or could potentially purchase, but then add our own expertise and strengths on top of it. That doesn’t have to mean you have to build some entirely new interface or have to invent some new technology. It could be there’s a tool that’s out there that does exactly what you need and maybe you have to build a few customizations on top of that.”

Of course, building a solution also presents a number of challenges, especially if the project’s requirements are not well defined from the beginning. There’s a great deal of overhead involved with building new technology in terms of maintenance and ongoing support. It’s easy to fall into the trap of focusing on technology at the expense of services. But legal firms are not product companies; they need to focus instead on finding ways they can use technology to leverage their core services.

It’s that emphasis on services that drives many firms to buy the technology solutions they need rather than build them. Existing software integrations are typically better positioned to maintain security and don’t need to be maintained as extensively. Deploying proven software integrations also helps organizations maximize their on-premises resources and enhance their flexibility in the long term.

“I tend to default toward leveraging an existing platform when possible,” Wehbi admits. “Security ends up being a huge part of this and when you can leverage a company that’s solved that really well, that goes a long, long way. It offers you a bunch of options you wouldn’t have if you had to build it yourself. That’s a pretty big undertaking to start from scratch.”

Getting Buy-In for LegalTech Solutions

Once the build or buy decision is finally made, there’s still the critical matter of executing and putting the new solution into practice. Getting feedback throughout the development and integration process is important, whether it’s gathered from anecdotal observations or some form of usage analytics. 

Neeraj Rajpal, CIO at Stroock & Stroock & Lavan, finds that implementations tend to go more smoothly when the development team is able to get rapid feedback from key decision makers: “The faster you get the feedback, the faster you know you’re down the right path or not. It is very frightening when the stakeholder…looks at something and says ‘This is exactly the opposite of what I expected.’ You don’t want to be in that situation.”

Ultimately, a LegalTech application’s success depends largely upon whether or not the firm as a whole embraces it. When developers are seeking to implement a solution, they need to be especially careful to take the culture of the firm into consideration. Without buy-in at the top, it will be difficult to convince anyone in the organization to commit to change. 

“If you’re trying to solve a problem because you have a deficiency in a current business process, but you’re not willing to change the process…that’s (a) disaster,” Rajpal warns. Although LegalTech solutions are designed to enhance efficiency and reduce errors, they often require people to learn how to use them or to abandon existing technology solutions.

Take, for example, a legal firm that needs to redact documents during the discovery process. The existing process likely involves printing out documents and then laboriously redacting them by hand with marker. Once that process is finished, they are scanned and saved as image-based PDFs. An HTML5 viewer with redaction capabilities could easily streamline this process to make it faster, more flexible, and more secure. Unfortunately, if the firm’s attorneys aren’t willing to adopt the new process, all of the potential efficiency benefits go to waste.

The Importance of Communication

Communication and ongoing support are critical to ensuring a successful LegalTech implementation. Developers can begin this important conversation right from the beginning when they’re designing application features. Whether they’re building everything from scratch or turning to software integrations, they need to have honest and thorough discussions to determine what specific features are needed to support legal processes. Implementing a LegalTech solution is more likely to be successful if that solution is closely aligned with the firm’s existing needs and future goals.

But the conversation doesn’t stop once the application goes live. Ongoing support and education is often necessary to help firms adopt new technology and make the most of its potential. Developers may even need to adjust some features over time as needs change. If they utilized third party software integrations to add key functionality, they need to know they can count on those vendors to support them as the LegalTech application evolves.

Make Your LegalTech Implementation a Success with Accusoft

Accusoft’s family of software integrations allows LegalTech developers to quickly add the features their clients need to modernize workflows and improve efficiency. Whether it’s PrizmDoc’s extensive document redaction capabilities that make it easier to protect privacy during eDiscovery or the automated document assembly features of PrizmDoc, developers can lean on our 30 years of document processing expertise so they can focus on building the tools legal teams require.

As part of our ongoing work with the LegalTech industry, Accusoft recently sponsored a Law.com webinar on the subject of building vs buying technology solutions for legal firms. You can listen to some of the highlights with contributors Kelly Wehbi and Neeraj Rajpal along with host Zach Warren, editor-in-chief of LegalTech News, on the Law.com Perspectives podcast.

Document image cleanup is a vital step in building an efficient and accurate processing workflow. In a perfect world, every file an organization receives would be in pristine, high-resolution condition so it could be processed quickly and easily. Unfortunately, the reality is that documents come in all sizes, conditions, and formats. Companies can receive vital information in the form of email, traditional mail, fax, or even text. Documents scanned into a crooked, low-resolution file are just as likely to be received alongside digital versions submitted entirely through a web application.

This poses a significant challenge for software developers building the next generation of automation solutions. Without some way of cleaning up document images, companies that still rely upon manual processes will struggle to read and process files. More importantly, poor image quality interferes with optical character recognition (OCR) engine accuracy, making more human interaction necessary to verify recognition results. By integrating document image cleanup tools into their applications, developers can enhance the speed and accuracy of their automated processes and help their customers leverage the full potential of digital transformation.

7 Essential Document Image Cleanup Features Your Application Needs

A few document image cleanup tools should be considered essential for any application that has to manage multiple file formats. To see these tools in action and understand why they’re so vital, let’s take a look at how these features work in ImageGear, Accusoft’s powerful document and image processing SDK integration.

1. Despeckling

Speckles can appear on document images for a variety of reasons. In some cases, they are unwanted image noise created during the original scanning process (the classic “salt and pepper” noise), but in other instances, they’re simply the result of dust particles on the surface of a scanned document or on the scanner itself. They are frequently encountered when converting old documents into digital form. Speckling not only interferes with OCR engine performance, but can also make it difficult to maintain image fidelity when compressing or converting files. 

ImageGear can reduce or eliminate speckling as part of the document image cleanup process. There are two ways to approach speckle removal:

  • Despeckle Method: This function removes color noise from 1-bit images by taking the average color value in a square area around the speckle and replacing its pixels with that value.
  • GeomDespeckle Method: This function uses the Crimmins algorithm to send the image through a geometric filter, reducing the undesired noise while preserving edges of the original image. This process is applied only to 8-bit grayscale images.
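
To make the neighborhood idea concrete, here is a deliberately simplified Python sketch for 1-bit images. It is not ImageGear’s Despeckle or GeomDespeckle implementation: it assumes the image is a list of rows of 0/1 values, and it only treats a pixel as a speckle when it disagrees with every neighbor in the surrounding 3 x 3 square, replacing it with the rounded average of those neighbors.

```python
def despeckle(image):
    """Remove isolated single-pixel speckles from a 1-bit image.

    `image` is a list of rows, each a list of 0/1 pixel values (an
    assumption made for this sketch). The input is left unmodified.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    result = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            # Collect the neighbors in the 3 x 3 square, clipped at the edges.
            neighbors = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            ]
            # A speckle here is a pixel that disagrees with every neighbor.
            if neighbors and all(n != image[y][x] for n in neighbors):
                average = sum(neighbors) / len(neighbors)
                result[y][x] = 1 if average >= 0.5 else 0
    return result
```

Restricting the change to fully isolated pixels keeps thin strokes intact, at the cost of leaving larger specks in place; a production filter would typically let the window size and speckle size be configured.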

2. Image Inversion

With so many documents being scanned, converted, and transferred between applications, there’s a greater likelihood of something going wrong along the way. One of the most frequent problems is image inversion, which swaps pixel colors and turns a standard white background with black text into a black background with white text. This mix-up can render documents completely unreadable by OCR engines.

ImageGear can be configured to automatically recognize when image inversion is necessary. The invert method can also be used to immediately change the color of each pixel contained in the entire image, turning white to black and black to white.
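
One simple way to decide whether a 1-bit page needs inverting is to compare how many dark and light pixels it contains, since ordinary text pages are mostly background. The Python sketch below illustrates that heuristic; it is an assumption about how such a check could work, not a description of ImageGear’s detection, and it assumes 1 means white and 0 means black.

```python
def needs_inversion(image):
    """Return True when dark pixels (0) outnumber light pixels (1)."""
    total = sum(len(row) for row in image)
    dark = sum(row.count(0) for row in image)
    return total > 0 and dark > total / 2


def invert(image):
    """Return a copy of the image with every pixel flipped."""
    return [[1 - pixel for pixel in row] for row in image]
```

A caller would apply `invert` only when `needs_inversion` returns True, so pages that are already black-on-white pass through untouched.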

3. Deskewing

Skewed document images are both cumbersome to manage and challenging for OCR engines to read accurately. Unfortunately, manually scanned documents are often uneven, and the problem is only becoming worse now that many people are using their phone cameras as makeshift document scanners. That’s why the first step in the document image cleanup process is often deskewing, which rotates and aligns the images to enhance recognition accuracy.

The deskewing process often involves more than just rotating a document, especially where images taken by a digital camera are concerned. ImageGear’s 3D deskew feature uses a sophisticated algorithm to correct for perspective distortion, which can occur whenever a document is captured with a handheld camera.

4. Blank Page Detection

Many documents converted into digital format contain information on both sides. If they are fed into a scanner along with single-sided documents, the resulting file will contain multiple blank pages. This might not seem like much of a problem, but if there is enough speckling or noise around the edge of the image, an application may try to apply an OCR engine to it and generate an erroneous result. Blank page detection can quickly identify any image that is blank or mostly white and flag it for deletion.
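
A minimal version of this check measures what fraction of the page is dark. The Python sketch below assumes a 1-bit image stored as rows of 0/1 values with 0 meaning black; the function name and the 1 percent default threshold are illustrative assumptions, not ImageGear settings.

```python
def is_blank(image, max_dark_fraction=0.01):
    """Return True when the share of dark pixels is at or below the threshold."""
    total = sum(len(row) for row in image)
    if total == 0:
        # An empty image carries no content, so treat it as blank.
        return True
    dark = sum(row.count(0) for row in image)
    return dark / total <= max_dark_fraction
```

Allowing a small fraction of dark pixels means a few stray specks do not stop a page from being flagged, which is the situation the paragraph above describes.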

5. Line Removal

Although they may not seem very troublesome at first glance, lines can create a number of problems for OCR engines. When lines and printed text overlap, it can be difficult for the engine to distinguish between the two. In some instances, the engine may even misread a line as a letter or number. Removing lines from a document prior to OCR reading ensures that the remaining text will be recognized more quickly and analyzed more accurately.

ImageGear supports both solid line removal and dotted line removal. The first method automatically detects and removes any horizontal and vertical lines contained in the document (like frames or tables), while the second method determines which dotted lines to remove by measuring the number and diameter of dots.
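
The solid line case can be illustrated with a run-length check: any unbroken horizontal run of black pixels longer than ordinary text strokes is treated as a line and erased. This Python sketch is a simplified illustration under stated assumptions (1-bit rows of 0/1 values, 0 meaning black, horizontal lines only, a hypothetical `min_length` parameter) and does not reflect how ImageGear implements line removal.

```python
def remove_horizontal_lines(image, min_length):
    """Erase horizontal runs of black pixels that are at least min_length long."""
    result = [row[:] for row in image]
    for row in result:
        x = 0
        while x < len(row):
            if row[x] == 0:
                start = x
                # Advance to the end of this run of black pixels.
                while x < len(row) and row[x] == 0:
                    x += 1
                if x - start >= min_length:
                    row[start:x] = [1] * (x - start)
            else:
                x += 1
    return result
```

Vertical lines could be handled the same way by transposing the image first. Text that touches a line would still be damaged by this naive version, which is one reason dedicated tools are more involved.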

6. Border Removal

When scanned documents don’t align properly with the boundaries of the scanner or were copied onto paper that was larger than the original image at some point, the remaining space is often filled in with black. These borders are not only unsightly, but they also interfere with other document image cleanup processes. Although they can usually be cropped out easily, the cropping process alters the proportions of the image, which could create more problems later.

Removing these large black regions is easy with ImageGear’s CleanBorders option. It focuses on the areas near the edge of the page, which typically should not contain any important image data. 

7. Remove Hole Punches

Important documents were often stored in binders before they were prepared for digitization. When scanned, the blank space from the hole punch leaves a large, black dot along the edge of the document. Unfortunately, these holes sometimes overlap with text or could be picked up as filled-in bubbles by an optical mark recognition (OMR) engine.

ImageGear can identify and remove punch holes created by common hole punchers, including two, three, and five hole configurations. The RemovePunchHoles method can be adjusted to account for differing hole diameters in addition to different locations.

Unlock Your Application’s Document Image Cleanup Potential with ImageGear

Although ImageGear can perform a variety of document handling functions such as viewing, conversion, annotation, compression, and OCR processing, its document image cleanup capabilities help applications overcome key content management challenges and enhance performance in other areas. Improved document image quality allows data to be extracted more quickly, enhances the viewing experience, and reduces complications when it comes to file compression and conversion.

Learn more about the ImageGear collection of SDKs to discover how they can deliver versatile document and image processing to your applications.

FinTech companies may be on the cutting edge of software innovation, but even their most sophisticated applications need the ability to accommodate a variety of document-heavy processes used in the financial services industry. That’s why 94 percent of them leverage some form of digital document management solution, whether it’s one they built in-house (43 percent) or a platform developed by a third party provider (51 percent). By using these tools and other integrations to implement document automation across their business, FinTechs can revolutionize the way they manage the document lifecycle.

What is the Document Lifecycle?

Document lifecycle refers to the many stages a document goes through as it moves through an organization’s processes. The lifecycle usually begins with the document’s creation or entry into a system, where it’s reviewed, has its information extracted, and is then routed to a database for storage. From there, it can be retrieved and distributed until it’s finally deleted, marking the end of its lifecycle.

For FinTech organizations that prioritize efficiency and speed, automation technology allows them to streamline their document lifecycle management and eliminate tedious manual processes that make it hard for them to adapt to rapidly changing market conditions. This greatly enhances their ability to scale operations and deliver a better overall customer experience. 

4 Revolutionary Document Automation Benefits

1. Faster Data Capture

When documents and forms are submitted through a FinTech application, their information needs to be gathered and transferred to a separate system of record. In most cases, this system is a database of some sort. Once information is deposited there, it can be readily accessed by other systems whenever it’s needed. Document automation technology can deploy capabilities like optical character recognition (OCR) to read and extract text from submitted documents and forms. 

Relying on manual processes to capture data is both slow and inefficient. Human employees are limited by how many keystrokes they can enter each hour, and that’s even before considering how fatigue and distraction could impact their performance. Rather than reviewing information by hand and laboriously keying it into the database manually, FinTechs can collect more data from more places faster and put it to use right away by using automated data capture. Loan decisions, for instance, can be processed much more quickly when customers don’t have to wait for their application to be entered into the system manually.

2. Reduced Errors

Any process that’s completed manually is highly prone to human error. Even the most highly trained and experienced professionals can make mistakes when entering data into a computer system. While some errors are little more than minor inconveniences, others can lead to serious problems over time. An empirical study from 2015, for instance, found that 28 percent of participants committed at least one error during data entry, many of which could distort future data analysis. Errors can also be made when creating documents, leading to unnecessary rounds of revisions and costly delays.

By automating data capture and document generation, FinTech applications can eliminate keystroke errors and other mistakes related to fatigue, inattention, and inexperience. This translates into more accurate datasets, fewer document revisions, and less time spent tracking down and remediating errors.

3. Streamlined Contract Management

As financial organizations, FinTechs need to manage a lot of contracts. For each one, they must gather information about the parties involved, determine what language needs to go into the contract, draft the actual document, and then send it out for review and signatures. Managing that process can be a challenge without the right automation tools in place. Whether it’s a copy and paste error, a clause being left out of a contract, or a missing signature, there are many problems that could slow down the process when it’s being done manually.

Document automation technology can streamline contract management by assembling documents programmatically and routing them for review and signature. Rather than tasking someone with building a contract from scratch, software can simply be pointed in the direction of a searchable database to plug the correct information into a contract’s fields. This allows organizations to generate and share contracts much faster and minimize the number of revisions needed due to typographical errors.

4. Increased Visibility

For an organization that relies heavily upon manual processes, submitting or requesting a document can feel like casting something into a deep, dark hole. That’s because documents can easily be lost or overlooked when they’re being passed around by email and reviewed by hand. It’s hard to know exactly who is responsible for taking the next action or what steps have already been completed. Document automation platforms use a workflow structure to enhance visibility and efficiency, ensuring that nothing gets lost in the shuffle.

Search capabilities that can quickly locate documents or text also help to improve visibility within a document management system. Rather than laboriously poring through folder after folder in search of the right document, FinTech teams can save time and avoid frustration while also keeping projects on track. Better visibility also means less confusion, which helps improve version control. Since it’s easier to identify which document someone should be working on, they’re less likely to create or distribute alternate versions that may not be fully updated.

Expanding Document Automation with Accusoft Integrations

With more than 30 years of experience working with digital documents, Accusoft provides a broad range of document automation solutions that can help FinTechs improve efficiency, reduce errors, and deliver a better overall user experience. Whether you need to extract data from structured forms, view and convert multiple file types, or build a dedicated workflow solution from scratch, our collection of SDKs, APIs, and cloud solutions make it easy for FinTechs to incorporate the functionality they need without having to rethink their tech stack.

To learn more about how Accusoft integrations can revolutionize the way you manage the document lifecycle, talk to one of our solutions experts today.

Continuous innovation has allowed Accusoft to build sustained success over the course of three decades. Much of that innovation comes from talented developers creating novel solutions to everyday problems, many of which go on to become patented technologies that provide the company with an edge over competitors. 

Others, however, are the byproduct of looking at problems from a different perspective or using existing technologies in unique ways. Accusoft supports both approaches by hosting special “hackathon” events each year. These events encourage developers to spend time working on their own unique projects or try out ideas they think may have potential but have never been implemented.

For this year’s hackathon, I took a closer look at how our SmartZone SDK could be implemented as part of an automation solution within a .NET environment without creating an entire application from the ground up. What I discovered was that PowerShell modules offer a quick and easy way to deploy character recognition for limited, unique use cases.

.NET and PowerShell

One of the underestimated abilities of the .NET infrastructure is its out-of-the-box support for loading and executing assemblies from the command line using a shell module. Although there are many shell variants available, PowerShell comes preinstalled on most Windows machines and is the only tool required to make the scripts and keep them running. PowerShell also runs on Linux and macOS, which makes it a true cross-platform task automation solution for inventive developers who crave flexibility in their scripting tools.

Incorporating the best features of other popular shells, PowerShell consists of a command-line shell, a scripting language, and a configuration management framework. One of the unique features of PowerShell, however, is that unlike most shells, which can only accept and return text, it can also accept and return .NET objects. This means PowerShell modules can be used to build, test, and deploy solutions as well as manage any technology as part of an extensible automation platform.

Implementing SmartZone Character Recognition

Accusoft’s SmartZone technology allows developers to incorporate advanced zonal character recognition to capture both machine-printed and hand-printed data from document fields. It also supports full page optical character recognition (OCR) and allows developers to set confidence values to determine when manual review of recognition results is necessary.
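
The confidence idea can be pictured as a simple threshold: results at or above it are accepted automatically, and the rest are queued for a person to check. The Python sketch below illustrates only that routing logic. It does not use the SmartZone API, and the function name, the (text, confidence) tuple shape, and the 0 to 100 scale are assumptions for the example.

```python
def split_by_confidence(results, threshold):
    """Split (text, confidence) pairs into accepted and needs-review lists."""
    accepted, review = [], []
    for text, confidence in results:
        if confidence >= threshold:
            accepted.append(text)
        else:
            review.append(text)
    return accepted, review
```

For instance, with a threshold of 80, a field read as "12345" at confidence 97 is accepted while one read as "J0hn" at confidence 41 is sent for manual review.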

Implementing those features into an application through a third-party integration is the best way to incorporate recognition capabilities, but there are some use cases where they might need to be used for general tasks outside of a conventional workflow. A number of Accusoft customers, for instance, had inquired about simple ways to use some of SmartZone’s features in their existing process automation software without having to spend weeks of development time integrating those capabilities on a larger scale.

Thanks to the versatility of PowerShell, there’s no reason to build such an application from scratch. SmartZone’s zonal recognition technology can easily be incorporated into any .NET environment with just a few snippets of code. PowerShell syntax is not very difficult to understand, and Windows Notepad is enough for a quick start, but we recommend using your favorite integrated development environment (IDE) for a better experience.

Getting Started

First, you need to download SmartZoneV7.0DotNet-AnyCPU.zip from the Accusoft SmartZone download page and unpack it to any suitable directory. This bundle contains all required binaries to run SmartZone.

Create a Simple.ps1 file inside the unpacked directory and start typing your script:


using namespace System.Drawing
using namespace System.Reflection
using namespace Accusoft.SmartZoneOCRSdk

# Load assemblies. Out-Null keeps the returned Assembly object out of the script output.
Add-Type -AssemblyName System.Drawing
$szPath = Resolve-Path ".\bin\netstandard2.0\Accusoft.SmartZoneOCR.Net.dll"
[Assembly]::LoadFrom($szPath.Path) | Out-Null

# Create a SmartZone instance.
$szObj = [SmartZoneOCR]::new()
$szAssetsPath = Resolve-Path ".\bin\assets"
$szObj.OCRDataPath = $szAssetsPath.Path

# Licensing
# $szObj.Licensing.SetSolutionName("Contact Accusoft for getting the license.")
# $szObj.Licensing.SetSolutionKey(+1, 800, 875, 7009)
# $szObj.Licensing.SetOEMLicenseKey("https://www.accusoft.com/company/legal/licensing/");

# Load test image.
$bitmapPath = Resolve-Path ".\demos\images\OCR\MultiLine.bmp"
[Bitmap] $bitmap = [Image]::FromFile($bitmapPath.Path)

# Recognize the image and print the result.
$result = $szObj.Reader.AnalyzeField([Bitmap] $bitmap);
Write-Host $result.Text

# Free the resources.
$bitmap.Dispose();
$szObj.Dispose();


This simple code snippet allows you to use SmartZone together with PowerShell in task automation processes like recognizing screenshots, email attachments, and images downloaded by the web browser. It can also be deployed in other similar cases where the advantages of PowerShell modules and cmdlets can help to achieve results faster than writing an application from scratch.

Another Hackathon Success

Identifying a new way to deploy existing Accusoft solutions is one of the reasons why the hackathon event was first created. This script may not reinvent the wheel, but it will help developers save time and money in a lot of situations, which means fewer missed deadlines and faster time to market for software products. Developing unique approaches to existing problems can be difficult with deadlines and coding demands hanging over a developer’s head, so Accusoft’s hackathons are incredibly important for helping the company stay at the forefront of innovation. 

To learn more about how that innovation can help your team implement powerful new features into your applications, talk to one of our solutions experts today!

When it comes to downloading or viewing documents over the internet, PDFs have long served as a de facto standard for most organizations. Since PDFs are not a proprietary file format, there’s rarely any risk that someone will be unable to open them. However, just because PDFs have become so commonplace doesn’t mean that they all share the same characteristics. For anyone who has ever wondered why some PDFs seem to take so much longer to load than others, the answer often has less to do with connection and processing speeds than with the way the PDF’s content is organized.

More specifically, it’s a matter of whether or not the document is a linearized PDF.

What Is a Linearized PDF?

Sometimes called “fast web view,” linearization is a special way of saving a PDF file that organizes its internal components to make them easier to read when the file is streamed over a network connection. While a standard, non-linearized PDF stores information associated with each page across the entire file, linearized PDFs use an object tree format to consolidate page elements on an ordered, page-by-page basis. When a reader opens a linearized PDF, then, all of the information needed to render the first page is readily available, allowing it to load the page quickly without having to search the entire document for a specific object like an embedded font.

Originally introduced with the PDF 1.2 standard in 1996, linearized PDFs were critical to the format’s early internet success. In order to view a non-linearized PDF, the entire document needs to be downloaded or read via HTTP request-response transactions. Given the bandwidth limitations of early internet connections (often still between 28.8 kbps and 33.6 kbps in 1996), this created a serious bottleneck when it came to document viewing. While it was possible to view a document without downloading it, the multiple HTTP requests needed to do so could easily be disrupted if the connection was lost, something that was all too common in the days before reliable broadband connections were introduced.

Non-Linearized vs Linearized PDFs

To visualize the difference between a non-linearized PDF and a linearized PDF, imagine two separate people sitting down to file their business taxes. One person has all of their receipts, invoices, and financial documents scattered across their office, with some stacked in unordered piles, others crammed into unlabeled folders, and even more stuffed into assorted drawers and file cabinets. Finding and organizing all of this documentation would take almost as much time as actually filing the taxes themselves! The second person, however, has all of the records they need stored in a neatly labeled file cabinet, allowing them to retrieve everything quickly and easily.

The first example is similar to a non-linearized PDF, while the second shows how much easier it is for a reader to access the information it needs to render the file. Even better, since each page is organized in the same way, jumping to a different page in a multi-page PDF doesn’t require the reader to reload the entire file. It can simply read the current page and get everything necessary to display the PDF correctly.

Why Linearized PDFs Are Still Valuable

In a world dominated by high speed internet connections, it’s fair to wonder whether or not PDF linearization is still necessary. For small PDFs that are only a few pages, linearization may not be essential, but when it comes to larger documents, linearization can still deliver substantial performance and user experience benefits.

Consider, for instance, a document that consists of several hundred, or even several thousand, pages. Loading that entire document and keeping it cached may be possible, but it’s an inefficient use of processing and bandwidth resources. With a linearized PDF, a reader typically encounters a linearization dictionary and hint tables at the top of the document, which provide it with instructions on where to locate any necessary resources within the file. After loading the hint tables and the first page, the reader stops the download process rather than opening the entire file. When the user navigates to another page, the reader can quickly reference the hint tables and jump to that page.
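To make the idea concrete, here is a minimal Python sketch of how a reader might look for the linearization dictionary. It relies on the PDF specification's rule that this dictionary sits within the first 1024 bytes of a linearized file. The function names and the hand-built sample bytes are hypothetical, and the regex scan is a simplification rather than a real PDF parser.

```python
import re

# Integer entries of the linearization dictionary:
#   /L file length, /O object number of the first page, /E offset of the end
#   of the first page, /N page count, /T offset of the main cross-reference table.
_INT_ENTRY = re.compile(rb"/(L|O|E|N|T)\s+(\d+)")


def read_linearization_params(pdf_bytes):
    """Return the integer linearization entries, or None if the file is not linearized."""
    head = pdf_bytes[:1024]  # the dictionary must sit inside this window
    if b"/Linearized" not in head:
        return None
    return {key.decode(): int(value) for key, value in _INT_ENTRY.findall(head)}


def first_page_range_header(params):
    """Build an HTTP Range header value covering everything up to the end of page one."""
    return "bytes=0-{}".format(params["E"] - 1)


# Hand-built header for illustration only (not a complete, valid PDF).
sample = (
    b"%PDF-1.7\n1 0 obj\n"
    b"<< /Linearized 1 /L 5000 /H [ 600 200 ] /O 4 /E 2400 /N 12 /T 4700 >>\n"
    b"endobj\n"
)

params = read_linearization_params(sample)
print(params)
print(first_page_range_header(params))
```

A viewer that found these entries could request only the first 2,400 bytes of this hypothetical file to draw page one, then consult the hint tables (located through the /H entry) when the user jumps elsewhere.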

This ensures that the reader is only ever loading the pages that actually need to be displayed, which helps to conserve memory, processing resources, and bandwidth. For mobile devices with limited file and cache storage, linearized PDFs are much easier to manage than their non-linearized counterparts. They also provide some protection against network interruptions, which could make it difficult to download and view an entire document.

How to Linearize PDFs

Although the linearization process is well laid out in the current PDF standards documentation, many PDFs are created using software that doesn’t automatically linearize the content. More importantly, some linearized PDFs are “broken” by a process called incremental saving, which appends minor updates to the end of the file rather than changing the existing structure. Over time, too much incremental saving can undermine the effectiveness of a linearized PDF.
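One way to spot a linearized file that has been incrementally saved is to compare the /L entry of its linearization dictionary, which records the file length at the time of linearization, with the file's actual length. The PDF specification treats a mismatch as a sign that the linearization information can no longer be trusted. The Python sketch below assumes that simple check is enough for illustration; the helper names and sample bytes are hypothetical.

```python
import re


def linearization_is_intact(pdf_bytes):
    """True only if the file declares /Linearized and its /L entry matches the real length."""
    head = pdf_bytes[:1024]
    if b"/Linearized" not in head:
        return False
    match = re.search(rb"/L\s+(\d+)", head)
    return match is not None and int(match.group(1)) == len(pdf_bytes)


def make_sample(total_len):
    """Build placeholder bytes whose /L entry equals their own length (not a valid PDF)."""
    header = (
        b"%PDF-1.7\n1 0 obj\n<< /Linearized 1 /L "
        + str(total_len).encode()
        + b" /N 1 >>\nendobj\n"
    )
    return header + b" " * (total_len - len(header))


original = make_sample(200)
# An incremental save appends data after the original end of the file.
updated = original + b"\n% appended incremental update\n%%EOF\n"

print(linearization_is_intact(original))
print(linearization_is_intact(updated))
```

When the check fails, saving a fresh linearized copy restores the fast first-page behavior.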

The best way to resolve such problems and linearize the PDF is to save a new, linearized version of the file using PDF editing and conversion tools.

Take Control of PDFs with PrizmDoc

Accusoft’s PrizmDoc provides a broad range of document functionality that allows applications to more effectively create, convert, and compress PDF files.

For a closer look at PrizmDoc and to see its powerful document processing capabilities in action, download a free trial today.

Redacting documents is critically important for legal departments and government agencies. By removing sensitive information from a digital file before sharing it publicly, it’s possible to protect private data or classified materials from being exposed. 

In the days before digital documents, redaction involved a simple, if crude, process of covering text with a black marker. Since redactions were done by hand, it was easy for mistakes to be made, which could range from using insufficiently dark ink to leaving portions of text exposed. The development of high-powered photo enhancement has rendered this approach all but useless, as even inexpensive image processing technology can distinguish blacked-out text.

With the transition to digital documents, organizations finally have access to true redaction capabilities. Unfortunately, they still tend to make mistakes when it comes to flattened PDFs that could leave redacted content exposed and vulnerable.

What Is a Flattened PDF?

A modern PDF file consists of multiple layers, each of which can contain separate elements. One layer might feature text, another an image, and yet another a fillable form. The flattening process removes all interactive elements from form fields and combines all of the document’s elements into a single layer.

Organizations frequently use this process to “lock in” form content to prevent anyone from altering the information after a user completes the forms. It also removes elements like dropdown selections within form fields and can burn in other annotations or markups, making them a permanently visible element of the document.

Flattened PDF Redactions

Unfortunately, simply flattening a PDF is usually not sufficient to securely redact a document. That’s because obscured elements are still present in the document; they’re just not visible when the file is viewed or printed.

Recovering improperly redacted content is actually quite trivial in many cases. Two of the most infamous recent examples include information released during the investigation of political campaign chairman Paul Manafort in 2019 and court documents related to Facebook’s use of personal data in 2017. In both cases, journalists were able to copy redacted text from PDF files and paste it into a text editor to reveal the obscured content.

There are typically two ways that improper redactions occur:

  1. Covering Text with Boxes: This frequent mistake occurs when people try to treat a digital document like a physical piece of paper. They place annotations over the sensitive content, usually in the form of a black box, and then save a flattened version of the PDF thinking that no one will be able to separate the text from the annotation element. As the Manafort and Facebook cases demonstrate, however, getting around these “redactions” is usually quite easy.
  2. Changing the Color of Text: Another common redaction error involves altering the color of the sensitive text to match the document background. Changing the text color to white, for instance, might make it invisible to the human eye, but it does nothing to alter the content itself. The text can be made visible again by using the copy/paste trick described above or by altering the background characteristics in another program. 
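The first mistake can be illustrated with a short Python sketch. It uses a hand-written, simplified page content stream in which text is drawn and a black rectangle is then painted over it. The stream and the function name are hypothetical, and real PDFs usually compress their streams and rely on font encodings, so this shows the principle rather than a working extractor.

```python
import re

# Simplified content stream: the text is shown first, then a black box is painted on top.
content_stream = b"""
BT /F1 12 Tf 72 700 Td (Account holder: Jane Doe) Tj ET
0 0 0 rg
160 695 60 14 re f
"""


def extract_shown_text(stream):
    """Collect the literal strings passed to the Tj (show text) operator."""
    pattern = rb"\(((?:[^()\\]|\\.)*)\)\s*Tj"
    return [raw.decode("latin-1") for raw in re.findall(pattern, stream)]


# The rectangle only changes what is painted; the text operator is still in the file.
print(extract_shown_text(content_stream))  # ['Account holder: Jane Doe']
```

The same reasoning applies to white-on-white text: changing the fill color before the text operator alters how the string is painted, not whether it is stored.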

The only way to make these methods viable for true redactions would be to actually print the documents with the content hidden and then scan them back into digital form, where OCR could be used to reconstruct a new file. But even in this case, there’s a chance that a powerful OCR engine might be able to pick up the hidden elements.

Using Proper Redaction Prior to Flattening with PrizmDoc Viewer

In order to redact documents securely, applications need to have access to specialized redaction tools that are capable of actually removing content from the document itself before applying redaction indicators. PrizmDoc Viewer’s redaction API can find and extract key text while also providing single or multiple reasons for the removal. 

This not only allows organizations to redact documents quickly, but it also ensures that the redacted information won’t be exposed later because it no longer even exists within the document. More importantly, the output document is entirely new, so there is no deleted information to recover.

While most people are familiar with the distinctive black bars that indicate redacted content, even this leaves behind significant context clues that could provide hints of what was removed. Consider, for instance, a document involving multiple parties where the names of conversation participants have been redacted.

The following illustrates the problem:

[Image: PDF Redaction]

If the participants’ names differ noticeably in length, the size of each black bar hints at which name it covers. The length of the redaction, then, would at least indicate when the redaction did not involve one person or the other. There are also many instances involving government documents where the length of the redacted information in classified material might suggest its relevance or importance.

When it comes to GovTech applications that need to remove large portions of information for security reasons, it often helps to perform redaction BEFORE turning a document into a flattened PDF. The PrizmDoc Viewer redaction API can be used to quickly extract text from a document and then redact it as a plain text file.

Unlike a static PDF document, plain text accounts for width variations, so all redactions can be replaced with a standardized <Text Redacted> marker that makes it impossible to know the length of the redacted content. The text could then be converted into a PDF after the redaction process is complete.
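As a rough illustration of the fixed-width marker idea, the Python sketch below replaces every sensitive term in already-extracted plain text with the same marker. It does not use the PrizmDoc Viewer API; the function name, the sample sentence, and the list of terms are hypothetical.

```python
import re

MARKER = "<Text Redacted>"


def redact_plain_text(text, sensitive_terms):
    """Replace each sensitive term with one fixed marker so its length is not revealed."""
    if not sensitive_terms:
        return text
    # Try longer terms first so a short term never splits a longer one.
    ordered = sorted(sensitive_terms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(term) for term in ordered))
    return pattern.sub(MARKER, text)


original = "Bob told Alexandria that Bob would call."
redacted = redact_plain_text(original, ["Bob", "Alexandria"])
print(redacted)
```

Because a three-letter name and a ten-letter name both collapse to the same marker, the output gives no hint about which participant was removed from each position.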

Take Control of PDFs with PrizmDoc Viewer

As a fully-featured HTML5 viewer, Accusoft’s PrizmDoc Viewer delivers powerful viewing, annotation, and conversion functionality to your web application. It provides a broad range of redaction capabilities that allow legal, financial, and government organizations to keep their sensitive data secure and protect their customers. 

By integrating these complex features into your applications, you can focus your development efforts on building the tools that set your solution apart from the competition while our proven technology powers your customers’ viewing and redaction needs. To learn more about PrizmDoc Viewer’s powerful capabilities, download a free trial and test how it can support and enhance your application.