Technical FAQs

Question

When should I apply image cleanup operations on my document images?

Answer

There are a number of cleanup operations that you can use to make an image more suitable for a particular application. What matters most is what you observe in the image and how it affects your project. For example, if you notice many random specks on your image and you plan to use OCR, you may want to try a despeckle or blob removal operation first. If the content in your image looks slanted, you could try a deskew or rotate operation. In some cases, a line removal operation can also help on forms that have grid fields. The amount of image cleaning you need can vary from project to project. No single cleanup operation will work for every image. Instead, observe the nature of the noise and interference in your images to determine which operations and parameters give the best results.
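To make the despeckle idea concrete, here is a minimal Python sketch. It is an illustration only, not the cleanup operation of any particular SDK. It assumes a binary image stored as a list of rows (1 = black, 0 = white), and the function name `despeckle` and the `max_neighbors` parameter are hypothetical.

```python
def despeckle(image, max_neighbors=0):
    """Return a copy of a binary image with isolated black pixels removed.

    Assumption: image is a list of equal-length rows, 1 = black, 0 = white.
    A black pixel is treated as a speck when it has at most max_neighbors
    black pixels among its eight surrounding pixels.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1:
                continue
            neighbors = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width:
                        neighbors += image[ny][nx]
            if neighbors <= max_neighbors:
                cleaned[y][x] = 0  # erase the speck
    return cleaned


page = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],  # isolated speck
    [0, 0, 0, 1, 1],  # part of a real stroke
    [0, 0, 0, 1, 1],
]
cleaned = despeckle(page)
```

Raising `max_neighbors` removes larger clusters but also starts to eat thin strokes, which is why the right setting depends on the noise you actually see in your images.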


An initial query for readers out there. What is this text below? Did my cat walk across the keyboard as I was typing this blog article? Is this simply modem line noise?

Many, I’m sure, will recognize this text block as a regex, specifically a regex that validates whether or not a particular block of text is a valid email. For its part, it does a fantastic job, but clearly a non-trivial amount of work was put into the construction of this regex and many other regexes like it.


(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

(Source)

Whoever wrote it had a clear understanding of RFC 5322 and the intricacies therein. If I were to write my own email regex validator, it would likely be far too restrictive, and there are a host of potential problems and pitfalls that I would probably fall into. There is a great deal of domain-specific knowledge that goes into developing these, and many developers can run afoul of problems introduced by the potentially high complexity involved.

Also, this regex is an example of a strict classifier. It represents a boolean way of separating whether or not a particular string of text is in one class or another, specifically the set of: {is_a_valid_email_address, is_not_a_valid_email_address}.
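As a small illustration of what "strict" means here, the Python sketch below wraps a regex in a function that can only answer True or False. The pattern is a hypothetical, deliberately simplified stand-in and not the RFC 5322 regex quoted above. The names `SIMPLE_EMAIL` and `is_valid_email` are made up for this example.

```python
import re

# Hypothetical, simplified pattern for illustration only.
# It is far more restrictive than the RFC 5322 regex discussed in the text.
SIMPLE_EMAIL = re.compile(
    r"[a-z0-9._%+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)+",
    re.IGNORECASE,
)


def is_valid_email(text: str) -> bool:
    """Strict classifier: the whole string either matches or it does not."""
    return SIMPLE_EMAIL.fullmatch(text) is not None


results = {
    candidate: is_valid_email(candidate)
    for candidate in ("jane.doe@example.com", "jane.doe@example", "not an email")
}
```

There is no middle ground in the output: each input lands in exactly one of the two classes.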

Strictly binary classifiers (a hard true or false) are very useful for validation tasks, but what I’m interested in investigating is the shift to “fuzzier” classifiers: those that seek to ask, under ambiguous circumstances, “How likely is this text to be an email? How likely is this picture to be a dog? Where in this image is a barcode?” In cases like this, strict classifiers are not the tool we want to work with.
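By contrast, a fuzzy classifier returns a degree of confidence rather than a verdict. The Python sketch below is only a toy: the function name `email_likelihood` and the hand-picked weights are assumptions made for illustration, whereas a real system would typically learn its scoring from data.

```python
def email_likelihood(text: str) -> float:
    """Return a rough score in [0, 1] for how email-like the text looks.

    The weights are arbitrary, hand-picked values for illustration only.
    """
    candidate = text.strip()
    points = 0
    if candidate.count("@") == 1:
        points += 5  # exactly one @ sign is the strongest hint
        local, domain = candidate.split("@")
        if local:
            points += 1  # something appears before the @
        if "." in domain.strip("."):
            points += 3  # the domain has a dot between two labels
    if candidate and " " not in candidate:
        points += 1  # addresses rarely contain spaces
    return points / 10


scores = {
    sample: email_likelihood(sample)
    for sample in ("jane.doe@example.com", "jane doe@example.com", "jane@example", "hello world")
}
```

A caller can then pick its own threshold, for example treating anything above 0.8 as "probably an email" and sending borderline scores to a human for review.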

The question I seek to answer is: How has the industry previously solved these questions, and how is this changing?

 

How the Industry Used to Do Things

Most of the products I work on have to do with, broadly speaking, image recognition and detection. Let’s begin there. I’ll start with an example that’s near and dear to my heart – barcodes.

highlighted QR code

I’m sure most people out there have seen these before. QR codes are two-dimensional barcodes that were invented in 1994 by Denso Wave, a Japanese automotive components manufacturer. They’ve since exploded in popularity, and you’ll see them all over the place: soda cans, fliers, magazine articles, etc. You can scan them with your cell phone, and an app might take you to a website or show some metadata for the QR code.

What our API needs to do is find instances of QR codes in an image, whether it be a fax, scanned document, or photo, and it needs to do it quickly and accurately. Now, as a software developer, this has presented some particular challenges over the years. How might we identify areas of an image that contain QR codes?

The biggest and most obvious features we can see are the concentric squares of the three position patterns, so let’s focus on these and do some free thinking on how to find them. We might do some connected component analysis, or perhaps do some run-length calculations to see if we can find instances of the pattern’s 1:1:3:1:1 ratio of black and white runs.
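The run-length idea can be sketched in a few lines of Python. This is a simplified illustration and not the algorithm used in any shipping product. It assumes a single row of already-binarized pixels (1 = black, 0 = white), and the names `run_lengths`, `find_finder_candidates`, and `tolerance` are hypothetical.

```python
def run_lengths(row):
    """Collapse a row of pixels into [value, length] runs."""
    runs = []
    for pixel in row:
        if runs and runs[-1][0] == pixel:
            runs[-1][1] += 1
        else:
            runs.append([pixel, 1])
    return runs


def find_finder_candidates(row, tolerance=0.5):
    """Return start offsets where five runs approximate 1:1:3:1:1.

    The five runs must begin with black. Runs alternate by construction,
    so the window is black, white, black, white, black.
    """
    expected = (1, 1, 3, 1, 1)
    runs = run_lengths(row)
    candidates = []
    position = 0
    for i, (value, length) in enumerate(runs):
        window = runs[i:i + 5]
        if value == 1 and len(window) == 5:
            lengths = [run[1] for run in window]
            module = sum(lengths) / 7.0  # the pattern spans 7 modules
            if all(abs(n - e * module) <= tolerance * module
                   for n, e in zip(lengths, expected)):
                candidates.append(position)
        position += length
    return candidates


# A scan line crossing a position pattern drawn at 2 pixels per module.
scan_line = [0] * 4 + [1] * 2 + [0] * 2 + [1] * 6 + [0] * 2 + [1] * 2 + [0] * 4
hits = find_finder_candidates(scan_line)
```

A real detector would repeat this over many rows and columns and then cross-check the hits, since a single scan line can match the ratio by coincidence.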

We might also decide to run an edge-detection filter on the image to find the lines of the pattern. If we look at enough images of QR codes, we’d note that the ratio of white to black blocks tends toward 1:1, and we could use that as a heuristic to guide our generalized search of the image.
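The 1:1 heuristic is easy to prototype. The Python sketch below, again an illustration under assumptions rather than a production technique, measures the fraction of black pixels in a square window of a binary image (1 = black, 0 = white). The names `black_fraction`, `looks_balanced`, and `slack` are hypothetical.

```python
def black_fraction(image, top, left, size):
    """Fraction of black pixels inside a size x size window."""
    black = sum(
        image[y][x]
        for y in range(top, top + size)
        for x in range(left, left + size)
    )
    return black / float(size * size)


def looks_balanced(image, top, left, size, slack=0.15):
    """True when the window is roughly half black and half white."""
    return abs(black_fraction(image, top, left, size) - 0.5) <= slack


checkerboard = [[(x + y) % 2 for x in range(4)] for y in range(4)]
blank_page = [[0] * 4 for _ in range(4)]

balanced = looks_balanced(checkerboard, 0, 0, 4)
empty = looks_balanced(blank_page, 0, 0, 4)
```

On its own this test is weak, because plenty of non-barcode regions are half black. It is more useful as a cheap filter that decides where the expensive checks should run.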

All of these methods have varying degrees of difficulty in implementation and high complexity. These approaches, and those that we have used in our software, have taken years to develop and are highly specialized. I, myself, have been working on them for over 10 years now. They’ve been written with an intimate understanding of the various specifications of the barcodes we read.

Now let’s throw another wrinkle in things. Let’s imagine you’ve implemented the algorithms above based on what the QR code specification says, and now you run it on data from actual customers.

blurry QR code example

When dealing with the real world, your expectations can be thrown awry. You’ll often go in expecting sane inputs, but what you can end up with are blurry, perspective-warped, noisy messes. Now you have to bring even more advanced concepts to the party: things like despeckling algorithms, 2D homographies, etc.

This is how the software industry has largely done things in the past across many different contexts (e.g., facial recognition technology). It sounds pretty grim: to handle that long tail of hard-to-read images, you need exceedingly intricate and complex domain-specific algorithms.

Thankfully, this has been changing lately. Now, companies have begun to leverage their large amounts of data, often supplied by customers or synthetically generated, to create general-purpose algorithms to solve their needs.

Discover how machine learning can help your business in the rest of my article here.

Organized each year by ALM, LegalTech is one of the most important events for the legal industry. The conference brings together a broad variety of experienced legal professionals and innovative LegalTech providers to highlight the business, regulatory, technology, and talent trends in the market. In previous years, LegalTech was held in New York City and attended by more than 8,000 people.

LegalTech 2021 Is Now Legalweek(year)

This year, however, the COVID-19 pandemic has forced the organizers to take a different approach. The first decision involved shifting LegalTech from an in-person conference to a fully virtual event in order to protect the health of both attendees and organizers. While many industry events have made a similar transition, the LegalTech team went a step further by breaking the conference into a series of five interactive virtual events held over the course of 2021. This new virtual series was dubbed Legalweek(year) and aims to provide legal professionals with a powerful resource for working through an unprecedented era.

“This decision was made to address the needs of our legal community during these trying times of COVID-19 and to provide the type of innovative education, solutions, and connections that is so crucial to legal leaders,” said ALM’s Mark Fried. “The 2021 series will set the stage for a resurgence in the legal sector and a big ‘Welcome Back’ to attendees for our in-person Legalweek event (in 2022).”

The first virtual Legalweek(year) event is scheduled for February 2-4, 2021 and will feature bestselling author and political leader Stacey Abrams, legal AI expert Joshua Walker, and former New Jersey governor and federal prosecutor Chris Christie as keynote speakers. Attendees will not only be able to participate remotely, but they will also have an additional six months’ worth of on-demand access to virtual content following each event.

Visit the Accusoft Legalweek(year) Virtual Booth

As a longtime sponsor of LegalTech, Accusoft is proud to participate in this groundbreaking series of virtual events. The conference has historically been a great opportunity for us to speak directly with the independent software vendors and legal IT professionals about the latest industry trends and LegalTech applications. 

This year, we’ll be hosting a “virtual booth” through the Legalweek(year) event site. Whether you’re a developer looking to solve a particular software challenge or a project manager building an in-house solution for your firm, you’ll find plenty of resources and support at the Accusoft booth. Read through our numerous case studies and LegalTech whitepapers or schedule a meeting with one of our product specialists to learn more about our SDK and API integrations for legal software. You can even chat with someone in real time if you need a quick answer!

After completing registration, Legalweek(year) attendees can access the Accusoft virtual booth during the event simply by logging into their account.

Our LegalTech Solutions

Accusoft’s combination of content processing and conversion integrations helps today’s innovative LegalTech applications reach their full potential. As law firms and legal departments incorporate more technology into their everyday operations, they need software tools capable of automating workflows, simplifying eDiscovery, and facilitating secure collaboration.

PrizmDoc Viewer

Our feature-rich HTML5 document viewer allows users to seamlessly view a variety of document and image files within their secure web application. Thanks to PrizmDoc Viewer’s powerful REST APIs, developers can provide additional functionality, such as annotations and redactions, that is essential for legal organizations.

PrizmDoc Editor

In addition to allowing users to edit DOCX files within the secure confines of their LegalTech applications, PrizmDoc Editor’s automated document assembly features streamline the contract creation process to improve efficiency and accuracy. Documents can be assembled programmatically, incorporating commonly used or specific clauses, special language, and client data to eliminate “cut and paste” errors. Once documents are assembled, PrizmDoc Editor’s sharing tools allow firms to control access and ensure that everyone is working from the same up-to-date version.

ImageGear

With the ability to read, convert, and compress a wide range of files, our ImageGear SDK integration provides LegalTech applications with the tools they need to manage almost any type of file collected during the eDiscovery process. Powerful optical character recognition (OCR) capabilities allow ImageGear to read a wide variety of languages from around the world and convert scanned documents into searchable plain text or PDF files.

LegalTech in 2021 and Beyond

As legal organizations continue to make strides toward achieving true digital transformation, they will need versatile LegalTech applications capable of adapting along with them. Accusoft’s family of SDK and API integrations can help developers leverage the power of their innovative software tools and free up resources to focus on improving their core capabilities.

We hope you’ll join us at Legalweek(year) on February 2-4, 2021. Our booth will be available throughout the virtual event, so stop by to find out how Accusoft can help you realize the potential of your LegalTech applications.

scalable vector graphics

The scalable vector graphic (SVG) format continues to enjoy steady adoption across the web. According to data from W3Techs, SVG now accounts for 25 percent of website images worldwide. But it wasn’t always this way. In 1998, it became apparent that vector-based graphics had a future on the web, and the W3C received six different file format submissions from technology companies that year. Some were mere proposals ready for a complete revamp, while others were proprietary products that W3C wasn’t permitted to modify. Instead of forging a format from one of the submissions, however, W3C’s SVG working group decided to start from the ground up — and SVG was born.

While the file format had lofty ambitions, focusing on common use rather than specific syntax, the original iteration was cumbersome and complex. However, SVG has improved year after year. With increased support came more streamlined functionality and usable features. Now, SVG is often the first choice for meeting the evolving demands of scalable, responsive, and accessible web content.


What is a Scalable Vector Graphic (SVG) and how does it work?

Today, SVG is the de facto standard for vector-based browser graphics. But what exactly is this file format, and how does it work?

Based on XML, SVG supports three broad types of objects: 

  • Vector graphics including paths and outlines that are both straight and curved
  • Bitmap images such as .jpeg, .gif, and .png
  • Text

What sets SVG apart from bitmap-based images is the use of lines and curves along the edges of graphical objects. Because bitmap images use a fixed set of pixels, scaling them up creates blurriness where the edges of pixels meet. In the case of vector images, meanwhile, a fixed-shape approach allows the preservation of smooth lines and curves no matter the image size.
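A short Python sketch shows why scaling is cheap for vector images. It builds the same tiny SVG document at two sizes using only the standard library. The function name `make_badge` and the shapes inside it are invented for this example; the point is that the path data and `viewBox` stay identical and only the requested display size changes.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def make_badge(width_px: int) -> str:
    """Build a small SVG document rendered at the requested pixel width."""
    ET.register_namespace("", SVG_NS)
    svg = ET.Element(f"{{{SVG_NS}}}svg", {
        "viewBox": "0 0 100 50",        # the drawing's own coordinate system
        "width": str(width_px),          # display size only
        "height": str(width_px // 2),
    })
    # A curved path, described as geometry rather than pixels.
    ET.SubElement(svg, f"{{{SVG_NS}}}path", {
        "d": "M 10 40 Q 50 0 90 40",
        "fill": "none",
        "stroke": "black",
    })
    label = ET.SubElement(svg, f"{{{SVG_NS}}}text", {
        "x": "50", "y": "46", "text-anchor": "middle", "font-size": "8",
    })
    label.text = "Hello SVG"
    return ET.tostring(svg, encoding="unicode")


small = make_badge(100)
large = make_badge(1000)
```

Both strings describe the same curve. A renderer recomputes the smooth line for whatever size is requested, whereas a bitmap would have to stretch a fixed grid of pixels.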

SVG also offers the benefit of interoperability. Because it’s a W3C open standard, SVG plays well with both other image formats and web technologies including JavaScript, the DOM, CSS, and HTML. This allows the format to easily support responsive design approaches that scale websites and web content based on the user device rather than defining standardized size parameters. Thanks to the curves and lines of SVG, scaling presents no problem for responsive designers looking to ensure consistency across device types.


The Benefits of SVG

While scalability is often cited as the biggest benefit of SVG, this format also offers other advantages, including:

  • Responsiveness: Images can be easily scaled up or down and modified as necessary to meet web design and development demands.
  • Accessibility: Since SVG is text-based, content can be indexed and searched, allowing both users and developers to quickly find what they’re looking for.
  • Performance: Image rendering is quick and doesn’t require substantive resources, allowing sites to load quickly and completely.
  • Use in Web Applications: Browser incompatibilities and missing functions often frustrate web design efforts, forcing developers to use multiple tool sets and spend time checking content and images for potential format conflicts. SVG, meanwhile, offers powerful scripting and event support, in turn allowing developers to leverage it as a platform for both graphically rich applications and user interfaces. The result? Better-looking sites that enhance the overall user experience.
  • Interoperability: Because SVG is based on W3C standards, the format is entirely interoperable, meaning developers aren’t tied to any specific implementation, vendor, or authoring tool. From building their own framework from the ground up to leveraging third-party SVG applications, web developers can find the approach that fits them best.
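The accessibility point in the list above can be demonstrated with the Python standard library. Because SVG is XML, the text inside a graphic can be searched like any other document. The sketch below is illustrative only; the function name `svg_text_matches` and the sample markup are assumptions.

```python
import xml.etree.ElementTree as ET

SVG_TEXT = "{http://www.w3.org/2000/svg}text"


def svg_text_matches(svg_source: str, query: str):
    """Return the content of every <text> element containing the query."""
    root = ET.fromstring(svg_source)
    hits = []
    for node in root.iter(SVG_TEXT):
        # itertext() also collects text nested in child elements such as <tspan>.
        content = "".join(node.itertext())
        if query.lower() in content.lower():
            hits.append(content)
    return hits


sample = """<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 200 60">
  <text x="10" y="20">Quarterly revenue</text>
  <text x="10" y="40">Operating <tspan>costs</tspan></text>
</svg>"""

found = svg_text_matches(sample, "costs")
```

The same search against a bitmap of this chart would require OCR first, since the words would exist only as pixels.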

SVG in PrizmDoc Viewer

Accusoft’s PrizmDoc Viewer offers multiple ways for developers to make the most of SVG elements at scale, such as:

  • File Transformation: Conversion is critical for effective and efficient web design. If development teams need different file transformation tools for every format, the timeline for web projects expands significantly. PrizmDoc Viewer streamlines this process with support for the conversion of more than 100 file types (including PDFs, Microsoft Office files, HTML, EML, rich text, and images) into browser-compliant SVG output. In practice, this permits near-native document and image rendering that’s not only fast, but also accessible anytime, anywhere, and from any device.
  • HTML5 Functionality: Using SVG in PrizmDoc Viewer is made easier thanks to native HTML5 design. The use of an HTML5-native framework not only improves load times with smaller document sizes but also means that PrizmDoc Viewer works in all modern web browsers, while dramatically enhancing document display quality.
  • Pre-Conversion: One of the biggest challenges with viewing large documents in a browser is delay. Pages toward the end of the document may take longer to load and frustrate users looking to quickly find a specific image or piece of information. PrizmDoc Viewer solves this problem with a pre-conversion API that returns the first page as an SVG while the rest of the document is being converted, allowing users to interact with documents as conversion takes place and lowering the chance that files will experience format-based delays.

SVG hasn’t always been the go-to web image format. Despite a promising start based on open, interoperable standards, the lack of early support and specific use cases for vector-based file formats saw SVG sitting on the sidelines for decades. 

The advent of on-demand access requirements and mobile-first development realities has changed the conversation. SVG is now continuously gaining ground as companies see the benefit in this scalable, streamlined, and superior-quality file format. Get the big picture and see SVG in action with our online document viewing demo, or start a free PrizmDoc Viewer trial today!

According to reports, 1 in 4 law firms are victims of a data breach. Chilling statistics, particularly when you consider the sheer breadth of discovery data that firms possess—trade secrets, private client information, undisclosed corporate mergers—all of which makes them a highly attractive target to cybercriminals.

An article from Legaltech News discusses the evolution of cybersecurity threats to the legal industry in recent years. “There was a time when cyberattacks in the legal industry could be thought of merely as a consequence of law firms representing or taking on the powerful, connected, or controversial; Fast-forward a few years, and cyberattacks start to look less like case-specific spectacles… and more like a daily assault by burglars and common criminals.”



And hackers aren’t only targeting biglaw; with the migration of eDiscovery data to the cloud in favor of paperless offices, every firm is at high risk for becoming the next victim.

Legal firms do not have to be at the mercy of malicious hackers, and they certainly should not sit idly without taking action. The increasing number of cyber threats has prompted many firms to take the initial steps to safeguard against future attacks. Here are some strategies that firms can implement to help govern their data in the cloud:

 

Lock Down Documents with Digital Rights Management

You may think that sharing digital files via email or Dropbox is fairly secure, but those programs cannot guarantee that your files won’t end up in unauthorized hands. Legal firms that are managing their discovery data in the cloud should take document security a step further and look for tools that provide built-in digital rights management (DRM). DRM security controls let you set user permissions at the document level for viewing, printing, editing, and downloading files to ensure that during case review the only users accessing your proprietary documents are the ones that have permission to do so. With DRM, individual permissions can be turned on/off at any time or revoked altogether when there is no longer the need for a user to access files.
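To picture how document-level permissions behave, here is a minimal Python sketch of a per-document access list. It is a conceptual model only and does not reflect how any particular DRM product is implemented. The names `Permission` and `DocumentAcl` and the methods on them are hypothetical.

```python
from enum import Flag, auto


class Permission(Flag):
    NONE = 0
    VIEW = auto()
    PRINT = auto()
    EDIT = auto()
    DOWNLOAD = auto()


class DocumentAcl:
    """Tracks which users hold which permissions on one document."""

    def __init__(self):
        self._grants = {}

    def grant(self, user, permission):
        current = self._grants.get(user, Permission.NONE)
        self._grants[user] = current | permission

    def revoke(self, user, permission=None):
        """Turn off one permission, or all access when none is named."""
        if permission is None:
            self._grants.pop(user, None)
        else:
            current = self._grants.get(user, Permission.NONE)
            self._grants[user] = current & ~permission

    def allows(self, user, permission):
        return permission in self._grants.get(user, Permission.NONE)


acl = DocumentAcl()
acl.grant("reviewer", Permission.VIEW | Permission.PRINT)
can_edit = acl.allows("reviewer", Permission.EDIT)
acl.revoke("reviewer", Permission.PRINT)   # switch off a single permission
can_print = acl.allows("reviewer", Permission.PRINT)
can_view = acl.allows("reviewer", Permission.VIEW)
acl.revoke("reviewer")                      # remove access altogether
still_has_access = acl.allows("reviewer", Permission.VIEW)
```

In a real deployment the enforcement has to happen on the server or in the viewer, since a check like this only helps if the user never receives an unprotected copy of the file.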

 

Integrate Collaborative Security Tools

To ensure compliance with today’s eDiscovery standards and the rules around electronically stored information (ESI), firms should look for solutions, often from third-party software providers, that provide robust and collaborative security tools for functions like redaction, advanced search, and watermarking. When implemented properly, these tools help create a secure and functioning framework to govern long-term data security.

 

Auto-Redaction & Search

Most personally identifiable information (PII) follows a typical pattern (think social security numbers and credit card information), which means that, if compromised, this data is essentially low-hanging fruit for hackers. Advanced search tools should be able to quickly search case files for matches on keywords, phrases, and regular expressions. With auto-redaction, you can permanently remove the confidential data for each match. The end result is a PDF with no traces of the redacted information. After all, the easiest and most surefire way to protect sensitive client information is to eliminate it altogether.
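The pattern-matching half of auto-redaction can be sketched in Python with regular expressions. This is an illustration on plain text only. The two patterns are hypothetical and simplified, and the names `PII_PATTERNS` and `redact` are made up for this example.

```python
import re

# Hypothetical, simplified patterns for illustration only.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),            # 123-45-6789
    "card": re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"),      # 16 digits, optional separators
}


def redact(text: str, mask: str = "[REDACTED]") -> str:
    """Replace every match of every PII pattern with the mask."""
    for pattern in PII_PATTERNS.values():
        text = pattern.sub(mask, text)
    return text


note = "SSN 123-45-6789, card 4111 1111 1111 1111."
clean_note = redact(note)
```

Replacing characters in a string is not the same as redacting a PDF. For a real document the matched content also has to be removed from the underlying file, not just covered up, which is the part a dedicated redaction tool handles.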

 

Watermarking & Digital Signatures

Watermarking and digital signatures are other features that can be easily integrated into legal applications to help prevent forgery and unauthorized file sharing. Many software companies offer these features and more as APIs that can be easily integrated to enhance client confidentiality and legal applications.

 

Encrypt Discovery Data

Encryption is a basic but vital component of information governance strategy, and one that many firms have previously overlooked. Properly encrypted data protects files both in transit and at rest, so that content is secure during all stages of the document lifecycle, from uploading to storing, sharing, and downloading. So even if hackers are able to access a system, any data they find will be unreadable without the proper decryption keys.

 

The Case for Third-Party Software Providers

Many firms are looking to third-party software providers to help deliver the security standards and tools necessary to help govern their data. This is a particularly appealing option since third-party providers have the tools, resources, documentation, and in-house expertise to help implement the functionality necessary for a proper information governance strategy.

Accusoft provides document and imaging solutions that are completely compliant with North American standards and the European Union’s Data Protection Directive for data security. If you’re looking for a fully integrated suite of scalable security tools for digital rights management, encryption, redaction, watermarking, and more, check out Accusoft’s PrizmDoc suite.