How to Find and Extract Text for Contract Automation in C#

Accusoft’s PrizmDoc Editor was designed to help developers quickly embed customizable document controls into their applications. In a previous article, we walked through how to dynamically assemble documents and programmatically insert commonly used content into them.  

This time around, we’ll illustrate another document assembly use case: finding and extracting raw text from a contract before presenting it to the end-user. Using C# code as an example, the following overview will provide you with a few simple steps on how to implement two of our most commonly used APIs. Be sure to check out our getting started guide first if you haven’t already. You’ll want to download our docker image and set up a proxy route to the PrizmDoc Editor server to follow along.

Now that you’re ready to go, let’s dive into our latest PrizmDoc Editor example, this time involving programmatic search for contract automation.

Finding and Extracting Raw Text in a Document

PrizmDoc Editor’s API functionality allows you to programmatically retrieve raw text from a Word (DOCX) document and analyze it before sharing it with a potential end user for review. In the following example, we’ll look at a contract being prepared for review online. Before granting a client access to view the document, we want to make sure it contains all the necessary terminology. 

In this example, we will be using a sample commercial lease agreement and searching for three different legal clauses (“indemnity,” “right to enter,” and “pet policy”) using a different search approach for each. Each step shows how to call the PrizmDoc APIs using C#.

Step 1: Add the Contract Template

To begin, upload an existing DOCX file to the PrizmDoc Editor server using the Upload Document API. Uploading the file creates a unique documentId, which you will need in the next step.

 using System;
 using System.IO;
 using System.Net.Http;
 using System.Net.Http.Headers;
 using System.Threading.Tasks;
 using Newtonsoft.Json.Linq;

 namespace Example
 {
   class UploadDocument
   {
     internal static readonly HttpClient httpClient = new HttpClient();

     static void Main(string[] args)
     {
       MainAsync().GetAwaiter().GetResult();
     }

     static async Task MainAsync()
     {
       using (FileStream file = File.OpenRead("commercial-lease-agreement.docx"))
       {
         var PRIZMDOC_EDITOR_ROOT = "http://localhost:21412";
         var API_ROOT = $"{PRIZMDOC_EDITOR_ROOT}/api/v1";
         var request = new HttpRequestMessage(HttpMethod.Post, $"{API_ROOT}/documents");
         request.Content = new StreamContent(file);
         request.Content.Headers.ContentType = new MediaTypeHeaderValue("application/vnd.openxmlformats-officedocument.wordprocessingml.document");

         using (var response = await httpClient.SendAsync(request))
         {
           response.EnsureSuccessStatusCode();

           var body = await response.Content.ReadAsStringAsync();

           // Parse the documentId from the returned JSON
           var documentId = JObject.Parse(body)["documentId"];

           Console.WriteLine($"Document uploaded successfully, documentId: {documentId}");
         }
       }
     }
   }
 }
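
Because every later step depends on the documentId, it can be worth validating the upload response before moving on. The following is a minimal sketch, not part of the PrizmDoc Editor API: the class and method names are hypothetical, and it uses System.Text.Json in place of the Newtonsoft.Json package used above so that it needs no extra dependency. The only detail taken from the article is that the response JSON carries a "documentId" field.

```csharp
using System;
using System.Text.Json;

// Hypothetical helper (not part of PrizmDoc Editor): pulls the documentId
// out of the upload response body and fails loudly if it is missing.
static class UploadResponse
{
    public static string ExtractDocumentId(string json)
    {
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            JsonElement root = doc.RootElement;

            // Only accept a JSON object with a non-empty string "documentId".
            if (root.ValueKind == JsonValueKind.Object &&
                root.TryGetProperty("documentId", out JsonElement id) &&
                id.ValueKind == JsonValueKind.String &&
                !string.IsNullOrEmpty(id.GetString()))
            {
                return id.GetString();
            }
        }

        throw new InvalidOperationException(
            "Upload response did not contain a documentId.");
    }
}
```

In the upload example, you could call UploadResponse.ExtractDocumentId(body) where the JSON is parsed, then pass the returned string on to the Get Text request instead of copying the ID by hand.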

Step 2: Make a Get Text Request

Now that the document is uploaded, use the Get Text API to get a plain text representation of the document’s main body content. Make sure you replace the “MY_DOCUMENT_ID” placeholder in the code with the documentId that was returned in Step 1.

 using System;
 using System.IO;
 using System.Net.Http;
 using System.Net.Http.Headers;
 using System.Text;
 using System.Threading.Tasks;
 using Newtonsoft.Json.Linq;

 namespace Example
 {
   class GetText
   {
     internal static readonly HttpClient httpClient = new HttpClient();

     static void Main(string[] args)
     {
       MainAsync().GetAwaiter().GetResult();
     }

     static async Task MainAsync()
     {
       var PRIZMDOC_EDITOR_ROOT = "http://localhost:21412";
       var API_ROOT = $"{PRIZMDOC_EDITOR_ROOT}/api/v1";
       var DOCUMENT_ID = "MY_DOCUMENT_ID";

       var request = new HttpRequestMessage(HttpMethod.Get, $"{API_ROOT}/documents/{DOCUMENT_ID}/text");

       using (var response = await httpClient.SendAsync(request))
       {
         response.EnsureSuccessStatusCode();

         var body = await response.Content.ReadAsStringAsync();
         var text = (string)JObject.Parse(body)["body"];

         Console.WriteLine($"Document body text: {text}");
       }
     }
   }
 }
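
Since the text comes back as a single string, it can help to break it into individual clauses before searching. The sketch below is one way to do that, assuming that paragraphs in the returned text are separated by line breaks (the matching code in Step 3 makes the same assumption when it splits on "\n"). The ClauseSplitter class is a hypothetical helper, not part of PrizmDoc Editor.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: splits plain text into trimmed, non-empty lines so
// each clause can be matched on its own.
static class ClauseSplitter
{
    public static List<string> SplitClauses(string text)
    {
        return text
            // Handle Windows, Unix, and old Mac line endings alike.
            .Split(new[] { "\r\n", "\n", "\r" }, StringSplitOptions.RemoveEmptyEntries)
            .Select(line => line.Trim())
            // Drop lines that contained only whitespace.
            .Where(line => line.Length > 0)
            .ToList();
    }
}
```

Calling ClauseSplitter.SplitClauses(text) on the string returned by the Get Text request gives a list you can loop over, score, or filter clause by clause.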

Step 3: Locate Matching Text

After retrieving the document’s body content, it’s time to find the text you’re looking for. No changes are being made to the original document in this case, so matching can be performed on the returned string entirely within your application. That leaves you free to use exact matching, fuzzy matching, or regular expressions, so we’ll use a different search type for each of the clauses we’re looking for in the contract. For completeness, this example repeats the Get Text request from Step 2 and passes the returned text to a getMatches method. Also, note that in this code example a third-party library, FuzzySharp, is being used for fuzzy matching support.

 using System;
 using System.Linq;
 using System.Text.RegularExpressions;
 using System.Net.Http;
 using System.Threading.Tasks;
 using Newtonsoft.Json.Linq;

 // This example uses a third party library, FuzzySharp, to perform fuzzy matching.
 using FuzzySharp;


 namespace Example
 {
     class TextMatcherExample
     {
         internal static readonly HttpClient httpClient = new HttpClient();

         static void Main(string[] args)
         {
           MainAsync().GetAwaiter().GetResult();
         }

         static async Task MainAsync()
         {
           var PRIZMDOC_EDITOR_ROOT = "http://localhost:21412";
           var API_ROOT = $"{PRIZMDOC_EDITOR_ROOT}/api/v1";
           var DOCUMENT_ID = "MY_DOCUMENT_ID";

           var request = new HttpRequestMessage(HttpMethod.Get, $"{API_ROOT}/documents/{DOCUMENT_ID}/text");

           using (var response = await httpClient.SendAsync(request))
           {
             response.EnsureSuccessStatusCode();

             var body = await response.Content.ReadAsStringAsync();
             var text = (string)JObject.Parse(body)["body"];

             TextMatcherExample.getMatches(text);
           }
         }

         static void getMatches(string text)
         {
             // Exact match (Indemnity clause)
             var indemnityFound = text.Contains("The indemnity agreement by the Guarantor will be attached as a schedule to this Lease and will serve as a form of guarantee to this Lease."); // true

             // Fuzzy match (Right to enter clause)
             var rightToEnterFound = text.Split('\n').Max((string clause) => {
                 return Fuzz.Ratio(clause, "landlord right to enter premises");
             }) > 33;

             // Regular expression match (Pets clause)
             // "\b" represents a word boundary in a regular expression.
             var regex = new Regex(@"\bpets\b", RegexOptions.IgnoreCase);
             var petPolicyFound = regex.IsMatch(text); // false

             Console.WriteLine($"indemnityFound: {indemnityFound}\nrightToEnterFound: {rightToEnterFound}\npetPolicyFound: {petPolicyFound}");
         }
     }
 }

You can see here that two of our searches returned a matching result. In each case, PrizmDoc Editor displays the relevant text containing the search terms. The third search, however, returned a “false” result, indicating that the term was not found in the document.
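
If you need to run the same checks against many contracts, the individual matches can be gathered into a reusable checklist that reports which clauses are missing. The following is a rough sketch under stated assumptions: the ContractChecklist class, its method names, and the two sample checks are hypothetical, and it relies only on built-in .NET string and Regex calls. A fuzzy check such as the FuzzySharp one above could be added as another entry in the dictionary.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: runs named checks against the contract text and
// returns the names of the checks that did not match.
static class ContractChecklist
{
    public static List<string> FindMissing(
        string text,
        IDictionary<string, Func<string, bool>> checks)
    {
        return checks
            .Where(check => !check.Value(text))
            .Select(check => check.Key)
            .ToList();
    }

    // Sample checks for illustration only; adjust the phrases to your contract.
    public static Dictionary<string, Func<string, bool>> DefaultChecks()
    {
        return new Dictionary<string, Func<string, bool>>
        {
            // Case-insensitive substring match.
            ["indemnity"] = t =>
                t.IndexOf("indemnity agreement", StringComparison.OrdinalIgnoreCase) >= 0,
            // Whole-word regular expression match for "pet" or "pets".
            ["pet policy"] = t =>
                Regex.IsMatch(t, @"\bpets?\b", RegexOptions.IgnoreCase),
        };
    }
}
```

Passing the text from the Get Text request to ContractChecklist.FindMissing(text, ContractChecklist.DefaultChecks()) returns an empty list when every clause is present, which makes it easy to block a contract from review until the list is empty.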


Using Programmatic Search in Practice

The ability to programmatically search raw text is quite valuable for applications that prioritize consistency and efficiency. PrizmDoc Editor’s API functionality allows end-users to quickly search lengthy documents for unresolved comments and redlines or choose specific clauses for insertion into other documents, all without having to switch between multiple applications. This not only centralizes workflows and document management, but also minimizes the risk of human error and greatly streamlines work processes for contract automation.

PrizmDoc Editor is already helping developers integrate enhanced document functionality into their applications to better meet customer needs. Our REST API technology lets you quickly and easily embed the ability to assemble and edit DOCX files, so you can keep your focus on your product’s core features.

To see how PrizmDoc Editor brought an entirely new set of editing features to a powerful governance and risk management platform, check out our ENGAIZ case study. If you have any questions about PrizmDoc Editor’s capabilities and how it can help your applications get to market faster, contact us today.