Machine Learning and the Death of Old-School Classifiers
An initial query for readers out there. What is this text below? Did my cat walk across the keyboard as I was typing this blog article? Is this simply modem line noise?
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
Many, I’m sure, will recognize this text block as a regex, specifically one that checks whether a particular block of text is a valid email address. It does a fantastic job, but a non-trivial amount of work clearly went into constructing this regex and the many others like it.
Whoever wrote it had a clear understanding of RFC 5322 and the intricacies therein. If I were to write my own email-validating regex, it would likely be far too restrictive, and I would probably fall into a host of other pitfalls along the way. A great deal of domain-specific knowledge goes into developing these patterns, and their complexity trips up many developers.
This regex is also an example of a strict classifier. It is a boolean test that places a string of text in exactly one of two classes: {is_a_valid_email_address, is_not_a_valid_email_address}.
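To make the "strict classifier" idea concrete, here is a minimal Python sketch. It assumes a deliberately simplified pattern (only the unquoted local part and the hostname form of the domain), not the full regex shown earlier, and the names EMAIL_RE and is_valid_email are my own for illustration.

```python
import re

# Simplified pattern for illustration only: it covers the unquoted
# ("dot-atom") local part and a dotted hostname, and leaves out the
# quoted-string and IP-literal branches of the full regex.
LOCAL_PART = r"[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*"
DOMAIN = r"(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"
EMAIL_RE = re.compile(LOCAL_PART + "@" + DOMAIN, re.IGNORECASE)


def is_valid_email(text):
    """Strict classifier: the whole string either matches or it does not."""
    return EMAIL_RE.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ["alice@example.com", "a..b@example.com", "not an email"]:
        print(candidate, "->", is_valid_email(candidate))
```

The return type is the point here: a bool, with no way to say "probably" or "almost".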
Strictly binary classifiers (a hard true or false) are very useful for validation tasks, but what I’m interested in investigating is the shift toward “fuzzier” classifiers: those that ask, under ambiguous circumstances, “How likely is this text to be an email? How likely is this picture to be a dog? Where in this image is a barcode?” In cases like these, strict classifiers are not the tool we want to work with.
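For contrast, a "fuzzy" classifier returns a score rather than a verdict. The sketch below is purely illustrative: the three features, the weights, and the bias are hypothetical values I picked by hand, whereas a real system would learn them from labeled data. It only shows the shape of the output, a number between 0 and 1.

```python
import math

# Hypothetical, hand-picked weights. A trained model would learn these.
WEIGHTS = {
    "has_at_sign": 3.0,
    "has_dot_after_at": 2.0,
    "has_whitespace": -4.0,
}
BIAS = -2.5


def email_features(text):
    """Turn a string into a few crude 0/1 features."""
    _, at_sign, domain = text.partition("@")
    return {
        "has_at_sign": 1.0 if at_sign else 0.0,
        "has_dot_after_at": 1.0 if "." in domain else 0.0,
        "has_whitespace": 1.0 if any(ch.isspace() for ch in text) else 0.0,
    }


def email_likelihood(text):
    """Logistic score in (0, 1): higher means 'looks more like an email'."""
    features = email_features(text)
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


if __name__ == "__main__":
    for candidate in ["alice@example.com", "alice@example", "hello world"]:
        print(f"{candidate!r}: {email_likelihood(candidate):.3f}")
```

The caller, not the classifier, decides where to draw the line, and that threshold can be tuned per application.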
The question I seek to answer is: How has the industry previously solved these questions, and how is this changing?
How the Industry Used to Do Things
Most of the products I work on have to do with, broadly speaking, image recognition and detection. Let’s begin there. I’ll start with an example that’s near and dear to my heart – barcodes.
I’m sure most people out there have seen QR codes before. They are two-dimensional barcodes, invented in 1994 by Denso Wave, a Japanese automotive components company. They’ve since exploded in popularity and you’ll see them all over the place: soda cans, fliers, magazine articles, etc. You scan them with your cell phone, and an app might take you to a website or show some metadata for the QR code.
What our API needs to do is find instances of QR codes in an image, whether it be a fax, a scanned document, or a photo, and it needs to do so quickly and accurately. As a software developer, I can say this has presented some particular challenges over the years. How might we identify areas of an image that contain QR codes?
The biggest and most obvious feature we can see is the set of three position patterns, each a group of concentric squares, so let’s focus on these and do some free thinking on how to find them. We might do some connected component analysis, or perhaps do some run-length calculations to see if we can find instances of the pattern’s 1:1:3:1:1 dark-to-light ratios along a scan line.
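The run-length idea can be sketched in a few lines of Python. This is a simplified illustration, not our production code: it assumes the image row has already been binarized (1 for dark, 0 for light), and the function names and the tolerance value are my own choices.

```python
from itertools import groupby


def run_lengths(row):
    """Collapse a row of 0/1 pixels into (value, length) runs."""
    return [(value, len(list(group))) for value, group in groupby(row)]


def find_position_pattern_candidates(row, tolerance=0.5):
    """Return the start column of every dark/light/dark/light/dark run
    sequence whose widths are roughly in the ratio 1:1:3:1:1."""
    expected = (1, 1, 3, 1, 1)
    runs = run_lengths(row)
    candidates = []
    column = 0
    for i, (value, length) in enumerate(runs):
        window = runs[i:i + 5]
        # Runs alternate in a binary row, so a window that starts on a
        # dark run is automatically dark, light, dark, light, dark.
        if value == 1 and len(window) == 5:
            widths = [n for _, n in window]
            module = sum(widths) / 7.0  # the pattern is 7 modules wide
            if all(abs(n - e * module) <= tolerance * module
                   for n, e in zip(widths, expected)):
                candidates.append(column)
        column += length
    return candidates


if __name__ == "__main__":
    # A scan line through a position pattern drawn at 2 pixels per module.
    row = [0] * 4 + [1] * 2 + [0] * 2 + [1] * 6 + [0] * 2 + [1] * 2 + [0] * 4
    print(find_position_pattern_candidates(row))
```

A hit on one row is only a candidate. A fuller detector would confirm it with a vertical scan through the same point and then look for three such patterns arranged at the corners of a square.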
We might also decide to run an edge-detection filter on the image to find the lines of the pattern. If we look at enough images of QR codes, we’d note that the ratio of white to black blocks tends toward 1:1, and we could use that as a heuristic to guide our generalized search of the image.
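The 1:1 heuristic is also easy to sketch. The Python below assumes a binarized image stored as a list of rows (1 for dark, 0 for light) and flags square tiles whose dark fraction is close to one half. The tile size, the slack value, and the function names are assumptions made for illustration.

```python
def dark_fraction(image, top, left, size):
    """Fraction of dark (1) pixels in a size-by-size tile."""
    dark = sum(sum(image[r][left:left + size]) for r in range(top, top + size))
    return dark / float(size * size)


def balanced_tiles(image, size, slack=0.15):
    """Return (top, left) for each tile whose dark fraction is near 0.5."""
    height, width = len(image), len(image[0])
    hits = []
    for top in range(0, height - size + 1, size):
        for left in range(0, width - size + 1, size):
            if abs(dark_fraction(image, top, left, size) - 0.5) <= slack:
                hits.append((top, left))
    return hits


if __name__ == "__main__":
    # Left half is blank paper, right half is a checkerboard-like texture.
    image = [[0] * 4 + [(r + c) % 2 for c in range(4)] for r in range(4)]
    print(balanced_tiles(image, size=4))
```

On its own this is a weak signal, since plenty of text and halftone regions are also about half dark, so it is best used to rank regions for a more expensive check rather than to accept them outright.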
All of these methods have varying degrees of difficulty in implementation and high complexity. These approaches, and those that we have used in our software, have taken years to develop and are highly specialized. I, myself, have been working on them for over 10 years now. They’ve been written with an intimate understanding of the various specifications of the barcodes we read.
Now let’s throw another wrinkle in things. Let’s imagine you’ve implemented the algorithms above based on what the QR code specification says, and now you run it on data from actual customers.
When dealing with the real world, your expectations can be thrown awry. You’ll often go in expecting sane inputs, but what you can end up with are blurry, perspective-warped, noisy messes. Now you have to bring even more advanced concepts to the party: things like despeckling algorithms, 2D homographies, etc.
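As a small taste of what a 2D homography involves: it is a 3x3 matrix that maps a point in one plane to a point in another, with a divide at the end to account for perspective. The Python sketch below only applies a given matrix to a point. Estimating the matrix from a warped barcode is the hard part and is not shown, and the function name is my own.

```python
def apply_homography(h, x, y):
    """Map the point (x, y) through the 3x3 homography h (a list of rows)."""
    xp = h[0][0] * x + h[0][1] * y + h[0][2]
    yp = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    # Dividing by w is what distinguishes a perspective warp from a
    # plain affine transform (where w is always 1).
    return xp / w, yp / w


if __name__ == "__main__":
    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    perspective = [[1, 0, 0], [0, 1, 0], [0.5, 0, 1]]
    print(apply_homography(identity, 3, 4))
    print(apply_homography(perspective, 2, 2))
```

To rectify a skewed QR code, one would typically estimate such a matrix from the detected corner positions and then resample the image through it.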
This is how the software industry has largely done things in the past across many different contexts (e.g., facial recognition technology). It sounds pretty grim: to handle that long tail of hard-to-read images, you need exceedingly intricate and complex domain-specific algorithms.
Thankfully, this has been changing lately. Now, companies have begun to leverage their large amounts of data, often supplied by customers or synthetically generated, to create general-purpose algorithms to solve their needs.