Effective Automated Testing in a Microservice World - Part 2

Gerry Hernandez, Accusoft Senior Software Engineer

This is a continuation of our series of blog posts that share our experience with functional test automation in a real-world microservice product base. In part two, we will share our philosophical approach to SURGE: Simulate User Requirements Good-Enough. Be sure to read part one before getting started.

The SURGE Methodology

We spent a lot of time deliberating why we ran into the problems discussed in the first part of this series. Then we stopped ourselves and realized that we needed to stop relying on theory and jump straight into practice. After all, on paper, Cucumber sounded like a silver bullet until we tested it with our products. So here’s where we landed.

Prototype Everything

Every single design decision in the SURGE methodology, and in turn, our Node implementation of our framework, was prototyped and tested in real-world scenarios with real-world code. We know that not all code is perfect; technical debt exists everywhere. SURGE works well with theoretically optimal (i.e. fictional) codebases, but it also has zero problems with dirty applications that were a result of not enough coffee on a Monday morning. This is the reality we face as software engineers and QA analysts alike, so we feel the methodology should be centered around imperfect situations.

With this philosophy, we found both Node and Python to be very suitable languages, as each one is borderline a rapid application development (RAD) tool compared to more heavyweight languages such as C++ and Java. But to be extremely clear, SURGE is just a set of patterns and practices; any set of technologies may be suitable for implementation. The cloud services team ended up picking Node because it was quick, easy, and fun.

Behavior is Contextual

Humans are good at communicating because we’re social beings. We can give each other simple instructions and follow the spirit of the words, as opposed to the literal meaning. Computers are utter morons when it comes to natural language, so let’s not try to make them something they’re not. Well, at least not for our functional test suites!

So we thought about this. A person can understand the two different example Gherkin features given in part one because they understand that each one has its own context, its own meaning, and its own vocabulary. This is very important when you have a product with a wide range of capabilities, from reading barcodes to document storage and workflows. For instance, the word “scanning” has two completely different meanings when discussing a barcode versus, for example, a sheet of paper. We want to maintain this philosophy, and in turn, we urge our developers to write natural scenarios that make sense to a human, as opposed to making sense to a Gherkin parsing engine.

What we end up with is Gherkin being coupled directly to step functions, as opposed to magically matched by a parsing engine. This means that Gherkin statements can be repeated independently in separate areas of functionality of the test suite without ever colliding. We believe this to be the most critical difference between traditional BDD and SURGE.
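To make the idea concrete, here is a minimal Node sketch of feature-scoped step binding. It is not the actual SURGE framework: defineFeature, the feature names, and the step text are all hypothetical names made up for illustration. The point is only that the same Gherkin sentence can live in two features and mean two different things, because each feature looks up steps in its own private map.

```javascript
// Hypothetical helper: a feature owns a private map of Gherkin text -> step function.
function defineFeature(name, steps) {
  return {
    name,
    // Run a scenario by resolving each statement only against this feature's steps.
    run(statements) {
      const state = {};
      for (const text of statements) {
        const step = steps[text];
        if (typeof step !== 'function') {
          throw new Error(`${name}: no step bound to "${text}"`);
        }
        step(state);
      }
      return state;
    },
  };
}

// "scan" means decoding a barcode in this context...
const barcodeFeature = defineFeature('Barcode reading', {
  'When I scan the item': (state) => { state.result = 'decoded barcode value'; },
});

// ...and digitizing a sheet of paper in this one. Same text, no collision.
const documentFeature = defineFeature('Document storage', {
  'When I scan the item': (state) => { state.result = 'stored page image'; },
});

const barcodeState = barcodeFeature.run(['When I scan the item']);
const documentState = documentFeature.run(['When I scan the item']);
```

Because there is no global registry, adding a third feature with the same sentence cannot break either of these two.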

Tests Are Inherently Stateful

When making a peanut butter and jelly sandwich, you would put the knife in the jelly after removing the lid. This implies that you are already aware that the lid has been removed. Not only that, but you must also be aware that you specifically removed the lid to the jar of jelly, rather than the peanut butter. Otherwise, you may end up with glass shards on your PB&J, which is not desirable. These same implications are shared with functional tests.

Functional tests, whether they’re for acceptance or regression, follow a set of ordered steps. Each step either mutates the state of the test or verifies the state against an expectation. Through experimentation, we discovered that traditional BDD test execution makes it very hard to comprehend state, since every step is global and may be invoked in any arbitrary order, from any arbitrary scenario, from any arbitrary feature. This is what leads to the cyclomatic complexity issue described in part one.

With this in mind, we wanted SURGE to promote a very simple, lightweight way of maintaining state within a functional test context that is isolated from shared code. That is, no shared code should ever depend on or mutate state directly.
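As a rough illustration, the sandwich example could look like the following Node sketch. Everything here (removeLid, createSandwichScenario, the step text) is a hypothetical name, not part of any real framework. The shared helper is a pure function that never sees the test state; only the step functions, which belong to one scenario, read and write it.

```javascript
// Shared code (hypothetical): takes its input, returns a new value, knows nothing about tests.
function removeLid(jar) {
  return { ...jar, lidOn: false };
}

// Each scenario gets its own state object; steps close over it, shared code never does.
function createSandwichScenario() {
  const state = { jelly: { name: 'jelly', lidOn: true }, knifeIn: null };
  const steps = {
    'Given I remove the lid from the jelly': () => {
      state.jelly = removeLid(state.jelly);
    },
    'When I put the knife in the jelly': () => {
      // This step depends on the earlier one having mutated the state.
      if (state.jelly.lidOn) throw new Error('the lid is still on the jelly jar');
      state.knifeIn = state.jelly.name;
    },
  };
  return { state, steps };
}

const sandwich = createSandwichScenario();
sandwich.steps['Given I remove the lid from the jelly']();
sandwich.steps['When I put the knife in the jelly']();
```

Running the second step on a fresh scenario, without removing the lid first, throws instead of silently producing a glass-shard sandwich.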

Reuse Only the Code That Matters

We find no value in reusing Gherkin statements, and therefore, we find no value in reusing step definitions. We understand how that might sound counterproductive, but suspend your disbelief for a moment.

One anti-pattern we immediately noticed while prototyping with the traditional BDD frameworks was that our code started to reflect the limitations of the frameworks, as opposed to reflecting sound software engineering practices. It doesn’t make sense to treat a codebase of functional tests any differently than you would a production codebase. Clean-code practices that promote maintainability and general quality have been established and proven for decades; why not use them?

So our best practice is to write a series of client libraries for our own products. These client libraries are stateless and are reused throughout the entire test suite. If two independent features need to perform some common actions, they each would implement a step function that uses the shared library code.
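Here is a small Node sketch of that split, under stated assumptions: storageClient, its buildUploadRequest function, the URL, and the step text are all invented for illustration, and the "client" only builds a request description rather than calling a real service. Two features each define their own step, and both delegate the real work to the same stateless library.

```javascript
// Hypothetical stateless client library: every input is a parameter,
// and nothing is remembered between calls.
const storageClient = {
  buildUploadRequest(baseUrl, fileName) {
    return { method: 'POST', url: `${baseUrl}/documents`, body: { fileName } };
  },
};

// Feature A (document storage): its own step, its own state.
const storageSteps = {
  'When I upload the contract': (state) => {
    state.request = storageClient.buildUploadRequest(state.baseUrl, 'contract.pdf');
  },
};

// Feature B (workflows): same sentence, same library call, different context.
const workflowSteps = {
  'When I upload the contract': (state) => {
    state.request = storageClient.buildUploadRequest(state.baseUrl, 'contract.pdf');
    state.workflowStarted = true;
  },
};

const storageState = { baseUrl: 'https://example.test' };
storageSteps['When I upload the contract'](storageState);

const workflowState = { baseUrl: 'https://example.test' };
workflowSteps['When I upload the contract'](workflowState);
```

Delete both step maps and storageClient is still a usable, properly factored module on its own.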

The beauty of this pattern is that if we were to completely delete all of the testing-specific code (i.e. the feature files and step definitions), we would still have a fully functioning codebase that is properly factored and follows all our standards. This is good, and this is simple.

The elephant in the room is that, because Gherkin feature files are coupled to their step definitions and all shared code is factored into stateless libraries, each feature must map its own step function to that shared code, so there is some code repetition. While this is true, and many theorists would say not to repeat yourself, we feel that it’s intentional and meaningful repetition. Ignore the fact that the same text exists in multiple parts of the code for a moment and realize that the location of each step definition function is the unique part. Again, since each feature is considered its own context, just like how a human considers a new conversation a separate context from another, it can be said that each step definition is unique due to where it resides, not necessarily the text that defines it.

But we did say that we want to be practical. There are situations where reusing step functions makes a lot of sense, so we do allow for programmatic inclusions of steps from any file. It must be done deliberately; we intentionally do not want a framework to do it automatically. This is most definitely the exception to the rule, but it is certainly reasonable, so we allow it.
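The framework's actual inclusion mechanism is not shown in this post, so the following Node sketch only approximates the idea with plain object spread. The names commonSteps and uploadSteps, and the step text, are hypothetical; commonSteps stands in for something a feature would explicitly require from another file.

```javascript
// Stand-in for steps exported from another file (hypothetical).
const commonSteps = {
  'Given I am signed in': (state) => { state.user = 'test-user'; },
};

// The feature opts in on purpose; nothing is discovered or matched automatically.
const uploadSteps = {
  ...commonSteps,
  'When I upload a document': (state) => {
    state.uploadedBy = state.user;
  },
};

const uploadState = {};
for (const text of ['Given I am signed in', 'When I upload a document']) {
  uploadSteps[text](uploadState);
}
```

A reader of the feature can see exactly which steps were pulled in and from where, which is the whole reason for making it deliberate.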

Above all, stop worrying about repeating minor boilerplate code. It simply does not matter. Move on and be productive. Get your work done and be happy.

To Be Continued…

Coming up next, we’ll talk about how we actually implemented SURGE as a framework, as well as our observed results. Spoiler alert: we became outrageously productive.

Until then, if this stuff is exciting to you, or even if you think we’re completely wrong and know you can kick it to the next level, we’d love to hear from you.

Happy coding! 🙂

Gerry Hernandez began his career as a researcher in various fields of digital image processing and computer vision, working on projects with NIST, NASA JPL, NSF, and Moffitt Cancer Center. He grew to love enterprise software engineering at JP Morgan, leading to his current technical interests in continuous integration and deployment, software quality automation, large-scale refactoring, and tooling. He has oddball hobbies, such as his fully autonomous home theater system, and even Christmas lights powered by microservices.