Microservice applications grow in complexity as the number of services increases.
This is something I’ve seen time and again working on microservice-based application architectures – one minor change to an edge service, and the main customer portal goes down due to a broken API call.
Service contracts are a first step towards preventing these issues, but like their legal counterparts, these contracts often have loopholes, workarounds, and other gaps that can result in your application being brought down by a rogue deployment.
I wanted to take some time today to explore some of the issues with microservice contracts and offer strategies for resolving them – preventing cascading system failure before it has a chance to happen.
Story Time: Contract Failures in Action
Let’s take a look at a common scenario. I was working on a payment system that integrated with a legacy application, which stored the bill of goods being charged to the end user.
I’d spent days adding internationalization support, tested the payment system thoroughly, and was confident in the code I’d written.
I pushed the code to production, tests passed, everything was looking solid… right up until our first international customer tried to check out. In making the change, I’d added a new parameter to represent the desired currency for the transaction and assigned it a default value of “USD” – a fair assumption, given that 90% of our user base was based in the United States.
After some investigation, the problem was painfully obvious – the legacy system hadn’t been updated to account for the new currency parameter in its calls to the payment service.
While I’d been meticulous about updating the payment service’s API endpoints and verifying that everything worked with a smoke test, the payment form’s validation broke on any currency value other than USD. The issue was masked by our front end’s internationalization work, which helpfully hid the currency parameter when presenting the UI to users in the US.
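To illustrate the trap, here’s a minimal TypeScript sketch of the shape of that change – everything here is a hypothetical reconstruction, not our actual code. The default keeps stale callers alive just long enough for the strict validation behind it to blow up on the first non-USD request:

```typescript
interface ChargeRequest {
  amountCents: number;
  currency?: string; // new, optional parameter; legacy callers never send it
}

function charge(req: ChargeRequest): void {
  // Defaulting masks callers that were never updated for the new contract.
  const currency = req.currency ?? "USD";

  // Downstream validation only ever saw USD in testing, so this path
  // stayed dormant until the first international customer checked out.
  if (currency !== "USD") {
    throw new Error(`Unsupported currency: ${currency}`);
  }

  // ...charge logic...
}
```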
Luckily, this issue was relatively benign. Our internationalization test coverage simply wasn’t robust enough, and we got bitten by the contract change the first time a production user exercised it.
With this minor change to our API contract, we’d introduced a hard-to-detect bug that manifested in almost the worst way possible.
Issues Arising from Microservice Contracts
Developers run the risk of violating a contract when making changes to a microservice.
These violations can vary in complexity, from minor issues like a changed response format to major issues like the removal of an API endpoint. When these violations occur, they can break the surrounding application in unexpected ways.
These breakages occur with varying degrees of visibility and often result in unexpected behavior in parts of the code far removed from the original changes.
Below are some examples of microservice contract violations:
- An API endpoint can change its required input parameters
- An API endpoint can change its provided output
- An API endpoint can change the response codes it uses
- An API endpoint can change the body of the response returned to the caller
Each of the above violations can result in broken functionality across the application as a whole.
This can result in subtle bugs that manifest in other systems unexpectedly, as these other portions of the system were developed against a contract that is no longer valid.
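To make that concrete, here’s a minimal TypeScript sketch (all names hypothetical) of the second kind of violation above: a service renames a response field, and a consumer built against the old contract silently renders garbage instead of failing loudly at the service boundary.

```typescript
// The payment service's response as the consumer was written against it:
type PaymentResponseV1 = {
  amount: number; // total, in cents
  status: string; // "approved" | "declined"
};

// After a "minor" refactor, the service renames a field:
type PaymentResponseV2 = {
  amountCents: number; // renamed from `amount`
  status: string;
};

// A consumer still coded against the old contract:
function renderTotal(payment: PaymentResponseV1): string {
  return `$${(payment.amount / 100).toFixed(2)}`;
}

// At runtime the consumer receives a V2 payload over the wire, so the
// compile-time types can't save it: `payment.amount` is undefined, and
// the UI quietly prints "$NaN" instead of erroring at the boundary.
const wirePayload = JSON.parse('{"amountCents": 1999, "status": "approved"}');
console.log(renderTotal(wirePayload)); // -> "$NaN"
```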
In the payment system I described above, we’d taken great pains to ensure we had clear documentation around our payment service, with end-to-end tests enforcing behavior throughout.
Yet we were still bitten by a bug in an unexpected area of the application, precisely because of our tight focus on buttoning down the payment service itself.
We clearly had more finding and fixing to do.
Detecting Microservice Contract Problems
There are several strategies that you can use to combat microservice contract violations like the payment system failure I described above.
Our documentation, which we’d thought of as our first line of defense, was up to date. But documentation is essentially unenforceable – and a pain to maintain well – so it still left gaping holes in our build system through which a service change could bring down the application in unexpected ways.
We ended up exploring several other methods of detecting contract problems in microservices, ultimately building a stronger and more robust testing framework that lent confidence to the team as we made successive changes.
1) Contract Testing
We started with contract testing.
Contract testing uses automated verification to enforce the contract, ensuring that the microservice being tested conforms to what its consumers expect. This is as simple as it sounds – if we have an endpoint available on our microservice, we should write tests verifying that the endpoint behaves as specified in every scenario the contract covers.
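As a sketch of what this can look like, here’s a consumer-driven contract test using Pact’s JavaScript bindings, run under a test runner like Jest. The service names and payloads are hypothetical stand-ins, not our actual contract:

```typescript
import { PactV3, MatchersV3 } from "@pact-foundation/pact";

const { like } = MatchersV3;

// Hypothetical consumer/provider pair: the checkout front end consumes
// the payment service.
const provider = new PactV3({
  consumer: "checkout-frontend",
  provider: "payment-service",
});

describe("payment service contract", () => {
  it("creates a charge and reports its status", () => {
    provider
      .uponReceiving("a request to create a charge")
      .withRequest({
        method: "POST",
        path: "/charges",
        headers: { "Content-Type": "application/json" },
        body: { amountCents: 1999, currency: "USD" },
      })
      .willRespondWith({
        status: 201,
        headers: { "Content-Type": "application/json" },
        // `like` pins the shape and types of the response, not exact values.
        body: like({ id: "ch_123", status: "approved" }),
      });

    // Pact stands up a mock provider; the test fails loudly if the
    // request or response drifts from the recorded contract.
    return provider.executeTest(async (mockServer) => {
      const res = await fetch(`${mockServer.url}/charges`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ amountCents: 1999, currency: "USD" }),
      });
      if (res.status !== 201) throw new Error("contract violated");
    });
  });
});
```

The pact file this produces can then be verified against the real provider in its own build, so drift on either side fails CI rather than production.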
Testing the microservice itself is only a small part of the fix, though. Contract tests gave us a dependable suite that failed loudly whenever the contract was violated, but we would still have needed far better end-to-end coverage to catch the internationalization failure.
2) GraphQL
We then looked at using GraphQL to help codify our API and enforce behavior between the services that drove our application.
This allowed us to standardize the API used by our front-end, enforcing contract conformity on the back end while making changes relatively transparent to the front-end code calling the endpoints.
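For a sense of what this buys you, here’s a small sketch using the graphql-js reference implementation – the schema and field names are hypothetical. Because every query is validated against the schema, a front end asking for a field that no longer exists fails fast at the GraphQL layer rather than deep in rendering code:

```typescript
import { buildSchema, graphql } from "graphql";

// A hypothetical slice of the schema. The schema itself is the contract,
// so removing or retyping a field breaks validation for every query using it.
const schema = buildSchema(`
  type Payment {
    id: ID!
    amountCents: Int!
    currency: String!
  }

  type Query {
    payment(id: ID!): Payment
  }
`);

const rootValue = {
  payment: ({ id }: { id: string }) => ({
    id,
    amountCents: 1999,
    currency: "USD",
  }),
};

// Asking for a field the schema doesn't define fails at validation time,
// before any resolver runs:
const source = '{ payment(id: "1") { id total } }';
graphql({ schema, source, rootValue }).then((result) => {
  // Logs something like: Cannot query field "total" on type "Payment".
  console.log(result.errors?.[0]?.message);
});
```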
However, we quickly found that while this provided a unified interface that front-end developers could rely on, it didn’t fully prevent contract violations from manifesting in production. It made the API responses more dependable, but at its core it was handling the same values that had caused the breakage in the first place.
It also didn’t help that we were retrofitting GraphQL onto an existing system – while that adds some value, GraphQL shows its power best when it sits at the forefront of the design process, informing the evolution of the application itself.
Ultimately, this means that transitioning to GraphQL can be very challenging if you didn’t start with a GraphQL-oriented architecture.
3) Distributed Tracing
Even with contract testing in place and GraphQL standardizing communication between the application and the microservices driving it, we still had issues lurking in the systems behind the helpful GraphQL interface we’d built.
The problem surfaced in our PDF receipt generation system – a change to one of the payment system’s API responses added a new field that the receipt generator hadn’t accounted for.
Given the number of services we had to talk to in order to build the receipts, tracking down the issue was a major pain.
Distributed tracing helped us find this issue by giving us a useful framework in which to trace the requests as they moved through the system.
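As a sketch of the approach, here’s how a single hop can be wrapped in a span using the OpenTelemetry JavaScript API. SDK and exporter setup for a tracing backend is omitted, and the service and URL names are hypothetical:

```typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("receipt-generator");

// Wrap the cross-service call in a span so a contract drift (bad status,
// unexpected payload) shows up as a failed span on the trace, pinpointing
// exactly which hop broke.
async function fetchPaymentForReceipt(paymentId: string): Promise<unknown> {
  return tracer.startActiveSpan("payment-service.getPayment", async (span) => {
    span.setAttribute("payment.id", paymentId);
    try {
      const res = await fetch(`https://payments.internal/payments/${paymentId}`);
      span.setAttribute("http.response.status_code", res.status);
      if (!res.ok) {
        span.setStatus({ code: SpanStatusCode.ERROR, message: `HTTP ${res.status}` });
        throw new Error(`payment service returned ${res.status}`);
      }
      return await res.json();
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}
```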
There are a few tools leveraging distributed tracing that you should try.
Jaeger and Zipkin are two open-source distributed tracing systems that can help you track down your points of failure. These tools enable transaction monitoring, latency optimization, and data analysis.
Aspecto, which builds on distributed tracing and OpenTelemetry, works like Chrome DevTools for your distributed services, helping you quickly troubleshoot microservices and prevent issues in distributed applications. You work with Aspecto while making changes to your code so you can understand the impact of those changes pre-production and deploy with confidence that nothing will break.
Using distributed tracing wasn’t without its headaches, though. The tracing system we were using flagged every single change in an API response as a potential issue, even when the changed response didn’t violate the contract. With some simple tweaks to how we triaged the issues the system identified, we were able to move forward with a much more solid setup.
Conclusion
When you’re building a microservice system, the complexity scales with the number of interconnections between the component services.
Our payment system, with only four microservices, encountered multiple types of contract violation as I worked with it, leading us to investigate tools for automated verification of service interactions and behaviors.
By using a mix of distributed tracing, contract testing, and a robust, well-documented GraphQL implementation, I was able to get the application into a more stable state, allowing me and my fellow developers to work more confidently with the system as it grew.