David O’Neill is the CEO of APImetrics, a monitoring company that monitors APIs for various companies. This article by David discusses how APImetrics has helped look for potential challenges with API deployments.
At APImetrics, we sample, normalize, and analyze data to look for potential problems.
We score the quality of APIs. We have invented our own scoring system, which we call CASC. It gives us a way of comparing APIs against each other.
Office 365, Microsoft Teams, Google Workspace, DocuSign, and Box fall into a bucket of about 20 different APIs whose average score is about 8, and that is incredibly consistent. The score is designed for consistency. We don’t expect it to move around much. We’re not grading on a curve. We are just measuring the overall quality.
We will talk about some of the lessons we’ve learned from working with regulated APIs and some of our other challenges.
Lesson 1: Quicksand
Sandboxes turned into the bane of our lives. It was a regulatory requirement of British Open Banking that each bank provide a sandbox, and the banks were forced to provide these sandboxes two years before they had their open banking systems set up. Consequently, the sandboxes did not work: they did not represent anything related to the open banking architectures the banks eventually went with. They didn’t use the same security. Sometimes they didn’t even use the same types of APIs, and they didn’t use the same syntax. So many organizations got up and running on the sandbox, thought, “Great, I got the sandbox working,” tried to roll out their open banking implementation, and found that none of it worked. They had to go back to basics. But sandboxes are inherently useful. You should pay more attention to how well they work, and they should be part of everything else your DevOps people do.
They are important to the ecosystem’s health and shouldn’t be an afterthought that you bought from a company that has nothing to do with your architecture. Use a stubbed-off version of your production environment; don’t build something completely separate.
Lesson 2: Document what you do, not what you think you do
For some APIs, we could download the spec and read the docs; it was all standard. For others, we had to raise queries, which took more time because the documentation had a few problems. In some cases, the providers were not responsive to support queries, and they themselves did not seem to know what was wrong with their API stacks. One of the problems here can be the tooling involved. Test what you do in a production environment with somebody who doesn’t know it inside out, and you will learn some horrible things.
Lesson 3: Nothing beats end-to-end testing
People tend to test with tools that work inside their own stack. But the experience of using APIs is now an end-to-end, distributed experience. The third parties you work with do not live in your data centers or your stack; they live in the cloud. Clouds don’t all work the same way, and they don’t all work with your architecture the same way. If you are on a private cloud yourself, your rules may work great with some data centers but not particularly well with others. We’ve seen differences of as much as two seconds per transaction between calls made from Google and calls made from Amazon. We’ve seen differences of over a second between different Amazon data centers in the same geographical region because they’re working with different standards. It is something to be aware of when setting things up. Nothing beats measuring from where your partners are.
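As a rough illustration of measuring from where your partners are, the sketch below groups latency samples by the location they were taken from and reports the gap between the fastest and slowest location. The location labels, the sample timings, and the function names are all made up for this example; they are not APImetrics data or tooling.

```python
from collections import defaultdict
from statistics import median


def median_latency_by_location(samples):
    """Group (location, latency_ms) samples and return the median per location."""
    grouped = defaultdict(list)
    for location, latency_ms in samples:
        grouped[location].append(latency_ms)
    return {location: median(values) for location, values in grouped.items()}


def latency_spread_ms(medians):
    """Gap between the slowest and the fastest location, in milliseconds."""
    return max(medians.values()) - min(medians.values())


# Hypothetical samples: labels and timings are illustrative only.
samples = [
    ("cloud-a/region-1", 310), ("cloud-a/region-1", 290), ("cloud-a/region-1", 305),
    ("cloud-a/region-2", 1400), ("cloud-a/region-2", 1350), ("cloud-a/region-2", 1500),
    ("cloud-b/region-1", 2300), ("cloud-b/region-1", 2250), ("cloud-b/region-1", 2400),
]
medians = median_latency_by_location(samples)
spread = latency_spread_ms(medians)
print(medians, spread)
```

A single probe inside your own stack would report only one of these numbers, which is why the spread across locations is the figure worth watching.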
Lesson 4: Everybody and everything lies, especially HTTP codes
Everybody lies. DevOps teams lie to each other to make things better for themselves. Management tools lie to you to make you feel better about yourself. HTTP codes lie too. Everything needs to be looked at in far more depth than it usually is. One of the lessons we’ve learned is that you can’t just assume a 400 code is someone else’s problem.
Here is an anecdote from our experience with a customer whose consent server broke. They had a massive outage, although it didn’t last very long. But because of the way the consent server broke, they lost the consents for the API on the production accounts. That manifested itself as a 400 error, and because it was a 400, it was not escalated through their internal triage processes. Everybody pushed it around to different groups without realizing that the systems could no longer be monitored. If you just rely on gateway logs and look at what the codes are telling you, you’ll often get a false sense of security about what’s going on with your stack. And these problems rarely go away. They tend to build and escalate over time until the root cause manifests itself, and then you have something much more serious to fix.
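One way to avoid dismissing 400-class responses is to alert on a jump in their rate rather than on their mere presence. The sketch below is a minimal example of that idea; the function names and the `factor` and `floor` values are assumptions chosen for illustration, not recommended settings.

```python
def client_error_rate(status_codes):
    """Fraction of responses whose status code is in the 4xx range."""
    if not status_codes:
        return 0.0
    client_errors = sum(1 for code in status_codes if 400 <= code < 500)
    return client_errors / len(status_codes)


def should_escalate(baseline_codes, current_codes, factor=3.0, floor=0.05):
    """Flag a window whose 4xx rate is well above its usual level.

    `factor` and `floor` are illustrative defaults, not recommended values.
    """
    baseline_rate = client_error_rate(baseline_codes)
    current_rate = client_error_rate(current_codes)
    return current_rate >= floor and current_rate > factor * baseline_rate


# Hypothetical windows of status codes taken from gateway logs.
baseline = [200] * 98 + [400] * 2   # a normal day: 2% client errors
outage = [200] * 40 + [400] * 60    # consents lost: 60% client errors
quiet = [200] * 97 + [401] * 3      # ordinary noise: 3% client errors

print(should_escalate(baseline, outage))
print(should_escalate(baseline, quiet))
```

With these made-up windows, the outage is flagged and the ordinary noise is not, even though every failing response carries a 4xx code.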
Lesson 5: Monitoring
Let us consider an API in production. It wasn’t very fast to start with, at about one second per transaction. Over the course of three months, it got slower and slower, until it was taking about 12 seconds per transaction. There was no monitoring in place to catch it.
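A slow drift like this is easy to miss day to day but shows up clearly as a trend. The sketch below fits a least-squares line through daily median latencies and flags a sustained upward slope. The data is invented to resemble the case described (roughly one second rising to roughly twelve over 90 days), and the `max_slope_ms_per_day` limit is an arbitrary assumption.

```python
def slope_per_day(daily_medians_ms):
    """Least-squares slope of latency against day index, in ms per day."""
    n = len(daily_medians_ms)
    if n < 2:
        raise ValueError("need at least two days of data")
    mean_x = (n - 1) / 2
    mean_y = sum(daily_medians_ms) / n
    numerator = sum(
        (x - mean_x) * (y - mean_y) for x, y in enumerate(daily_medians_ms)
    )
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


def is_degrading(daily_medians_ms, max_slope_ms_per_day=20.0):
    """True when latency is rising faster than the allowed slope (assumed limit)."""
    return slope_per_day(daily_medians_ms) > max_slope_ms_per_day


# Hypothetical data: latency creeps up by 120 ms every day for 90 days.
creeping = [1000 + 120 * day for day in range(90)]
flat = [1000] * 90

print(slope_per_day(creeping), is_degrading(creeping), is_degrading(flat))
```

A check like this could run once a day against stored measurements; it complements, rather than replaces, threshold alerts on individual calls.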
Another variation we’ve seen on this is setting the alerting threshold far too high. Your API is usually very fast, so it can be functionally down for several hours a day without ever triggering an alarm. If you’ve set the threshold at two seconds and the API normally responds in 200 milliseconds, it takes a lot of latency and timeouts before you get told.
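The gap between a fixed threshold and normal behavior can be shown with a small comparison. The sketch below contrasts a static two-second alert with one that is relative to the recent median; the `multiplier` value, the function names, and the sample timings are assumptions for illustration only.

```python
from statistics import median


def static_alert(latency_ms, threshold_ms=2000):
    """Alert only when a call exceeds a fixed threshold."""
    return latency_ms > threshold_ms


def baseline_alert(latency_ms, recent_latencies_ms, multiplier=3.0):
    """Alert when a call is several times slower than the recent median.

    `multiplier` is an illustrative value, not a recommendation.
    """
    baseline = median(recent_latencies_ms)
    return latency_ms > multiplier * baseline


# Hypothetical recent history for an API that normally answers in ~200 ms.
recent = [190, 200, 210, 205, 195]
slow_call_ms = 1500  # 7.5x slower than normal, yet under two seconds

print(static_alert(slow_call_ms))
print(baseline_alert(slow_call_ms, recent))
```

Here the fixed threshold stays silent on a call that is many times slower than usual, while the baseline-relative check fires.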
To conclude, large companies have more financial bandwidth to address these issues. Neobanks are more adaptable. Traditional banks with legacy systems find it the most challenging to adapt to Open Banking.