Phil is a bike nomad who creates API design tools for Stoplight.io, writes articles and books about pragmatic API development and systems architecture, and runs a reforestation charity called Protect Earth. In this article, Phil shares some API horror stories.
I was asked to join the company to help solve their API problems. API A would talk to B, C, D, and E entirely synchronously in web threads, which was probably the heart of the problem. Some of these APIs would take two to five seconds to respond on a good day and 20 to 30 seconds on a bad day, so chaining them up could take quite some time. They suffered from both over-fetching and under-fetching: responses carried far more information than clients needed, yet clients still had to make a high number of requests to get everything, because the APIs simply mirrored the normalized database instead of being based on what people actually needed. There was no HTTP caching anywhere, so you had to make lots of these very slow requests every time you wanted the information, even if nothing had changed. Error formats differed between APIs, and different versions of the same API would give you subtly different error formats; people would often think they were sending a string to an interface and get back an error object instead, and customers would see this all the time. Auth was enabled per endpoint and disabled in testing, because the testing for the entire API was done as unit testing, attacking it at the class level instead of over HTTP. As a result, some sensitive production endpoints had no authentication at all, which is not what you want, ever. And none of the APIs had any documentation. Let's dig into that one a little bit, because it is a huge problem.
The mindset was, “We’re too busy for documentation; we don’t have any time. We'll rewrite it if we’re struggling.” That is not a good approach, but it is the basis of a lot of the API design-first vs. code-first debate.
We would plan the API somehow, whether on a whiteboard in a meeting, in a Slack chat, or in a random document somewhere. They would then write a bunch of code, and a month or two later it would be ready for some customer feedback. Hopefully, the customer wouldn’t have too much feedback, because after spending a month or two writing code, there wasn't much time left to implement that feedback before it was time to deploy. Once it’s deployed, it’s in production. Maybe it’s not quite what the customer needs, but close enough. Then the mindset would be: we’ll write the documentation later; we have a few performance issues to fix, a few extra features to add, some random tech debt to solve, or a different project to rush off to. A new customer would appear some months later, and with no docs available, the API developer would have to tell them how it works. And they’ve mostly forgotten, because they’ve been working on another project. They look back at the code and all the awkward rewrites that made it hard to read. So they create a new version, or an API with a new name and new concepts, because that might meet the new customer's demands. This happens over and over.
No clients will ever be able to use the newer versions of the API because –
- The new API has been designed for another client’s requirements
- There’s no documentation.
From these horror stories (or just moans), we have come up with a few solutions.
The API Design-First Workflow
It is a huge simplification of how things work. It involves writing your OpenAPI description first, which is the part people assume takes extra time, but in reality it's generally quicker; a lot of companies report speeding up by around 60% this way. You design the OpenAPI first using something like Stoplight Studio or any of the other editors around, or you can write it by hand if that’s your thing. Then “mocks and docs” tools can take that OpenAPI and make a mock of your API, which acts as a prototype that people can try out and make requests against. You can find out if they’re making 100 requests because you normalized your database, if information they need doesn't exist, or if it’s in the wrong format. Then you turn that same file into documentation. You can use that to get customer feedback quickly, which gives you more time to iterate before you have to start writing the code. And when you do write the code, you can use that OpenAPI to simplify the codebase.
In the past, you would write some validations in the model, some in the controller, and more in your contract tests. Instead, a generic middleware can read the OpenAPI description, give good errors like “Hey, this string is required” or “this should be an email address,” and reject invalid data. It also means you deploy the code with documentation, because the docs are just running off the OpenAPI, and you know the OpenAPI is correct because it’s powering your code and your contract tests. When customers request new functionality, you’ve already got an up-to-date OpenAPI to which you can add a new endpoint or a few more properties. This is an excellent way to keep everything up to date.
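As a sketch of what that generic, schema-driven validation might look like, here is a minimal hand-rolled version in Python. The schema, field names, and error wording are illustrative; a real setup would validate requests against the OpenAPI document itself rather than this toy schema.

```python
# Minimal sketch: one generic validation step driven by an OpenAPI-style
# schema fragment, instead of scattering checks across models and controllers.
# The schema and field names are invented for illustration.
import re

USER_SCHEMA = {
    "required": ["email", "name"],
    "properties": {
        "email": {"type": "string", "format": "email"},
        "name": {"type": "string"},
        "age": {"type": "integer"},
    },
}

TYPE_MAP = {"string": str, "integer": int}
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate(body: dict, schema: dict) -> list:
    """Return a list of human-readable errors; empty means the body is valid."""
    errors = []
    for field in schema.get("required", []):
        if field not in body:
            errors.append(f"'{field}' is required")
    for field, rules in schema.get("properties", {}).items():
        if field not in body:
            continue
        if not isinstance(body[field], TYPE_MAP[rules["type"]]):
            errors.append(f"'{field}' should be a {rules['type']}")
        elif rules.get("format") == "email" and not EMAIL_RE.match(body[field]):
            errors.append(f"'{field}' should be an email address")
    return errors
```

A middleware would run `validate` on every incoming body and return a 422 with the error list before the request ever reaches a controller.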
Cache Stampede Cluedo
The performance and stability of an application depend on having a good cache. Every request coming through an API would need to make requests to one to five other APIs, so they just slapped a cache on it. And that’s fine until the cache breaks. Then comes the stampede: all of a sudden, all of those requests are unleashed on your dependencies, and they start washing things away.
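A minimal sketch of stampede protection: guard each cache key with a lock, so that when an entry expires only one caller recomputes it while the rest wait and reuse the result. A real deployment would use Redis or memcached with a distributed lock; everything here is illustrative.

```python
# Sketch of cache-stampede protection: when an entry expires, only one
# thread recomputes it per key; the others block briefly and reuse the
# freshly written value instead of hammering the slow upstream service.
import threading
import time

class StampedeSafeCache:
    def __init__(self):
        self._values = {}                  # key -> (value, expires_at)
        self._locks = {}                   # key -> per-key recompute lock
        self._guard = threading.Lock()     # protects the locks dict itself

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def get(self, key, compute, ttl=60.0):
        entry = self._values.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]                # fresh hit: no upstream call
        with self._lock_for(key):          # only one thread recomputes per key
            entry = self._values.get(key)
            if entry and entry[1] > time.monotonic():
                return entry[0]            # someone else refreshed it meanwhile
            value = compute()              # stand-in for the slow upstream call
            self._values[key] = (value, time.monotonic() + ttl)
            return value
```

Without the per-key lock, every request arriving after expiry would call `compute()` at once, which is exactly the stampede described above.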
We had this one monolith that a client was attacking, and we had only one clue: the User-Agent “Faraday: 091”, from a popular Ruby HTTP client that all of our apps used. Luckily, they were so bad at keeping dependencies up to date that we could use that specific version as a unique fingerprint to find out which app was doing the damage. We looked through them all and found that exactly one application had it. So we just took it off the servers. It broke a bit of stuff, but it stopped the entire company from being broken, because our monolith was no longer getting murdered by all those additional requests.
Some more solutions –
- Set User-Agent to “App Name (deploy version/hash)” to help trace transactions
- Set up OpenTelemetry (opentelemetry.io) to trace transactions
- Use a service mesh to control exactly which internal apps/APIs can talk to what. Service meshes usually have tracing ready to go, and they are a massive help with the next most significant problem: the distributed monolith.
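The first suggestion might look like this. The app name and the `DEPLOY_SHA` environment variable are assumptions for illustration, not anything from the original systems:

```python
# Sketch: stamp every outbound HTTP call with an identifying User-Agent so
# that a misbehaving app can be traced from a server's access logs, rather
# than relying on an accidental library-version fingerprint.
import os

def traceable_headers(app_name: str) -> dict:
    # The deploy hash would typically be injected at build/deploy time
    # (for example, the git SHA). "dev" is just a local fallback.
    deploy = os.environ.get("DEPLOY_SHA", "dev")
    return {"User-Agent": f"{app_name} ({deploy})"}

# Usage with any HTTP client, e.g.:
#   requests.get(url, headers=traceable_headers("billing-api"))
```

With this in place, the cache-stampede whodunnit above becomes a one-line log search instead of an audit of every app's lockfile.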
The Distributed Monolith
Microservices are pretty great, unless you don’t design them well; then they’re worse. There's a brilliant quote by Scott Wlaschin: “If you switch off one of the microservices, and anything else breaks, you don’t really have a microservice architecture; you have a distributed monolith.” The distributed monolith always starts off as a wonderfully designed microservice architecture, with clean separation and domain logic owned by different services. Over time, as different apps and interfaces need more functionality, you start making more requests to more APIs. Eventually, everything talks to everything. You might as well have had it all in one codebase, because at least that way they could share memory; all you’ve done is add multiple network calls and slow everything down.
In an unnamed coworking company, this was pretty much how the architecture looked. There were two huge monoliths in the middle that required everything and that everything required, plus loads and loads of smaller apps, APIs, and everything else. One of the biggest problems was that random, really slow things would happen all the time. There were no timeouts anywhere. You would quite often see transactions taking 30 seconds, and the only reason they were only 30 seconds is that Heroku (a platform-as-a-service provider) cut them off at 30 seconds; otherwise they could have gone on a lot longer. APIs and external services taking a long time would slow down everything.
This is an image from New Relic. The dark green spike at the top is “Web external,” and it drives up request queuing, which means incoming requests have no available web workers to respond to them. Some downstream API has spiked, and this API is doing poorly as a result, because there are no timeouts.
Let’s say we’ve got a client talking to service A, which talks to service B. Service A has a few endpoints; one is called “OK,” and the other is “Slow.” “Slow” talks to service B, which is having a bad day. Say there are six worker processes available. A few requests come in: one hits “Slow” and gets stuck, then a second hits “Slow” and gets stuck. These will eventually get processed, but not quickly enough, and soon the service goes down; as far as clients are concerned, this API no longer exists, because its entire capacity is spent waiting for a service that might never respond. None of the people who need the “OK” endpoint can reach it, because there’s nothing left to talk to them. This was a massive problem in a company where everything required everything, because one API would randomly go down. If someone ran an unplanned migration that locked a huge table, people would say the migration was the problem. In reality, that one slow query would hurt the performance of, say, the memberships API, and that could take out everything around it: anything that required that API would start to go slow, processes would get stuck waiting for it to respond, and because of the interdependencies, those systems would go down too. The entire company would crash and burn, and everything would be on fire simultaneously.
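The walkthrough above can be sketched with a bounded wait: cap how long a worker may block on the downstream call, so “Slow” degrades gracefully instead of eating every worker. The delay, timeout, and response strings are illustrative; with a real HTTP client you would simply pass something like `timeout=2` to the request call.

```python
# Sketch: a web handler that refuses to wait forever on a slow dependency.
# Instead of a worker being stuck for 30+ seconds, it gives up after the
# timeout and returns a fast, honest error. Names and values are invented.
import concurrent.futures
import time

def call_downstream(delay: float) -> str:
    time.sleep(delay)          # stand-in for a slow HTTP call to service B
    return "ok"

def slow_endpoint(timeout: float = 0.2) -> str:
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(call_downstream, 1.0)   # service B is having a bad day
    try:
        return future.result(timeout=timeout)    # bounded wait
    except concurrent.futures.TimeoutError:
        return "503: downstream timed out, try again later"
    finally:
        pool.shutdown(wait=False)  # don't hold the web worker hostage
```

The worker is freed after `timeout` seconds no matter what service B does, so the “OK” endpoint keeps working even while “Slow” is failing.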
The post mortems would always conclude that the entire company crashed because this one API had a problem with migrations, and that we should be more careful with migrations in the future. They never addressed the serious underlying issue; people would just bump up their server instances. And there's a real-world problem with that. There’s a climate crisis, and throwing more resources at poorly designed architecture instead of fixing the architecture is how we’ve got to the point where the internet is currently responsible for around 3.7% of global emissions, about as much as flying. We, as software engineers, might not be able to do much about people flying too much, but we can stop wasting resources on poor architecture, because around 80% of internet traffic is API requests right now, and we control those.
Some solutions –
- Don’t start out adding network calls; it doesn’t always need to be microservices. Start with a monolith, then spin out things that make sense based on functionality, not just on your normalized database, because then everything has to sync with everything else.
- Create SLAs and stick to them.
- Set timeouts on every HTTP call, matching the SLA.
- Expect failure, and do something smart when it happens: queue requests, or hide features that aren’t working right now from your interface.
- Don’t let the API crash.
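A circuit breaker, for instance, can be sketched in a few lines: after a run of consecutive failures it stops calling the flaky dependency for a cool-down period and fails fast instead of queuing up slow requests. Thresholds and error messages here are invented for illustration.

```python
# Sketch of a minimal circuit breaker. After `max_failures` consecutive
# errors, the circuit "opens" and calls fail fast for `reset_after` seconds,
# giving the struggling dependency room to recover.
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, func):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None     # cool-down over: allow a trial call
            self.failures = 0
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0             # any success resets the count
        return result
```

Wrapping each downstream client in a breaker like this is what turns “one API randomly goes down” into a contained failure instead of a company-wide one.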
Busy day in Australia, and nobody poops!
On the first day of the month, when people turned up at their office, there would be people at the front desk trying to get a keycard to access the building. The front desk team would assign keycard A to Fred, and then Fred could go and do whatever on his floor, in the places he was allowed, and use the bathroom. The time taken to provision these keycards was huge, mostly because there were only three servers worldwide: Europe, and the West and East Coasts of America. That worked well when we were small, and it was still just about fine as we grew, because most of the growth was in the States. But once we spread across the world, it got busy. Australia is in the earliest timezone, so when they had a busy day, it would stall the entire system.
Because no one had any timeouts anywhere, and there were no circuit breakers, there was no way of containing bad behavior. The wobbly keycard API would start to crash, and that meant everything was broken: the main “user and company” API would crash too. The front desk team would sit there trying to add a keycard, and it might take a minute or two to respond.
Meanwhile, there’s a queue of people literally out the door. Eventually, they have to say, “I’m sorry, Fred; I’ll let you in the front door; you go do what you want.” But they couldn’t go to the bathroom because that needed a key too! This happened every month, and it wasn’t good.
Because the “user and company” monolith was also handling auth tokens, the entire rest of the company crashed. The vendor's logging didn’t show anything over 100 milliseconds, but thanks to a traffic proxy we tracked their responses taking two minutes. We didn’t have any particular trust in them fixing it anytime soon, so I copied and pasted all of the keycard code into a new service. It was still synchronous; it literally just moved to a different server, with a timeout, and we redirected the traffic to it. The keycard service would still crash, but it no longer wiped out the rest of the company.
Some solutions –
- Demand SLAs for third-party services
- Pipe external traffic through proxies
- Avoid hitting APIs in a web thread whenever possible, especially if they’re not under your control.
- Use background workers to sort things out.
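The last two points might be sketched like this: the web handler enqueues a job and returns immediately, while a background worker drains the queue. In production you would reach for Sidekiq, Celery, or similar; the keycard scenario and names here are illustrative.

```python
# Sketch: moving a slow third-party call out of the web thread. The handler
# enqueues a job and answers instantly with 202 Accepted; a background
# worker performs the actual (slow, flaky) vendor call later.
import queue
import threading

jobs = queue.Queue()
provisioned = []      # stand-in for the effect of the vendor API call

def request_keycard(user_id: str) -> str:
    jobs.put({"user_id": user_id})   # fast: just enqueue, don't wait
    return "202 Accepted: keycard being provisioned"

def worker():
    while True:
        job = jobs.get()
        if job is None:              # shutdown sentinel
            break
        # Here a real worker would call the keycard vendor, with retries
        # and a timeout; a failure delays one job instead of a web request.
        provisioned.append(job["user_id"])
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()
```

Fred still gets his keycard a few seconds later, but a bad day in the vendor's API no longer ties up web workers or takes down auth for the whole company.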
Mutually Assured Destruction Problem
The two mega monoliths required each other, and not just in the sense that A needs B and B needs A. When you requested user information from the “user and company” service, you would get back a massive blob of JSON, maybe 200 kilobytes, sometimes thousands of lines, and it could take 10 to 20 seconds. That’s because basic user information, user profile information, and user locations (to show which buildings you’re allowed into) were all mixed together. So whenever a client made a request to “subscription and billing” that only needed something basic like the locale, it had to fetch the entire user resource from the “user and company” API. But “user and company” isn’t in control of locations and subscriptions, so it had to call back into the subscriptions API, making two requests: one to get the user’s memberships to find their locations, and one to get the company’s memberships to find their locations. Those responses were pretty slow on a good day; it might take 20 seconds to put it all together. The worst part is that “subscription and billing” didn’t even care about that information, yet it was forced to wait on a service that was, in turn, calling it. If the “user and company” API started having a slow time for any reason, the subscriptions API started to go slow too, and vice versa. Either way, if either of them got slow, they both got slower and slower until one of them crashed, and then the other crashed. That meant the entire company was down, because everything required one or both of these mega monoliths.
The simple solution is, if you’re designing a new API, don’t jam all that stuff in there in the first place. In this case, we just exposed the locale information on a new endpoint so clients could fetch it directly and cache it.
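That kind of fix, letting clients ask for just the fields they need, might look like a simple `fields` query parameter. The resource shape and field names here are invented for illustration.

```python
# Sketch: a sparse-fieldset style endpoint. A billing client that only needs
# the locale asks for it explicitly, instead of receiving (and waiting for)
# the whole 200 KB user blob with locations stitched in from other APIs.
FULL_USER = {
    "id": "u1",
    "locale": "en-AU",
    "profile": {"name": "Fred", "bio": "..."},
    "locations": ["bldg-1", "bldg-2"],   # expensive: comes from another API
}

def get_user(fields=None):
    if not fields:
        return FULL_USER                 # legacy behaviour: the whole blob
    wanted = {f.strip() for f in fields.split(",")}
    return {k: v for k, v in FULL_USER.items() if k in wanted}

# e.g. GET /users/u1?fields=locale  ->  get_user("locale")
```

The expensive `locations` lookup then only runs when someone actually asks for it, which breaks the mutual dependency between the two monoliths for the common case.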
- Stop designing for HTTP/1 (smashing everything into one mega call).
- Use HTTP/2 and HTTP/3 to multiplex multiple requests.
- If clients want more data, they can ask for it; that’s what an HTTP request is. You only get what you ask for.
- Use timeouts and circuit breakers, so simple requests can succeed.
- Get an API architecture or governance team to review changes.