I’ve surveyed hundreds of people at events, and my data tells me that 80% of API teams are trading off between their APIs’ delivery speed and quality. The approaches that we took in the past to become API-driven just aren’t scaling anymore as our landscapes and requirements are changing. This article will look at how we could automate the API lifecycle with API Ops to give you speed and quality at scale.
Let us start with ACME, a large fictional bank with a sprawling tech landscape. The company has been around for several decades. They’ve got a lot of legacy systems and tools and multiple siloed engineering teams. As part of their digital transformation, they’re migrating most of their workloads to the cloud and Kubernetes like everybody else, and they’re adopting a more consistent API-driven approach. The mortgages team has just identified the next API they need to build. Emily has just finished designing it, and she is reviewing the spec with her team. They all agree it looks great. So, as per their normal process, she sends it off to the API platform team for review and moves on to her next task.
The API platform team owns ACME’s API platform and its overall architecture. They host and manage the platform on behalf of the rest of ACME, with the goal of raising engineering standards across the organization through reusable APIs. A group of them meets once a week to go through all the new APIs that have been submitted and check them against standards. Sadly, in this case, Emily’s spec is not approved. It turns out that there’s a whole set of standards that Emily doesn’t know about. They were probably documented somewhere at some point, but that documentation is probably out of date, and the standards are certainly not communicated well or in a developer-friendly way. So a week after she submits it for review, the platform team rejects Emily’s spec, which gets pushed back down to her. This is pretty embarrassing for Emily: she’s getting called out in front of her peers for not doing a good enough job, even though it wasn’t her fault. It is also a huge waste of everyone’s time, as Emily has to redo her work.
The platform team is doing these reviews manually at scheduled intervals, so several days are wasted waiting for each review to happen. It’s not just Emily and the mortgages team that suffer here either. ACME is following a best practice: they’re using a single API platform for global discovery and reuse across the business. As adoption grows, the platform team needs to support more and more teams across the organization. They get more and more APIs coming in for review, on top of all the other work that they have to do, so they end up being stretched thin. Rather than spending enough time fully reviewing every API, they have to prioritize. Compliance becomes a nightmare. Things start to fall through the cracks, which isn’t so good for the operations team responsible for maintaining the overall ideal state at ACME.
Enough has fallen through the cracks that there are a lot of errors in production, nothing is guaranteed to be consistent, and deployments are pretty painful. In fact, the ops team refuses to deploy new code more than once a week because it causes so much instability every time.
Elsewhere in ACME, the mobile team operates a little differently. This team’s goal is to build rich digital experiences for ACME’s customers in response to all the mobile-only banks threatening to displace them. The team has been given a lot of freedom to get their applications out as quickly as possible. They’re about to release the latest open banking app. This one’s a big deal for ACME because it’s the first time that they’re exposing actual API endpoints to customers and partners. Having seen the delays in getting APIs live elsewhere in ACME, the mobile team decided to do things their own way, and they bypassed the API platform team altogether. But they were in such a hurry to go live on time that they focused on the implementation code and missed some API best practices. This means that their APIs are inconsistent: they’re hard to find, hard to access, and hard to use. That puts consumers off, whether internal or external, and prospects are much more likely to go to one of ACME’s FinTech competitors who know how to treat APIs as products, because that is what makes an API consumable. Making matters worse, someone in the mobile team forgot to secure one of his APIs when he published it. It was then exploited, leading to a data breach affecting 50 million customer accounts. They started with all the right intentions here, but they ended up trading off between speed and quality. That trade-off is the biggest pain point in API adoption right now.
API Ops
API Ops is the automation of the full API lifecycle. It combines DevOps philosophies of iterative design and continuous testing with GitOps philosophies of automated, declarative deployments. Where before we saw manual, costly, and error-prone activities at ACME, we’re now going to automate all of them.
Best practice means that we design an API before we build it. Once it is deployed, we add governance and operational policies to manage it before making it discoverable to consumers in a portal. There are ongoing operations, and this lifecycle keeps going around until we retire the API. This is no different with API Ops. We’re still following best practices, but the processes we follow at each step and between each step have changed. At design time, we use a design environment that makes it easy to create the API spec. We also create a test suite for that spec.
Here, we should check several things. Are we getting the responses we expect in certain conditions? Are the responses following the right structure? What’s critical is that the tooling we use gives us instant validation: linting of the spec against best practices, including your organization’s own, and the ability to run those tests locally so that you can validate what you’re building as you build it. As the API designer, you need to have self-serve tooling that makes it easy to do the right thing from the beginning. You don’t want to end up like Emily. When you’ve created the spec and validated it locally, you push it into Git, or whichever version control system you use, and raise a pull request for this new API. This triggers a governance checkpoint embedded in our pipeline. Before any time is spent building the API, we need to be sure that what’s going to be built follows our company standards and is aligned with everything else in the ecosystem. So we automatically invoke the API tests that we built earlier and any other governance checks that we want to include at this stage of the pipeline. For example, are we paginating consistently across all of our APIs? There will be checks that the platform owners want to run for every API and that Emily and the other API designers are unaware of. But now this is not a manual review. It is automatic, and therefore it’s an instant process.
We can enable this through an open-source command-line tool that validates your specs and runs your tests. If the spec fails any of those tests, it gets automatically pushed back for more work in the design phase. Emily doesn’t have to sit around waiting for a response from the platform team. She gets an instant, automatic notification about what needs to change. Because this is an automated check embedded in the pipeline and triggered by default when a spec is pushed into Git, there is 100% coverage of these checks for every single API designed anywhere in ACME. We’re now consistently catching errors as close to the beginning of the pipeline as possible, which makes them much faster and cheaper to remediate. It has been estimated that finding and fixing an error at design time costs 1% of what it would in production.
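To make the governance checkpoint concrete, here is a minimal sketch of what such an automated check might look like. It is not the command-line tool mentioned above; the trimmed spec, the two house rules (every operation is secured, collection endpoints paginate with `limit` and `offset`), and the function names are all assumptions made for illustration.

```python
# Hypothetical, heavily trimmed OpenAPI-style spec for Emily's API.
SPEC = {
    "info": {"title": "Mortgages API", "version": "1.0.0"},
    "paths": {
        "/mortgages": {
            "get": {
                "parameters": [{"name": "limit", "in": "query"}],
                "security": [{"oauth2": []}],
            }
        },
        "/mortgages/{id}": {
            "get": {"parameters": [{"name": "id", "in": "path"}]},
        },
    },
}

HTTP_METHODS = {"get", "put", "post", "delete", "patch"}
# Assumed house rule: collection endpoints page with both of these parameters.
PAGINATION_PARAMS = {"limit", "offset"}


def lint_spec(spec):
    """Return a list of human-readable rule violations (empty means pass)."""
    problems = []
    global_security = spec.get("security")
    for path, item in sorted(spec.get("paths", {}).items()):
        for method, op in sorted(item.items()):
            if method not in HTTP_METHODS:
                continue
            label = f"{method.upper()} {path}"
            # Rule 1: every operation must declare security, locally or globally.
            if not (op.get("security") or global_security):
                problems.append(f"{label}: no security requirement")
            # Rule 2: GET on a collection (path not ending in a {param}) must paginate.
            if method == "get" and not path.endswith("}"):
                names = {p["name"] for p in op.get("parameters", [])}
                missing = PAGINATION_PARAMS - names
                if missing:
                    problems.append(
                        f"{label}: missing pagination parameter(s) {sorted(missing)}"
                    )
    return problems


def governance_gate(spec):
    """Exit-code style result for a pipeline step: 0 passes, 1 sends the spec back."""
    return 1 if lint_spec(spec) else 0
```

Run against the sample spec, `lint_spec(SPEC)` reports the unsecured `GET /mortgages/{id}` and the missing `offset` parameter on `GET /mortgages`, so the gate returns 1 and the designer sees exactly what to change. The same function can run locally at design time and again in the pipeline, which is what gives every spec the same checks.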
When the tests pass, we have a validated spec, and we can progress to the build phase. Here we build our API in the normal best-practice way. We use the spec as the contract that tells us what the API needs to do and what the interface should look like. We use the tests as we go to validate that the API we’re building meets the spec. When the developer commits their code, a series of tests is triggered, and we automatically execute the tests that we built at design time again to make sure that the API still meets our best practices. These tests are our unit tests, and they’ll also make sure that the implementation we just built functions how it should. There will probably be other tests we want to carry out at this stage. If any of those tests fail, we know immediately. We do not deploy the API; instead, we go back and make the necessary changes until our implementation is how we need it. We can keep executing these tests to validate what we’re doing continuously. When those tests pass, we progress to deployment.
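As a rough illustration of using the spec as the contract, the sketch below checks one response against a schema taken from the spec. The schema, the field names, and the `fake_handler` stand-in for the real implementation are assumptions for this example; a real suite would typically use a full schema validator.

```python
# Hypothetical response schema lifted from the spec (a tiny JSON-Schema-like subset).
MORTGAGE_SCHEMA = {
    "required": ["id", "principal", "rate"],
    "properties": {"id": str, "principal": int, "rate": float},
}


def check_response(status, body, expected_status=200, schema=MORTGAGE_SCHEMA):
    """Return a list of contract violations for one response (empty means pass)."""
    errors = []
    # Are we getting the response we expect under this condition?
    if status != expected_status:
        errors.append(f"status {status}, expected {expected_status}")
    # Is the response following the right structure?
    for field in schema["required"]:
        if field not in body:
            errors.append(f"missing field '{field}'")
    for field, expected_type in schema["properties"].items():
        if field in body and not isinstance(body[field], expected_type):
            errors.append(f"field '{field}' is not {expected_type.__name__}")
    return errors


def fake_handler(mortgage_id):
    """Stand-in for the implementation under test (an assumption for this sketch)."""
    return 200, {"id": mortgage_id, "principal": 250000, "rate": 3.1}
```

Calling `check_response(*fake_handler("m-1"))` returns an empty list, so the build would proceed. A response that drops `rate` or returns `principal` as a string produces a list of violations, and the pipeline would stop before deployment.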
We now start to see even more of a GitOps approach, because when this round of automated tests has passed, we automatically generate the declarative configuration file for this API. GitOps is all about declarative rather than imperative ways of managing deployments. It’s the modern way of managing infrastructure because it brings accelerated deployment speed, better auditability, and better repeatability, all of which we need given how much scale and complexity have grown compared to a few years ago.
Here’s a quick note for those who aren’t familiar with a declarative approach. A declarative approach is a lot more streamlined than the traditional imperative approach. If we’re doing things declaratively, we specify what we want the result of something to be. With the imperative approach, we must specify how to get that result, step by step. That is a pain to set up, a pain to debug if something goes wrong, and a pain to rewrite if and when one of the underlying admin APIs changes. If we’re doing it the declarative way, we don’t need to worry about any of that. We tell the platform what it needs to look like when that API has been deployed, and the platform itself takes care of how that’s achieved. The same is true in API Ops. The beauty here is that we shouldn’t even need to write that declarative config file ourselves. You can automatically generate it from the API spec in tools like Kong. Because it’s generated from the spec, it will be completely accurate and consistent with the spec. Nothing will be forgotten, and there’s no chance of human error in that deployment process.
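The difference between the two approaches can be sketched in a few lines. In this illustration the pipeline derives a desired-state document from the spec, and a `plan` function plays the role of the platform working out how to get there. The config layout, the upstream URL, and both function names are hypothetical; this is not Kong’s actual declarative format.

```python
# Hypothetical, trimmed spec; a real one would be a full OpenAPI document.
SPEC = {
    "info": {"title": "Mortgages API"},
    "paths": {
        "/mortgages": {"get": {}, "post": {}},
        "/mortgages/{id}": {"get": {}},
    },
}


def generate_config(spec, upstream_url):
    """Derive the desired state from the spec, so nothing is written by hand."""
    routes = []
    for path, item in sorted(spec["paths"].items()):
        methods = sorted(m.upper() for m in item)
        routes.append({"path": path, "methods": methods})
    return {
        "service": spec["info"]["title"].lower().replace(" ", "-"),
        "upstream": upstream_url,
        "routes": routes,
    }


def plan(current_routes, desired_config):
    """Work out how to reach the desired state; this is the platform's job, not ours."""
    desired = {r["path"] for r in desired_config["routes"]}
    current = set(current_routes)
    return {"add": sorted(desired - current), "remove": sorted(current - desired)}
```

With `config = generate_config(SPEC, "http://mortgages.internal:8080")`, the call `plan(["/mortgages", "/legacy"], config)` tells the platform to add `/mortgages/{id}` and remove `/legacy`. The pipeline only ever states the end result. The imperative steps are computed, not scripted, which is why they do not need rewriting when the platform’s admin API changes.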
The declarative configuration that is automatically generated as part of the pipeline tells the API platform what it needs to look like once the API has been deployed. The platform then goes off and configures itself. We end up with our API registered in the platform, with the various security, governance, and operational policies for that API configured. It’s also worth noting that we store this declarative config file in version control along with the spec, the tests, and the API’s implementation. You get a complete, searchable, and auditable history of every deployment you’ve made. If there’s ever a problem once you’ve deployed an API, you can easily roll back to a previous state. We’ve made not just deployments easier but rollbacks as well.
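The rollback story follows directly from keeping every config in version control. The sketch below uses an in-memory list as a stand-in for Git and a toy platform that simply converges on whatever config it is given; the class and function names are assumptions for illustration only.

```python
import copy


class Platform:
    """Toy platform that converges on whatever desired state it is given."""

    def __init__(self):
        self.state = {}

    def apply(self, config):
        self.state = copy.deepcopy(config)


class ConfigHistory:
    """Stand-in for version control: an append-only list of configs."""

    def __init__(self):
        self._versions = []

    def commit(self, config):
        """Store a snapshot and return its version number."""
        self._versions.append(copy.deepcopy(config))
        return len(self._versions) - 1

    def get(self, version):
        return copy.deepcopy(self._versions[version])


def rollback(platform, history, version):
    """Rolling back is just applying an earlier declarative config again."""
    platform.apply(history.get(version))
```

If version 0 has the auth policy enabled and version 1 accidentally drops it, `rollback(platform, history, 0)` restores the earlier state without anyone having to work out which individual changes to undo.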
Once we’ve deployed the API, we need to validate that it performs as expected and check that we haven’t caused any errors. We’re now in an environment where other APIs and other code have been deployed. Depending on where you are in your SDLC, we should do some integration testing, security testing, and performance testing. So we’ll run that series of release checks before we publish this API and make it discoverable. These checks should also be automated, although you may want a final sign-off as a manual step before pushing that publish button. When you are ready to publish, register the API in the portal to enable self-serve access. Adding the spec for that API to the portal should also be an automated process. Automating it is the only way to ensure that every API is discoverable and documented in the portal.
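A release gate of this kind might be sketched as follows. The check names, the dictionary standing in for the portal, and the `signed_off` flag for the optional manual step are all assumptions made for this example.

```python
def release(api_name, spec, checks, portal, signed_off=True):
    """Run every release check; publish to the portal only if all pass.

    `checks` is a list of (name, callable) pairs where each callable returns
    True on success. `portal` is a dict standing in for the developer portal.
    """
    failures = [name for name, check in checks if not check()]
    if failures or not signed_off:
        return {"published": False, "failures": failures}
    # Publishing registers the spec automatically, so docs cannot be forgotten.
    portal[api_name] = spec
    return {"published": True, "failures": []}
```

For example, passing `[("integration", lambda: True), ("security", lambda: False)]` as `checks` leaves the portal untouched and reports `security` as the failing check. Once every check passes and sign-off is given, the API and its spec land in the portal in the same step.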
As we’ve gone through the API lifecycle, we’ve built up an inventory of assets that enable us to operate this API on an ongoing basis in an almost entirely self-sufficient way. If we need to scale out the API, for example, to handle higher throughput, that can be completely automated using the declarative config. Since this is version-controlled, we get a completely repeatable deployment, identical to the one before. The overall result when our API lifecycles follow API Ops is that continuous automated testing and deployment let us catch and resolve errors and deviations from our standards early, speed up deployments, and raise quality and consistency.
ACME has just adopted API Ops. In the mortgages team, Emily’s working on another API. As before, she’s following best practices by doing this design first. But unlike before, the tool she’s using to create her design gives her instant feedback on it. She can make sure that the spec she’s building doesn’t violate any policies. She skips the several days of back and forth with the API platform team to get this right. Instead, it takes her just a few minutes. Once her spec meets standards, she can push it directly into Git, or whichever version control system ACME uses, so that it triggers the next part of the automated API Ops pipeline. This creates a pull request in Git for the API platform team to review and decide whether to approve it into the code base or reject it and send it back to Emily for more work. Life is very different now in the API platform team.
They’ve automated the API review process and applied it to every API coming in for review. So they have 100% coverage of every quality, security, and compliance check across every API built at ACME. This means their QA costs have gone way down, and they’re no longer the bottleneck for APIs being deployed. If there were a problem with Emily’s spec, she wouldn’t have to wait for the next scheduled review session because these automated reviews are triggered whenever a new API has been submitted. But this time, there weren’t any issues with Emily’s spec. The tool that she used at design time made it easy for her to do the right thing from the beginning. The chances of her API meeting ACME standards now are much higher.
Once the automated tests have passed, the last step is automatically generating the declarative config file from that spec. This is then added to Git to be picked up by the operations team. The API platform configures itself based on the declarative config. It registers the endpoints and applies and configures all the necessary policies. So, no more forgetting the security. The platform also automatically makes the endpoint discoverable in the portal. So Emily’s API is deployed immediately and smoothly. Deployments are much more likely to go smoothly with API Ops: because everything’s been tested, the chance of introducing problems is lower, and the deployment is completely automated and declarative. In fact, deployments are so repeatable that the operations team has removed their limit of one a week. They’re now deploying in a truly continuous fashion, and they can meet the increasing demands as API adoption accelerates across ACME.
Of course, there’ll still be times when things go wrong; that’s unavoidable. But the impact of something going wrong is now much easier to minimize. Since every version of each declarative configuration is in version control, we have a complete history of every deployment. Since these files are all declarative, it’s easy to revert to a previous state. The operations team only needs to feed one of the previous configurations to the platform, and it will revert itself.
Things are quite different for API consumers now. Through API Ops, we ensure every API is consistent, discoverable, secured, documented, and reliable. This means ACME’s portal is now a thing of beauty. It’s a catalog of products where each product is a well-designed API. ACME can now operate at pace without lowering delivery quality. They’re increasing quality while reducing costs, which means that they’ve got many more resources to innovate than they had before. They’re constantly delivering new capabilities and experiences to their customers.
This is the power of API Ops. It’s not just open to companies like ACME; every organization can automate the API lifecycle like this, given the right API tooling and the right API-first mindset.