Jose Haro Peralta is a consultant, author, and instructor. He works with organizations of all sizes, from small companies to big corporations like AIG and IKEA, and helps them build complex distributed systems and their integrations. In this article, Jose discusses API observability.
Suppose we’re building an API or a distributed system. We want to understand how users engage with the API, what the typical user flow looks like, and how many unauthorized requests we receive per second. With traditional access logs, the request logs that any web server produces, it is usually quite difficult to answer most of these questions. They don’t contain enough information to understand how the system is working. We need something more sophisticated, and that’s where observability comes in.
Observability means collecting the data needed to give you insight into how the API is being used. Without observability, you’re most likely getting hacked without knowing it, losing customers without knowing it, and missing out on crucial feedback about the quality of your APIs. These are serious problems if you’re working with APIs.
APIs are things we create, control, and operate ourselves. There is no reason not to know every single detail about what’s going on with them. Observability won’t solve those problems, but it will highlight them. It gives us visibility.
As per the standard definition, observability is the ability to measure and describe the internal states of a system based on its outputs: traces, logs, and metrics. These three outputs are what we call the three pillars of observability.
OpenTelemetry is an open-source observability framework and a Cloud Native Computing Foundation (CNCF) project. The philosophy behind it is to make it easy to implement observability in our systems: the tooling should be easy to inject into our systems so that we can reap the benefits of observability.
Three pillars of observability
- Logs are records of specific events, such as information about the URL that was requested, the HTTP method, the status code of the request, and so on.
- Metrics are measures that capture system behavior, like availability and performance. They measure CPU utilization, memory utilization, number of requests per second, number of unauthorized requests per second, average latencies, etc.
- Traces allow us to follow the lifecycle of a request throughout our system. In distributed systems, a request that comes into one service, say an order service, often triggers further requests to other services. The problem is that once the request leaves the order service, we lose track of it. Whatever happens in the downstream service is not visible from the order service’s perspective, so if an error occurs there, connecting the events is difficult. Traces solve this by attaching a trace ID to every request that passes through the system. This ID helps us connect the dots between the events.
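The trace ID idea can be sketched in a few lines. This is a minimal illustration rather than how any particular tracing library works: the `X-Trace-Id` header name, the `payments` service, and the in-memory `events` list are all assumptions made for the example (the W3C Trace Context standard uses a `traceparent` header instead).

```python
import uuid

# Hypothetical header name, chosen for this sketch only.
TRACE_HEADER = "X-Trace-Id"

# Stand-in for a log backend shared by all services.
events = []


def ensure_trace_id(headers):
    """Return a copy of the headers that is guaranteed to carry a trace ID."""
    out = dict(headers)
    if TRACE_HEADER not in out:
        out[TRACE_HEADER] = uuid.uuid4().hex
    return out


def record(service, message, headers):
    """Store an event tagged with the trace ID of the current request."""
    events.append(
        {"service": service, "trace_id": headers[TRACE_HEADER], "message": message}
    )


def payment_service(headers):
    # Hypothetical downstream service that fails.
    record("payments", "charge failed", headers)


def order_service(headers):
    headers = ensure_trace_id(headers)
    record("orders", "order received", headers)
    payment_service(headers)  # the outgoing call forwards the same trace ID
    return headers[TRACE_HEADER]


def events_for(trace_id):
    """Collect every event, from any service, that belongs to one request."""
    return [e for e in events if e["trace_id"] == trace_id]
```

Calling `order_service({})` returns the generated trace ID, and `events_for` with that ID returns both the order service event and the downstream failure, which is the "connecting the dots" step described above.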
Good API Observability
Good observability serves different stakeholders, whether we work in cybersecurity, governance, product management, or development. Each of us should be able to look at our system’s output and get the information relevant to our role. The security team should be able to draw a picture of the security posture of our APIs. The governance team may want to answer questions about which APIs are essential or how APIs are used. The development team may want to trace requests or restore the system after a failure. We should be able to do all those things with the same output.
We should be able to trace user flows through the API and replay them to reproduce errors. The output should also give us insight into user behavior. This feedback is important for improving our APIs.
Good observability is tailored to the specific needs of our business. In other words, observability speaks the language of the business. When we see an error in the logs, it is not a random POST request error; it is a specific error in a specific business context.
Ideally, we should be able to look into our logs, search for business events, and find the errors in context. This also helps us as developers and API practitioners: we can communicate these errors to the business more effectively because they speak the language of the business.
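One way to make logs speak the language of the business is to emit structured events that name the business operation alongside the HTTP details. The sketch below uses only the Python standard library; the event name `order_payment` and the field names are illustrative assumptions, not a schema from the article.

```python
import json
import logging

logger = logging.getLogger("orders")


def business_event(event, status, **context):
    """Build a JSON log line that names the business event, not just the HTTP call."""
    return json.dumps({"event": event, "status": status, **context}, sort_keys=True)


# Hypothetical example: a failed payment for a specific order.
line = business_event(
    "order_payment",
    "failed",
    order_id="ord-123",
    reason="card_declined",
    http_status=402,
)
logger.error(line)
```

Because every line carries an `event` field, searching the logs for `order_payment` failures becomes a simple filter rather than a hunt through generic POST errors.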
Observability for API Security
According to research by IBM, it takes organizations, on average, nearly 300 days to identify and contain a data breach. We need to do better than that. Traditionally, data breaches happened through phishing attacks, but recently, there has been an increase in data breaches through APIs.
APIs are things we build, operate, and have full control of, so there is no reason for us not to be prepared to see everything going on with the API, detect any malicious activity, and flag it as soon as it happens.
Lack of observability means we’re not ready to tackle security issues in our APIs. We don’t have the necessary level of readiness to address and manage our API security posture. So that’s the first thing we have to address.
The security landscape is continuously evolving. Hackers keep finding new and creative ways to exploit our systems. Even if our systems are completely secure at a single point in time, they will keep changing: we will add functionality, endpoints, and parameters, and those new elements can open new holes. We have to continuously assess and observe our security vulnerabilities.
The benefit of observability for API security is that we can monitor user behavior around the clock, looking for unusual behavior, unexpected flows, and unexpected ways of using the API. For example, as soon as we detect an unusual data transfer from an endpoint, we should flag it and investigate.
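As a rough illustration of flagging unusual data transfers, the sketch below compares each response size against the historical median for the same endpoint. The threshold factor of 10, the data shapes, and the function name are assumptions made for this example; a production system would likely use a more robust anomaly detection method.

```python
from statistics import median


def flag_unusual_transfers(history, recent, factor=10):
    """Flag responses much larger than the endpoint's historical median.

    history: {endpoint: [response_bytes, ...]} from past traffic
    recent:  list of (endpoint, user, response_bytes) tuples to check
    """
    flagged = []
    for endpoint, user, size in recent:
        baseline = history.get(endpoint)
        if not baseline:
            continue  # no baseline yet, so nothing to compare against
        if size > factor * median(baseline):
            flagged.append((endpoint, user, size))
    return flagged


# Hypothetical traffic: one response is far larger than usual.
history = {"/orders": [1200, 1500, 1300]}
recent = [("/orders", "alice", 1400), ("/orders", "mallory", 250000)]
suspicious = flag_unusual_transfers(history, recent)
```

Here `suspicious` contains only the 250000-byte response, since the median for `/orders` is 1300 bytes and the threshold is therefore 13000 bytes.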
Observability for API Governance
As part of governance, we want to be able to trace user flows and analyze user experience. We want to be able to answer the question of whether the API is being used as intended, whether the design is good enough, whether the user experience is good, and whether there is something we have to do to optimize user experience. Our main focus is to ensure that the APIs are correctly documented and deliver business value and a good user experience.
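To check whether an API is being used as intended, one option is to rebuild each user’s sequence of calls from the logs and compare it with the flow the design expects. The sketch below assumes log events are available as `(timestamp, user, endpoint)` tuples and that the intended flow is known; both the data shape and the endpoint names are hypothetical.

```python
from collections import defaultdict


def user_flows(events):
    """Group (timestamp, user, endpoint) events into per-user call sequences."""
    flows = defaultdict(list)
    for _, user, endpoint in sorted(events):  # sort by timestamp first
        flows[user].append(endpoint)
    return dict(flows)


def off_path_users(flows, intended):
    """Return users whose call sequence differs from the intended flow."""
    return sorted(user for user, flow in flows.items() if flow != intended)


# Hypothetical intended flow: create an order, view it, then pay.
intended = ["POST /orders", "GET /orders/1", "POST /orders/1/pay"]
events = [
    (3, "ann", "POST /orders/1/pay"),
    (1, "ann", "POST /orders"),
    (2, "ann", "GET /orders/1"),
    (2, "bob", "POST /orders/1/pay"),
]
flows = user_flows(events)
deviating = off_path_users(flows, intended)
```

A user who deviates is not necessarily doing anything wrong; a large share of deviating users may instead suggest that the design or the documentation does not match how people actually work with the API.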
Observability for API Operations
One question that observability addresses here is whether we have shadow APIs and zombie APIs. We should also be able to trace problems across distributed applications, and to identify and diagnose problems when they occur, not after the customer notifies us.
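Shadow and zombie APIs can be surfaced by comparing observed traffic with what we believe we are running. The sketch below uses working definitions that are assumptions for this example: a shadow endpoint receives traffic but appears in no documentation, and a zombie endpoint is marked as deprecated but still receives traffic. The endpoint names are hypothetical.

```python
def classify_endpoints(documented, deprecated, observed):
    """Compare observed endpoints against documented and deprecated ones."""
    observed = set(observed)
    return {
        # Traffic to endpoints we have no record of.
        "shadow": sorted(observed - set(documented) - set(deprecated)),
        # Traffic to endpoints we thought were retired.
        "zombie": sorted(observed & set(deprecated)),
    }


report = classify_endpoints(
    documented={"GET /orders", "POST /orders"},
    deprecated={"GET /v1/orders"},
    observed=["GET /orders", "GET /v1/orders", "GET /internal/debug"],
)
```

In this example `report` lists `GET /internal/debug` as shadow and `GET /v1/orders` as zombie, which gives the operations team a concrete list to investigate.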
Finally, we should be able to understand our system’s topology and the dependencies between services. We might find that two services talk to each other more often than the other components, and we may be able to optimize this.
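If trace data records which service called which, a first view of the topology is just a count of calls per pair of services. This is a simplified sketch: the `(caller, callee)` span format and the service names are assumptions, and real trace data carries far more detail.

```python
from collections import Counter


def call_counts(spans):
    """Count calls between each pair of services, ignoring direction."""
    return Counter(tuple(sorted(pair)) for pair in spans)


# Hypothetical spans extracted from traces as (caller, callee) pairs.
spans = [
    ("orders", "payments"),
    ("payments", "orders"),
    ("orders", "inventory"),
]
busiest = call_counts(spans).most_common(1)
```

The pair at the top of `most_common` is a candidate for optimization, for instance by batching calls or reconsidering the boundary between the two services.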
To summarize, API observability is hard, but it is necessary. It is a precondition for assessing and addressing our API security posture. Good observability is tailored to our business needs and speaks the language of the business, which lets us communicate with multiple stakeholders and align with them on errors.