OpenTelemetry for API Monitoring

Danielle Kayumbi has been a software engineer for 14 years, specializing in distributed architecture. She is currently a software engineer with Capgemini. In this article, she discusses OpenTelemetry for API monitoring.

In software engineering, we come across fatal crashes in production. There are many reasons for these crashes, but the two we will focus on are errors in the code base and attacks from hackers. Either can seriously affect a business by shutting down its systems for hours or more. One company that suffered serious consequences was Amazon Web Services, the leader in cloud computing. In 2021, it was hit by a long outage that took down popular websites and left millions of people around the world unable to use their services.

The trigger was a new automated job that scaled some services. It generated a huge volume of service requests, which congested the infrastructure and brought the network down. The IT team eventually resolved the issue, but it took nine long hours.

It is important to identify outages as early as possible. Outages can come from errors, latency, bottlenecks, or third parties, and tracking them down is even more complicated on distributed systems. It is therefore important to monitor the system to anticipate outages. Nowadays, however, monitoring alone is not enough, mainly because it is too deterministic: monitoring focuses on analyzing predefined data collected from individual systems. We now need to analyze that data continuously, drawing on past experience, to detect vulnerable attack vectors. We also need to understand the relationships between the components of a system: applications, microservices, servers, databases, and so on. For this, observability is the key to success. Monitoring is the tip of the iceberg; observability is what lies below the waterline. The two work together to give better insight into a system’s health.

Observability relies on telemetry data. Telemetry refers to data emitted by a system about its behavior. The data can be traces, metrics, and logs. Once this data is defined, we must generate, collect, manage, and export it using a framework. The solution is a framework called OpenTelemetry.

OpenTelemetry is a mechanism by which application code is instrumented to help make a system observable. It was accepted into the Cloud Native Computing Foundation in 2019 and moved to the Incubating maturity level in 2021. It is the result of merging OpenTracing and OpenCensus: OpenTracing provided a vendor-neutral API for distributed tracing, and OpenCensus provided libraries for collecting and exporting traces and metrics.

Distributed tracing is fundamental to OpenTelemetry. It tracks a single request throughout its journey from its source to its destination, unlike traditional tracing, which follows a request only through a single application domain. There are three main components in distributed tracing: logs, spans, and traces.
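To make the idea of following one request across services concrete, here is a minimal sketch of context propagation using the W3C Trace Context `traceparent` header layout (version, trace ID, parent span ID, flags). It is an illustration using only the Python standard library, not the OpenTelemetry SDK; the function names and the gateway, orders, and payments services are hypothetical.

```python
import re
import secrets

# W3C Trace Context layout: version-traceid-parentid-flags (lowercase hex).
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Start a new trace at the edge service, with the sampled flag set."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"


def child_traceparent(incoming: str) -> str:
    """Keep the caller's trace ID but mint a new span ID for this hop."""
    match = TRACEPARENT_RE.match(incoming)
    if match is None:
        # Missing or malformed context: start a fresh trace instead of failing.
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


# Hypothetical call chain: gateway -> orders service -> payments service.
gateway_header = new_traceparent()
orders_header = child_traceparent(gateway_header)
payments_header = child_traceparent(orders_header)

headers = (gateway_header, orders_header, payments_header)
trace_ids = {h.split("-")[1] for h in headers}
span_ids = {h.split("-")[2] for h in headers}
```

Every hop shares one trace ID while carrying its own span ID, which is what lets a backend stitch the separate services back into a single request journey.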

A log is a message emitted by a service or a component. On their own, logs are of limited use for tracking code execution, because they typically lack contextual information such as where they were called from. They become far more useful when included in a span.

A span represents a single unit of work with a start time and a duration, such as an HTTP call or a database query. It tracks a specific operation in the application code.

A distributed trace records the path of a request through a multi-service architecture: microservices, servers, databases, and so on. A trace is a collection of spans that share the same trace ID.
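The relationship between logs, spans, and traces can be sketched as a small data model. This is a simplified illustration in plain Python, not the OpenTelemetry data model or API; the `Span` and `Trace` classes, their fields, and the example operation names are all assumptions made for the sketch.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """One unit of work: a named operation with a start, an end, and log events."""
    trace_id: str
    name: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start: float = field(default_factory=time.perf_counter)
    end: Optional[float] = None
    events: List[str] = field(default_factory=list)

    def log(self, message: str) -> None:
        # A log attached to a span inherits its context (trace, operation, parent).
        self.events.append(message)

    def finish(self) -> None:
        self.end = time.perf_counter()

    @property
    def duration(self) -> float:
        if self.end is None:
            raise ValueError(f"span {self.name!r} has not finished")
        return self.end - self.start


@dataclass
class Trace:
    """A trace is the collection of spans that share one trace ID."""
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: List[Span] = field(default_factory=list)

    def start_span(self, name: str, parent: Optional[Span] = None) -> Span:
        span = Span(
            trace_id=self.trace_id,
            name=name,
            parent_id=parent.span_id if parent else None,
        )
        self.spans.append(span)
        return span

    def children_of(self, span: Span) -> List[Span]:
        return [s for s in self.spans if s.parent_id == span.span_id]


# Hypothetical request: an API call that runs one database query.
trace = Trace()
root = trace.start_span("GET /orders/42")
db = trace.start_span("SELECT orders", parent=root)
db.log("cache miss, querying database")
db.finish()
root.finish()
```

Because the log message is stored on the `db` span, it can be traced back to the request and the operation that produced it, which is the context a standalone log line lacks.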

Once this base is defined, we can create dashboards. Dashboards help collect and monitor aggregates, and analyze and visualize data in real time. We can build them around the four golden signals: latency, traffic, errors, and saturation. Latency shows how long it takes to service a request. Traffic shows how many user requests are received over time. Errors show how many requests fail because of application or infrastructure errors. Saturation shows how close the service is to its capacity.
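As a rough sketch of what sits behind such dashboards, the golden signals can be derived from a window of request records. The record shape, the choice of the 95th percentile for latency, the rule that 5xx responses count as errors, and the use of busy workers as a saturation measure are all assumptions made for this example.

```python
from dataclasses import dataclass
from statistics import quantiles
from typing import Dict, List


@dataclass
class RequestRecord:
    duration_ms: float
    status_code: int


def golden_signals(
    records: List[RequestRecord],
    window_seconds: float,
    busy_workers: int,
    total_workers: int,
) -> Dict[str, float]:
    """Summarize one time window of requests into the four golden signals."""
    durations = [r.duration_ms for r in records]
    server_errors = sum(1 for r in records if r.status_code >= 500)

    if len(durations) >= 2:
        # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
        p95 = quantiles(durations, n=100)[94]
    else:
        p95 = durations[0] if durations else 0.0

    return {
        "latency_p95_ms": p95,
        "traffic_rps": len(records) / window_seconds,
        "error_rate": server_errors / len(records) if records else 0.0,
        "saturation": busy_workers / total_workers,
    }


# Hypothetical 10-second window: 18 fast successes and 2 slow server errors.
window = [RequestRecord(100.0, 200)] * 18 + [RequestRecord(900.0, 500)] * 2
signals = golden_signals(window, window_seconds=10, busy_workers=6, total_workers=8)
```

A percentile is used for latency rather than the mean because a few slow requests can hide behind a healthy-looking average.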

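A Service Level Objective (SLO) sets a target for a measurable aspect of an API, such as performance, scalability, or availability. A Service Level Indicator (SLI) is a measurement of service behavior: it is the metric associated with the service level objective. For example, an SLO might require that 99.9% of requests succeed, and the matching SLI would be the measured success rate.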
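The link between an indicator and its objective can be shown with a small calculation. This sketch assumes an availability SLI defined as the share of successful requests and a hypothetical 99.9% target; the function names and the request counts are illustrative, not taken from any real service.

```python
def availability_sli(total_requests: int, failed_requests: int) -> float:
    """SLI: the measured fraction of requests that succeeded."""
    if total_requests == 0:
        return 1.0  # no traffic means nothing failed
    return (total_requests - failed_requests) / total_requests


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Fraction of the error budget left; negative once the SLO is breached."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo_target
    actual_failure = 1.0 - sli
    return (allowed_failure - actual_failure) / allowed_failure


# Hypothetical month: 100,000 requests, 40 of them failed, target of 99.9%.
SLO_TARGET = 0.999
sli = availability_sli(total_requests=100_000, failed_requests=40)
meets_slo = sli >= SLO_TARGET
budget_left = error_budget_remaining(sli, SLO_TARGET)
```

Tracking the remaining error budget, rather than only a pass or fail flag, makes it possible to raise an alert before the objective is actually missed.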

To conclude, outages need to be addressed in three steps. The first is prevention, which is where observability comes in. The next is studying the observations, which involves monitoring and alerts. If the system still fails, the final step is resolving the issue based on your expertise.

Danielle Kayumbi
I have been a software engineer for over 12 years and am currently Managing Director of my company, DK Wave Technology, publisher of the innovative “Smart Attendance” payroll solution, dedicated to universities and training and learning centers. Sharing my knowledge is my main driving force. Since 2017, I have contributed to the development of open source solutions and spoken at conferences on technical subjects that fascinate me.
