Eugene Wong and Ryan Ashnell are Engineers with GOVTECH, Singapore. In this article, they discuss API Monitoring with SRE.
In this article, you will discover the underlying technologies that power Singapore government APIs and learn how we design our central API monitoring dashboards around the SRE principles.
APEX
APEX is a full-fledged API management platform of the Singapore Government Tech Stack (SGTS) that allows government agencies, businesses, and developers to manage and share their APIs. APEX gateways exist in two main zones: the internet and the government intranet. This lets a diverse range of government agencies host their APIs with us, whether on the internet or in their own intranet spaces, and we provide the support for these APIs to coexist. We have usable examples in our live production setting that developers can use to create meaningful APIs.
The other benefit APEX provides is that we follow the best practices of API management, which helps our users stay secure and relevant. So, if you onboard with us, we will take care of all the heavy lifting for you, and you can focus on and enjoy your API innovations at your own time and pace.
StackOps
StackOps is based on the Elastic Stack. It is a key monitoring component of the SGTS, designed to boost observability and support SRE. As APEX developers, we have found that it greatly reduces our operational overheads. Because it is software as a service, we can run the Elastic Stack without worrying about all the configuration and settings needed to get it running. It has also shortened the time we need to resolve issues, and it lets us monitor the four golden signals: latency, traffic, errors, and saturation.
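As a rough illustration of the four golden signals, the sketch below derives them from a window of gateway request logs. The `RequestLog` fields, the `capacity_rps` parameter, and the choice of a nearest-rank 95th percentile are assumptions made for this example, not the APEX or StackOps schema. Saturation is approximated here as traffic divided by an assumed capacity; a real system would more likely measure resource utilization directly.

```python
import math
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class RequestLog:
    """One gateway request (hypothetical fields, not the StackOps schema)."""
    latency_ms: float
    status: int


def golden_signals(
    logs: Sequence[RequestLog], window_seconds: float, capacity_rps: float
) -> Dict[str, float]:
    """Summarize latency, traffic, errors, and saturation for one time window."""
    if not logs or window_seconds <= 0 or capacity_rps <= 0:
        raise ValueError("need at least one log and positive window and capacity")

    # Latency: nearest-rank 95th percentile of request latencies.
    latencies = sorted(r.latency_ms for r in logs)
    rank = max(1, math.ceil(0.95 * len(latencies)))
    p95_latency_ms = latencies[rank - 1]

    # Traffic: requests per second over the window.
    traffic_rps = len(logs) / window_seconds

    # Errors: share of requests that failed with a server-side (5xx) status.
    error_rate = sum(1 for r in logs if r.status >= 500) / len(logs)

    # Saturation: observed traffic relative to the assumed capacity.
    saturation = traffic_rps / capacity_rps

    return {
        "p95_latency_ms": p95_latency_ms,
        "traffic_rps": traffic_rps,
        "error_rate": error_rate,
        "saturation": saturation,
    }
```

Calling `golden_signals(logs, window_seconds=60, capacity_rps=500)` on one minute of logs would yield a single dashboard data point per signal.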
Intersection of APEX and StackOps
We have a three-layer approach when it comes to monitoring and observability.
The first layer is the observation stage of the API lifecycle. As our consumers' active applications call public APIs on our platform, all their traffic and transactions are logged in StackOps. We take extra care to redact sensitive information, and we allow the tenants to decide what they want to monitor.
The next layer is the users of our platform. For example, our API managers use the API portals to perform different activities: subscribing to APIs, recycling their API keys, creating new API keys, and so on. These transactions are also logged in StackOps. So, if a feature has been published and breaks in production, we are notified quickly because all of this is tracked in our consolidated monitoring ecosystem.
The last layer contains the critical metrics for our infrastructure. The infrastructure configuration is also captured in StackOps.
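The redaction step mentioned for the first layer could look something like the sketch below, which masks sensitive fields before a transaction record is shipped to the logging pipeline. The key names in `SENSITIVE_KEYS` and the `[REDACTED]` placeholder are hypothetical; the actual fields APEX redacts are not described here.

```python
from typing import Any

# Hypothetical list of sensitive field names; a real deployment would
# maintain its own list.
SENSITIVE_KEYS = frozenset({"authorization", "x-api-key", "password", "set-cookie"})


def redact(record: Any) -> Any:
    """Return a copy of a log record with sensitive values masked.

    Walks nested dicts and lists; key matching is case-insensitive.
    """
    if isinstance(record, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS else redact(value)
            for key, value in record.items()
        }
    if isinstance(record, list):
        return [redact(item) for item in record]
    return record
```

Because the function returns a new structure rather than mutating its input, the original request object remains usable by the gateway after logging.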
Site Reliability
We would like to give special credit to our reference material, The Site Reliability Workbook: Practical Ways to Implement SRE, published by O’Reilly, specifically the third release. Although we speak about monitoring as carried out in APEX, an API gateway, these principles apply to any API backend as well.
There are seven SRE principles:
- Risk
- SLOs
- Eliminating Toil
- Automation
- Release Engineering
- Simplicity
- Monitoring
Monitoring is used to measure the performance of systems as well as the latency and speed of processes. Monitoring is the cornerstone of SRE. Below are some principles related to monitoring:
Interfaces: The same data can be displayed differently to different audiences. In APEX, we create multiple dashboards.
Ownership and Tooling: Use the same tooling regardless of function or job title. So, in APEX, we have implemented common monitoring tools to examine business metrics, API performance, and fraud logging.
Speed: Current data should be available when you need it. Stale data can lead to wrong interpretations. Earlier, we noticed that API metrics going through the logging pipeline could not provide live data. So, we evaluated our logging infrastructure and optimized the architecture to enable quick data turnaround.
Modeling: Critical user journeys help us capture our customers’ experience. Draw a high-level system architecture diagram; show critical components, request flow, data flow, and critical dependencies. Each metric should serve a purpose.
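One way to guard against the stale data described under Speed is to measure how far the newest indexed event lags behind the current time and flag the dashboard when that lag exceeds a threshold. This is only a sketch: the function names and the 60-second default are assumptions, not values taken from APEX.

```python
from datetime import datetime, timedelta, timezone


def ingestion_lag(newest_event_time: datetime, now: datetime) -> timedelta:
    """How far the newest indexed event is behind the current time."""
    return now - newest_event_time


def is_stale(
    newest_event_time: datetime,
    now: datetime,
    max_lag: timedelta = timedelta(seconds=60),  # assumed threshold
) -> bool:
    """True when the dashboard data is older than the allowed lag."""
    return ingestion_lag(newest_event_time, now) > max_lag


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    newest = now - timedelta(minutes=5)
    print("stale" if is_stale(newest, now) else "fresh")
```

Both timestamps should be timezone-aware (or both naive) so that the subtraction is valid.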
Dashboarding – Preparing for major events
Typically, when we design a custom dashboard, we follow these principles:
- The dashboard will show the business metrics that are important for the event.
- Provide links to other critical data that will allow troubleshooting.
A dashboard built this way helps the customer track status codes, latency, and similar metrics.
JWT Authentication
JWT authentication is a client assertion-based security mechanism for APIs, which allows us to incorporate authentication, authorization, data integrity, and non-repudiation. It is loosely based on a JWT authorization header: the API consumer signs the JWT, and the APEX system verifies its claims and signature. This dashboard helped our publishers and consumers troubleshoot authentication and authorization issues and pinpoint errors quickly. It enabled users to diagnose issues using a self-serve model, which saved time for DevOps.
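To make the sign-then-verify flow concrete, here is a minimal standard-library sketch in which the verifier returns a short reason string for each failure, the kind of value a troubleshooting dashboard could group by. It rests on several assumptions: it uses HS256 (a shared secret) only to stay dependency-free, whereas non-repudiation requires an asymmetric algorithm such as RS256 or ES256; the claims checked (`exp`, `aud`) and the reason strings are hypothetical and not APEX's actual rules.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_jwt(claims: dict, secret: bytes) -> str:
    """Consumer side: build and sign a compact JWT (HS256, for illustration)."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"


def verify_jwt(token: str, secret: bytes, audience: str, now: Optional[float] = None) -> str:
    """Gateway side: return "ok" or a short failure reason for the dashboard."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(signature_b64)
    except ValueError:  # wrong segment count, bad base64, or bad JSON
        return "malformed_token"
    if not isinstance(header, dict) or not isinstance(claims, dict):
        return "malformed_token"
    if header.get("alg") != "HS256":
        return "unsupported_algorithm"

    # Check the signature first, in constant time, before trusting any claim.
    expected = hmac.new(
        secret, f"{header_b64}.{payload_b64}".encode("ascii"), hashlib.sha256
    ).digest()
    if not hmac.compare_digest(signature, expected):
        return "bad_signature"

    current = time.time() if now is None else now
    expiry = claims.get("exp")
    if not isinstance(expiry, (int, float)) or expiry <= current:
        return "expired"
    if claims.get("aud") != audience:
        return "wrong_audience"
    return "ok"
```

Counting the returned reason strings per consumer is one way a self-serve dashboard could show whether failures stem from clock or expiry problems, a wrong key, or a misconfigured audience.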
To conclude, the principles of SRE help us in API monitoring.