Ken Chua is a Technical Account Manager with New Relic. In this article, he discusses Observability at scale.
Change is a constant. In a digital business, there are changes every day, so resilience is about how we withstand change, adapt to it, and become stronger.
When we change things, they tend to break. Deploy a wrong code change to production and the application may fail there. Around 31% of faults are due to software defects, simple things like an array index out of bounds or a null pointer exception, and 21% are due to data formats. Most of these incidents can be resolved by rolling back the code change.
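To make those fault classes concrete, here is a minimal, hypothetical Python sketch (the function names and data are invented for illustration): an unguarded list access fails like an array index out of bounds, and dereferencing a missing value fails like a null pointer exception.

```python
# Hypothetical illustration of the two fault classes mentioned above.
def first_item_name(order):
    # Assumes every order has at least one line item -- an empty list raises
    # IndexError, Python's equivalent of an array-index-out-of-bounds fault.
    return order["items"][0]["name"]

def shipping_city(order):
    # Assumes the address is always populated -- if it is None, the call below
    # raises AttributeError, the analogue of a null pointer exception.
    return order["address"].get("city")

good = {"items": [{"name": "book"}], "address": {"city": "Singapore"}}
bad  = {"items": [], "address": None}

print(first_item_name(good))   # "book"
print(shipping_city(good))     # "Singapore"
# first_item_name(bad)         # IndexError: list index out of range
# shipping_city(bad)           # AttributeError: 'NoneType' object has no attribute 'get'
```

A change that slips past a happy-path test in either of these ways is exactly the kind of incident a rollback resolves.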
So, to become resilient, we need to take a few steps –
- Put measurements in place – When you measure, you can analyze where things are going wrong and fix them. If you want to improve, you first need to understand where you are now and where you want to be.
- Observe early in the development cycle to get ahead of problems. For example, if you are deploying a code change, measure the new piece of code to check that it is not slowing things down, does not have runaway loops, etc.
- Add business context to your observations to understand business events and business impact. Analyze the direct and indirect monetary impact if the application, or part of it, goes down because of a change (a minimal instrumentation sketch follows this list).
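As a sketch of the first and third points, assuming a plain Python service and no particular vendor SDK (the event shape and field names below are invented for illustration), you can wrap a critical code path with a timing measurement and attach business context such as the order value to every recorded event:

```python
import json
import random
import time

def record_event(name, duration_ms, **business_context):
    # In practice this would be sent to your observability backend; printing a
    # structured event here just makes the shape of the data visible.
    print(json.dumps({"event": name, "duration_ms": round(duration_ms, 2), **business_context}))

def checkout(cart_total):
    start = time.monotonic()
    time.sleep(random.uniform(0.01, 0.05))  # stand-in for the real checkout work
    duration_ms = (time.monotonic() - start) * 1000
    # Business context: the monetary value at risk if this path slows down or fails.
    record_event("checkout", duration_ms, cart_total=cart_total, currency="USD")

checkout(cart_total=120.50)
```

With the duration and the cart value on the same event, a slowdown can be expressed directly as revenue at risk rather than just milliseconds.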
Get your data right
About 20 years ago, we created our own silos: different teams working toward the same purpose, each with its own tools. Through our conversations with customers, we observed that 94% use two or more monitoring tools, and 82% use four or more. Sometimes one tool isn't capable of monitoring everything and you need another to fill the gap, but we should try to avoid that, primarily because of the cost and complexity involved. More tools require more maintenance.
When your system is down, or developers are trying to find the root cause of an event, multiple tools mean a lot of redundant data to look at and reconcile before anyone can reach a conclusion.
New Relic has a mission statement, “Data for all engineers.” In 2008, New Relic invented cloud APM for application engineers. Today, it is the source of truth for all engineers to make decisions with data, not opinions. It has more than 30 capabilities built in, from application monitoring to mobile monitoring.
Observability Maturity
We generally have three stages of observability maturity –
Reactive – If a tool is used to understand your system, measure the impact of change and reliability, and fix issues as they arise, it is called reactive maturity. That is where most companies are.
Proactive – If you use the data to optimize your software, resource usage, and so on, that is proactive maturity. For example, if you are using your monitoring tool to improve a business metric, such as reducing your bounce rate or increasing the number of purchases made through the system daily, you are in the proactive stage.
Preventative – The final stage is the holy grail, something to aspire to because not many organizations reach it. You are in the preventative stage when you use data to prevent problems before they happen.
Practice the problem-solving methodology
Now that you have a lot of data, you need to analyze it to get to the root of the problem. At New Relic, there is a particular methodology we usually recommend to customers when they are at a loss; a sketch of the flow follows the list of steps below.
Symptom – We may start with customer complaints or other symptoms.
Service – Check the service that is impacted. This may need analysis, as it may not be the service at the topmost layer.
Was there a change? – Once we know the impacted service, we check whether there was a change to that service. If there was, we can roll it back.
Are there any errors? – If you decide to move ahead instead of rolling back, check the errors in the service. If there are errors, you can fix those.
Rule out local dependencies – If there are no errors, look at the local dependencies: memory usage, CPU usage, etc. If something is wrong there, fix it.
Rule out external dependencies – If nothing is wrong there, you can check external dependencies, like the database, API calls, etc.
Next downstream service – If the external dependencies are healthy too, move on to the next downstream service and repeat the same checks.
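Here is a rough sketch of that flow in Python. The helper functions (was_recently_deployed, error_rate, and so on) and the service names are hypothetical stand-ins for queries you would run against your monitoring data:

```python
# Hypothetical stubs -- in practice these would query your monitoring backend.
def was_recently_deployed(service): return service == "checkout-api"
def error_rate(service): return 0.0
def local_dependencies_unhealthy(service): return False
def unhealthy_external_dependency(service): return None
def next_downstream_service(service): return {"checkout-frontend": "checkout-api"}.get(service)

def triage(service):
    """Walk one service through the troubleshooting steps; return a suggested action."""
    if was_recently_deployed(service):
        return f"roll back the latest change to {service}"
    if error_rate(service) > 0:
        return f"fix the errors reported by {service}"
    if local_dependencies_unhealthy(service):        # memory, CPU, disk, ...
        return f"address resource saturation on {service}"
    broken = unhealthy_external_dependency(service)  # database, third-party APIs, ...
    if broken:
        return f"investigate external dependency {broken}"
    nxt = next_downstream_service(service)
    if nxt:
        return triage(nxt)                           # repeat for the next hop
    return "no obvious cause found; widen the investigation"

# Start from the service closest to the reported symptom.
print(triage("checkout-frontend"))  # -> roll back the latest change to checkout-api
```

The recursion at the end mirrors the last step: if nothing is wrong at this hop, apply the same checks to the next service downstream.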
Now we know how to get the data, and we have a methodology for solving problems. Next, we need to make sure the practice is sustained and grows. We recommend having a framework, say, an Observability Centre of Excellence: set up a core team from multiple functional areas and get them together to explore and adopt the tools. They will be the ones to spearhead the practice and propagate the information to the entire organization. You can also have volunteers who are not part of the core team but belong to individual agile teams.
So, with resiliency, measuring the correct data, analyzing it, and building a centre of excellence, you can achieve observability at scale.