Observability is an afterthought; it should not be

Suman Shil
4 min read · Mar 8, 2021

Today we will discuss the importance of observability and monitoring in modern cloud applications. In the last two decades, the world has seen many technology revolutions. One of them was the emergence of cloud-based technology. Anyone with a credit card can now create a data center and deploy applications. It also meant that the traditional way of developing and deploying software no longer worked, because the cloud is dynamic. Resources in the cloud are treated as “cattle”, not “pets”: servers, for example, can be terminated at any time and replaced by new ones. Software developers were exploring ways to write reliable software that could be deployed in such a dynamic environment.

Enter Microservices

Microservices architecture was the answer to this challenge. Organizations started transitioning to the new paradigm of software development. Big organizations like Amazon and Netflix showed that this architecture provides velocity, flexibility and competitive advantages. In this blog I will discuss my experience as a microservices developer. The project was to create software-defined data centers (SDDC) in a cloud environment. We developed microservices that were deployed in the cloud, and we created REST APIs that customers used to manage the data centers.

So far so good

We were happy with the velocity that we achieved with microservices. Teams were more focused, and features were being rolled out faster than ever before. We were able to scale out our applications easily, which addressed the concerns of a rapidly growing company. We had a solid CI/CD pipeline to ensure smooth product releases. We spent a lot of time designing loosely coupled services, datastores and message brokers, writing unit tests, and so on.

Well, all was not well

During one of the project meetings, while we were discussing our next feature, we got a message that teams were seeing exceptions in different microservices. Users were not able to access the APIs. All the teams huddled into a “war room” to figure out the root cause. It was not an easy task, because over time the services had become more complex in terms of code and functionality. It was like finding a needle in a haystack. After hours of debugging we realized that one of the database instances was down, which caused increased latency for database operations and, in some cases, operation timeouts. Over the next couple of months we saw more of this kind of firefighting, which we called “incidents”.

What went wrong

My observation from these incidents is that some of them were avoidable. I also felt that the teams I had worked with so far did not focus on observability early on. The focus was mainly on delivering as many features as possible. I can understand why that is a priority, but we also need to understand that the challenges organizations face today are different from the ones they faced a decade ago. With the advent of cloud computing and microservices, we need to focus on application reliability, observability and fault tolerance as much as on application performance, code correctness or software architecture. In cloud computing, failure is considered normal, yet applications are expected to be online all the time. Organizations that have a good business model but cannot ensure application availability and efficiency will lose their competitive advantage.

What we could do differently

When we design a cloud-based application, the design and implementation of monitoring and observability should be part of the design discussion from day one. Organizations should invest in observability and monitoring before embarking on a new project. Observability and monitoring is a big subject in itself, and I will not discuss it in detail in this blog. In a nutshell, there are three pillars of observability.

  1. Metrics
  2. Traces
  3. Logs

We need agents that collect metrics. Metrics can be either system metrics or application metrics, and they are used to monitor application or system health. We also use metrics to set up alerts that detect a deterioration in application health before it becomes catastrophic. Traces and logs provide observability: they give more context on “why” an incident happened. As I mentioned before, I will not discuss these subjects in detail, as there are many resources available online and many vendors in this domain. Instead, I will discuss how observability improves the development and operation of software.
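To make the three signals concrete, here is a minimal sketch using only the Python standard library. It is an illustration under stated assumptions, not the setup we used in the project: the `observed` helper, the `fetch_order` function, the `orders` logger name and the in-memory dictionaries are all hypothetical stand-ins for a real metrics agent and a real tracing system. Each wrapped call records a call counter, an error counter and a latency sample (metrics), and writes a structured log line that carries a shared trace ID (logs and traces).

```python
import json
import logging
import time
import uuid
from collections import defaultdict
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("orders")  # hypothetical service name

# In-memory stand-ins for a metrics agent; a real agent would export these.
counters = defaultdict(int)
latencies = defaultdict(list)


@contextmanager
def observed(operation, trace_id):
    """Record call/error counters, a latency sample and a structured log line."""
    start = time.perf_counter()
    status = "ok"
    try:
        yield
    except Exception:
        status = "error"
        counters[f"{operation}.errors"] += 1
        raise
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        counters[f"{operation}.calls"] += 1
        latencies[operation].append(elapsed_ms)
        # The trace ID lets us correlate every log line of one request.
        log.info(json.dumps({
            "trace_id": trace_id,
            "operation": operation,
            "status": status,
            "elapsed_ms": round(elapsed_ms, 2),
        }))


def fetch_order(order_id, trace_id):
    """Hypothetical database call; a negative ID simulates a timeout."""
    with observed("db.fetch_order", trace_id):
        if order_id < 0:
            raise TimeoutError("database operation timed out")
        return {"id": order_id}


trace_id = uuid.uuid4().hex  # one ID shared across a single request
fetch_order(1, trace_id)
try:
    fetch_order(-1, trace_id)
except TimeoutError:
    pass  # the failure is already counted and logged
```

The counters and latency samples answer "is something wrong?", while the log lines with a common trace ID help answer "why?".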

How observability is related to the incident

Going back to the database failure incident that I mentioned above: if we had set up metrics and alerts for database instance health and for the application code that calls the database API, there would have been a chance to detect the failure before it brought down the application. Alerts are set up by specifying a threshold; when a metric crosses the threshold, we notify the teams through channels such as Slack or PagerDuty. The performance of a database instance can degrade because of high memory or CPU usage, or because of hardware issues. Most of these issues show symptoms early and become worse over time. With monitoring in place we could detect those deviations. There are several open source tools (such as Prometheus) that support metric collection and alerting.
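The threshold idea can be sketched in a few lines of Python. This is a simplified illustration, not how Prometheus or any particular tool implements alerting: the `ThresholdAlert` class, the `db_latency_ms` metric name, the 200 ms threshold and the sample values are all assumptions made for the example, and the `sent` list stands in for a Slack or PagerDuty integration. The alert fires only after several consecutive samples breach the threshold, so a single spike does not page anyone.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ThresholdAlert:
    name: str
    threshold: float
    for_samples: int  # consecutive breaches required before firing
    notify: Callable[[str], None]
    _breaches: int = field(default=0, init=False)
    firing: bool = field(default=False, init=False)

    def observe(self, value: float) -> None:
        """Feed one metric sample and notify once when the alert starts firing."""
        if value > self.threshold:
            self._breaches += 1
        else:
            # A healthy sample resets the streak and resolves the alert.
            self._breaches = 0
            self.firing = False
        if self._breaches >= self.for_samples and not self.firing:
            self.firing = True
            self.notify(f"{self.name}: {value} exceeded {self.threshold}")


sent: List[str] = []  # stand-in for a Slack or PagerDuty integration
alert = ThresholdAlert(
    "db_latency_ms", threshold=200.0, for_samples=3, notify=sent.append
)

# Made-up latency samples: one isolated spike, then a sustained degradation.
for latency_ms in [40, 55, 210, 90, 220, 260, 310, 400]:
    alert.observe(latency_ms)
```

With these samples, the isolated 210 ms spike is ignored and exactly one notification is sent, on the third consecutive breach (310 ms).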

Conclusion

An initial investment in observability and monitoring saves many productive engineering hours. I have noticed that without effective monitoring, teams spend more time firefighting than on actual design and development, which burns out engineers and hurts their motivation. Implementing every aspect of monitoring and observability may not be feasible at the start, but we can add basic metrics and alerting from the beginning and implement advanced features as the project matures. The following are some of the well-known tools and software for observability.

Suman Shil

Software developer, Father, Optimistic, Eternal learner