Setting up Service Monitoring — The Whys and Whats
You have created an app and are ready to release it to the world. But are you monitoring it? How will you know if something is off or has simply stopped working? This is where monitoring comes into the picture.
You will find lots and lots of guides about monitoring your applications. However, most of them focus on setting up a particular framework or language integration. What they don’t discuss is which metrics you should be monitoring. This blog post focuses exclusively on that topic.
For monitoring our microservices, we can start with the four golden signals from the Google SRE book — latency, traffic, errors, and saturation.
Health checks
This is the most basic form of monitoring you should implement for your application. You are most likely already using health checks if you use a load balancer, Kubernetes liveness probes, or Consul.
Implementing a health check is relatively simple: your service should respond to it within a reasonable timeout. For an HTTP service, you can design a “ping” endpoint that returns a 200 when called; for other protocols, you can create equivalent ping functionality. This checks the basic reachability of your application. Depending on your needs, you may want to return success only when the service’s own dependencies are also reachable.
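As an illustration, here is a minimal sketch of such an endpoint using Flask (any HTTP framework works similarly); the /ping route and the check_dependencies helper are hypothetical names, not part of any standard:

```python
# Minimal health-check endpoint, sketched with Flask.
from flask import Flask, jsonify

app = Flask(__name__)

def check_dependencies() -> bool:
    # Optionally ping required downstream services (database, cache, ...)
    # here and return False if any of them is unreachable.
    return True

@app.route("/ping")
def ping():
    if check_dependencies():
        return jsonify(status="ok"), 200
    return jsonify(status="degraded"), 503

if __name__ == "__main__":
    app.run(port=8080)
```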
Infrastructure metrics
Your second step should be to monitor the infrastructure metrics exported by your platform. Collect and monitor data from the OS, such as CPU, memory, IO, and network utilization. Your cloud provider may expose some of this data as well. If you are using Docker or Kubernetes, collect those metrics too. Your monitoring system likely has integrations that export this data with little effort. If you are using Prometheus, installing node_exporter on your machines will collect host-level data across your infrastructure. Similarly, the Prometheus Kubernetes operator will fetch metrics from your Kubernetes cluster.
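In practice you would simply run node_exporter, but to show the mechanics, here is a toy exporter sketched with the psutil and prometheus_client libraries (the port number and metric names are arbitrary):

```python
# Toy host-metrics exporter; in real deployments, prefer node_exporter.
import time

import psutil
from prometheus_client import Gauge, start_http_server

cpu_percent = Gauge("host_cpu_percent", "CPU utilization in percent")
mem_percent = Gauge("host_memory_percent", "Memory utilization in percent")
disk_percent = Gauge("host_disk_percent", "Disk utilization of / in percent")

if __name__ == "__main__":
    start_http_server(9101)  # metrics served at http://localhost:9101/metrics
    while True:
        # cpu_percent(interval=None) reports usage since the previous call.
        cpu_percent.set(psutil.cpu_percent(interval=None))
        mem_percent.set(psutil.virtual_memory().percent)
        disk_percent.set(psutil.disk_usage("/").percent)
        time.sleep(15)
```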
Response codes and errors
We want as few errors as possible, so we start by counting them. A simple way of tracking these is to add middleware to your framework that tracks requests and responses. For an HTTP microservice, you may track total requests, successes and failures, and the time taken by each request.
Depending on your monitoring system, you will want to assign labels or tags that indicate the context of each event, so that you can drill down into the stats for each endpoint. Some tags I frequently find myself adding are the URI, the response status code, and shard indicators for sharded services (like country code or app version). The exact set depends on your requirements; a sketch of such a middleware follows.
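A minimal sketch, assuming Flask and prometheus_client; the metric and label names below are a common convention, not a standard:

```python
# Request/response counting as Flask middleware.
from flask import Flask, request
from prometheus_client import Counter

app = Flask(__name__)

http_requests_total = Counter(
    "http_requests_total",
    "Total HTTP requests",
    ["method", "endpoint", "status"],
)

@app.after_request
def count_request(response):
    http_requests_total.labels(
        method=request.method,
        endpoint=request.path,  # beware: raw paths can explode label cardinality
        status=str(response.status_code),
    ).inc()
    return response
```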
Latency and timings
Timing is a crucial indicator of performance — you certainly want to track this. Monitor your request and response timings across your stack in as fine-grained a manner as possible. The most accessible place to start is your application server: similar to the request/response status monitoring, you can begin with a middleware that logs request and response times, as sketched below. Extend that to any proxies or load balancers, and break down how much time is spent on routing decisions, network latency, SSL handshakes, and so on.
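A minimal sketch of such a timing middleware, again assuming Flask and prometheus_client (the bucket boundaries are illustrative):

```python
# Request-latency timing as Flask middleware, using a Histogram.
import time

from flask import Flask, g, request
from prometheus_client import Histogram

app = Flask(__name__)

request_latency = Histogram(
    "http_request_duration_seconds",
    "Time spent handling HTTP requests",
    ["method", "endpoint"],
    buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5),
)

@app.before_request
def start_timer():
    g.start_time = time.perf_counter()

@app.after_request
def record_latency(response):
    elapsed = time.perf_counter() - g.start_time
    request_latency.labels(request.method, request.path).observe(elapsed)
    return response
```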
After you have a high-level picture, focus on the time spent by your supporting infrastructure and business logic. If you have some number crunching, you should track those timings separately. If you are hitting a database or another service, track their timings as well.
Database performance metrics
I have come across it multiple times — if something is slow, start by checking your database. Databases are massive beasts that do a lot of time-consuming IO and, unfortunately, don’t scale that well. However, your database likely exports a ton of data that can be very useful for debugging issues or predicting a future performance regression. Table sizes, query timings, tuples read per query, and tuples returned are some of the stats you should be monitoring. Database monitoring is so vast that it deserves a post of its own.
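As one example, assuming PostgreSQL and the psycopg2 driver, the pg_stat_user_tables view exposes several of these stats; other databases have similar views (the connection string below is a placeholder):

```python
# Pull table-level stats from PostgreSQL's statistics collector.
import psycopg2

QUERY = """
SELECT relname,
       pg_total_relation_size(relid) AS total_bytes,
       seq_tup_read,
       idx_scan,
       n_live_tup
FROM pg_stat_user_tables
ORDER BY total_bytes DESC;
"""

with psycopg2.connect("dbname=mydb user=monitor") as conn:
    with conn.cursor() as cur:
        cur.execute(QUERY)
        for name, size, seq_reads, idx_scans, live_tuples in cur.fetchall():
            print(name, size, seq_reads, idx_scans, live_tuples)
```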
Cache metrics
A cache helps speed up your services by keeping frequently used data readily available. Monitor the cache size, hit rate, miss rate, and eviction rate. The goal of a cache is to maximize the number of hits, so the miss ratio should be as low as possible without risking serving stale data. If it is too high, you should reconfigure your cache and maybe even revisit your caching strategy.
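A sketch of hit/miss counting with prometheus_client; cached_get and load_from_source are hypothetical helpers, and the cache client is a stand-in for Redis, Memcached, or similar:

```python
# Count cache hits and misses at the lookup site.
from prometheus_client import Counter

cache_hits = Counter("cache_hits_total", "Cache lookups that found a value")
cache_misses = Counter("cache_misses_total", "Cache lookups that missed")

def cached_get(cache, key, load_from_source):
    value = cache.get(key)
    if value is not None:
        cache_hits.inc()
        return value
    cache_misses.inc()
    value = load_from_source(key)
    cache.set(key, value)
    return value
```

The hit ratio itself is best computed in the monitoring system, e.g. as rate(hits) / (rate(hits) + rate(misses)).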
Queue metrics
If you are utilizing message queues to process jobs, you must monitor them. Producer and consumer counts, produce rates, consumption rates, lag, and failures are critical numbers to track. Setting alerts for when any of these metrics become abnormal is also essential; it provides an early warning of potential failures.
Additionally, monitor the metrics published by your queue software itself (Kafka or RabbitMQ, for example) that relate to your queue’s performance.
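One way to track lag, sketched with a prometheus_client Gauge; fetch_lag is a hypothetical helper that would query your broker’s admin API, and the queue and group names are illustrative:

```python
# Periodically export consumer lag as a Gauge.
import time

from prometheus_client import Gauge, start_http_server

consumer_lag = Gauge(
    "queue_consumer_lag",
    "Messages produced but not yet consumed",
    ["queue", "consumer_group"],
)

def fetch_lag(queue: str, group: str) -> int:
    # Placeholder: ask the broker (Kafka, RabbitMQ, ...) how far behind
    # this consumer group is on this queue.
    return 0

if __name__ == "__main__":
    start_http_server(9102)
    while True:
        consumer_lag.labels("orders", "billing").set(fetch_lag("orders", "billing"))
        time.sleep(30)
```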
Business Metrics
Apart from the generic metrics discussed above, your application will likely have custom metrics you want to track. Things like the number of completed orders or transactions are application- and business-specific. These numbers give assurance about how the overall system is performing, and in case of an incident, they help assess the real impact.
Such metrics vary from case to case. A metric could be a counter, a gauge, or a timer, and you can use the libraries provided by your metrics platform to expose it from your application.
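As a sketch, assuming prometheus_client, a completed-orders counter and an order-value histogram might look like this (all names illustrative):

```python
# Expose business metrics directly from application code.
from prometheus_client import Counter, Histogram

orders_completed = Counter(
    "orders_completed_total", "Orders successfully completed"
)
order_value = Histogram(
    "order_value_dollars", "Value of completed orders in dollars",
    buckets=(10, 50, 100, 500, 1000),
)

def complete_order(order):
    # ... business logic runs here ...
    orders_completed.inc()
    order_value.observe(order.total)
```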
Having good metrics is essential for a reliable architecture. Metrics form the source of truth for service alerts and performance numbers, and they give direction on what to do next to improve, fine-tune, and optimize the performance and stability of your services.