Microservice Observability

Distributed System Observability

Overview

Distributed system observability includes, but is not limited to, log aggregation, application metrics, audit logging, distributed tracing, exception tracking, health check APIs, and logging of deployments and changes. The sections below cover some basic capabilities that improve developer efficiency and overall system maintainability. In a large-scale system, they help developers monitor system behavior (at both the hardware and software level), identify bottlenecks in the current implementation (at both the service and infrastructure level), debug and locate bugs in the codebase, run system-wide health checks with alert and incident notifications, and follow the progress and status of new deployment rollouts and rollbacks.

Logs

Log aggregation provides fully customizable log tracing, waterfall logging that connects entries across associated services, and exception logging. The logging system should also provide an easy-to-use web interface that supports keyword lookup, so developers can quickly identify issues or run internal business analytics.

Tool options: the ELK stack, one of the most popular logging stacks in the industry. GitLab also supports ELK cluster integration to run full-text searches in the repository → monitor → logs tab.
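To make keyword lookup and cross-service correlation practical, services usually emit structured logs that an aggregator can index. The following is a minimal sketch using only the Python standard library, not an ELK-specific setup. The field names (`service`, `request_id`), the `JsonFormatter` class, and the `build_logger` helper are assumptions made for illustration.

```python
import io
import json
import logging
import uuid


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record):
        entry = {
            "level": record.levelname,
            # "service" and "request_id" are hypothetical field names passed via `extra`.
            "service": getattr(record, "service", "unknown"),
            "request_id": getattr(record, "request_id", None),
            "message": record.getMessage(),
        }
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def build_logger(name, stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()  # stands in for stdout or a log shipper
log = build_logger("orders", stream)

# The same request_id is attached to every entry produced for one request,
# so an aggregator can join entries coming from different services.
request_id = str(uuid.uuid4())
log.info("order created", extra={"service": "orders", "request_id": request_id})
try:
    1 / 0
except ZeroDivisionError:
    log.exception("payment failed", extra={"service": "orders", "request_id": request_id})

lines = [json.loads(line) for line in stream.getvalue().splitlines()]
```

Searching the aggregated logs for one `request_id` value then returns every entry for that request, including the exception entry with its traceback.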

Metrics

Application metrics provide a dashboard overview or detailed information about system traffic volume, overall API call success rate, average and slowest API response times, the number of currently available pods at the instance and system level, current service health, system status (operational, degraded, or unavailable), and instance/hardware monitoring.

Tool options: Prometheus and/or Grafana (for dashboard monitoring), one of the most popular metrics stacks in the industry. GitLab also supports Prometheus cluster integration for performance monitoring, custom metric monitoring, performance comparison across merge requests (MRs), and so on.
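As a rough illustration of how figures such as success rate and average or slowest response time are derived, here is a small in-memory sketch in plain Python. It is not the Prometheus client API. The `ApiMetrics` class and the choice to count any status below 500 as a success are assumptions for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ApiMetrics:
    """Hypothetical in-memory collector for per-request measurements."""

    durations_ms: list = field(default_factory=list)
    successes: int = 0
    failures: int = 0

    def record(self, status_code, duration_ms):
        self.durations_ms.append(duration_ms)
        # Assumption: only 5xx responses count against the success rate.
        if status_code < 500:
            self.successes += 1
        else:
            self.failures += 1

    def snapshot(self):
        total = self.successes + self.failures
        if total == 0:
            return {"total": 0, "success_rate": None, "avg_ms": None, "slowest_ms": None}
        return {
            "total": total,
            "success_rate": self.successes / total,
            "avg_ms": sum(self.durations_ms) / total,
            "slowest_ms": max(self.durations_ms),
        }


metrics = ApiMetrics()
for status, ms in [(200, 12.0), (200, 30.0), (503, 250.0), (404, 8.0)]:
    metrics.record(status, ms)

summary = metrics.snapshot()
# summary == {"total": 4, "success_rate": 0.75, "avg_ms": 75.0, "slowest_ms": 250.0}
```

In a real deployment these values would typically be exposed as counters and histograms for a metrics server to scrape, and the dashboard would compute the rates and percentiles.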

Tracing

Distributed tracing exports the details of the system's API call behavior (a request spans multiple services, just as its logs do). It provides the response time, HTTP status code, and operation performance (database queries, published messages, waterfall associations, etc.) for each operation in the entire span pipeline.

Tool options: Jaeger, which can also export to and store data in ELK for fast lookup and dashboard integration. It supports GitLab integration.
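The span idea can be sketched in a few lines of standard-library Python: every operation records its own duration and points to its parent, while all spans of one request share a trace ID. This is a single-threaded illustration, not the Jaeger or OpenTelemetry API. The `span` helper, the `finished_spans` list, and the operation names are hypothetical.

```python
import time
import uuid
from contextlib import contextmanager

finished_spans = []  # stand-in for an exporter to a tracing backend
_active = []         # stack of currently open spans (single-threaded sketch)


@contextmanager
def span(operation):
    """Time one operation and link it to the enclosing span, if any."""
    parent = _active[-1] if _active else None
    record = {
        "trace_id": parent["trace_id"] if parent else uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent["span_id"] if parent else None,
        "operation": operation,
        "status": "ok",
    }
    _active.append(record)
    start = time.perf_counter()
    try:
        yield record
    except Exception:
        record["status"] = "error"
        raise
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        _active.pop()
        finished_spans.append(record)


# One incoming request that performs a database query and publishes a message.
with span("GET /orders/42") as root:
    with span("db.query orders"):
        time.sleep(0.01)
    with span("publish order.viewed"):
        pass
```

Rendering `finished_spans` by `parent_id` and start time gives the familiar waterfall view. Across service boundaries, the trace ID and parent span ID would travel in request headers rather than in a local stack.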

Error Tracking

Error tracking captures the exceptions that occur anywhere in the system. An exception may be raised while handling a request or by an internal or background task.

Tool options: Sentry, which supports GitLab integration.
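A rough sketch of what an error tracker does with those exceptions: capture them at the boundary of a request or background task, and group repeated occurrences under one fingerprint. This uses only the Python standard library and is not the Sentry SDK. The `capture_exception` and `run_tracked` helpers and the fingerprint format are assumptions for illustration.

```python
import traceback
from collections import Counter

captured = []            # stand-in for events sent to an error tracker
occurrences = Counter()  # number of events seen per fingerprint


def capture_exception(exc, source):
    """Record an exception and group it by type and raising function."""
    frames = traceback.extract_tb(exc.__traceback__)
    where = frames[-1].name if frames else "<unknown>"
    fingerprint = f"{type(exc).__name__}:{where}"
    occurrences[fingerprint] += 1
    captured.append({"fingerprint": fingerprint, "source": source, "message": str(exc)})
    return fingerprint


def run_tracked(func, source):
    """Run func and report any exception instead of letting it pass silently.

    For brevity this sketch swallows the exception and returns None;
    real code would usually re-raise or return an error response.
    """
    try:
        return func()
    except Exception as exc:
        capture_exception(exc, source)
        return None


def handle_request():      # hypothetical request handler
    raise ValueError("invalid order id")


def nightly_cleanup():     # hypothetical background task
    raise TimeoutError("database did not respond")


run_tracked(handle_request, "request")
run_tracked(handle_request, "request")
run_tracked(nightly_cleanup, "background")
```

Grouping by fingerprint is what lets a tracker show "one issue, two events" instead of a flat list of stack traces.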

Alert

Alerts are triggered based on metrics, error tracking, or custom settings. An alert can be pushed to developers along with the related monitoring threshold, error message, or other operational details.

Tool options: integration with Prometheus and/or Sentry for alert triggering and notification.
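Threshold-based alerting can be sketched as a set of rules evaluated against the latest metric values. The example below is plain Python, not Prometheus alerting rule syntax. The `AlertRule` class, the rule names, and the metric names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    name: str
    metric: str
    threshold: float
    above: bool = True  # fire when the value is above (True) or below (False) the threshold


def evaluate(rules, metrics):
    """Return one alert message per rule whose threshold is breached."""
    alerts = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is None:
            continue  # metric not reported; this sketch skips the rule
        breached = value > rule.threshold if rule.above else value < rule.threshold
        if breached:
            alerts.append(f"[{rule.name}] {rule.metric}={value} (threshold {rule.threshold})")
    return alerts


rules = [
    AlertRule("HighLatency", "slowest_ms", 200.0),
    AlertRule("LowSuccessRate", "success_rate", 0.9, above=False),
    AlertRule("ErrorBurst", "errors_per_minute", 10.0),
]
current = {"slowest_ms": 250.0, "success_rate": 0.95, "errors_per_minute": 3}

alerts = evaluate(rules, current)
# alerts == ["[HighLatency] slowest_ms=250.0 (threshold 200.0)"]
```

Each message carries the threshold and the observed value, so whoever receives the notification can see why it fired. A production setup would also usually require the condition to hold for some duration before notifying, to avoid alerting on brief spikes.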


p.s. This summary is based on the resources available on the day it was posted. Any of the tools mentioned above could be updated, deprecated, or replaced as the technology evolves in the future.

p.s. Special thanks to Chris Richardson of microservices.io for providing the microservice resources that I studied and summarized here, based on his knowledge and experience. If this post helped you in some way, I highly recommend visiting his site to study the microservice architecture further.

Xuhui Sun
Senior Software Engineer

Modern C++ enthusiast exploring Artificial Intelligence and Machine Learning. Passion for learning and sharing knowledge! ❤️