Microservice Observability
Distributed System Observability
Overview
Distributed system observability includes, but is not limited to, log aggregation, application metrics, audit logging, distributed tracing, exception tracking, health check APIs, and logging of deployments and changes. The sections below cover some basic capabilities that improve developer efficiency and overall system maintainability. In a large-scale system, they help developers monitor system behavior (at both the hardware and software level), identify bottlenecks in the current implementation (at both the service and infrastructure level), debug and locate bugs in the codebase, run system-wide health checks, send alert and incident notifications, and follow the progress and status of new deployment rollouts and rollbacks.
Logs
Log aggregation provides fully customizable log tracing, waterfall logging that connects a request to its associated services, and exception logging. The logging system should also provide an easy-to-use web interface for keyword search across logs, which helps identify issues quickly and can also support internal business analytics.
Tool options: the ELK stack (Elasticsearch, Logstash, Kibana), one of the most popular logging stacks in the industry. GitLab also supports ELK cluster integration for full-text log search under the repository → Monitor → Logs tab.
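Aggregation and keyword lookup work best when every service emits logs in the same machine-readable shape. The following is a minimal sketch, using only the Python standard library, of one JSON object per log line with a shared request ID so that entries from different services can be joined later. The field names `service` and `request_id`, and the service name `order-service`, are assumptions made for illustration and are not part of any ELK requirement.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            # `service` and `request_id` are hypothetical fields passed via `extra`.
            "service": getattr(record, "service", "unknown"),
            "request_id": getattr(record, "request_id", None),
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Keep the stack trace with the entry so exceptions stay searchable.
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name, stream):
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def emit_sample():
    """Log one entry and return it parsed back from JSON."""
    stream = io.StringIO()
    logger = build_logger("order-service", stream)
    logger.info(
        "order created",
        extra={"service": "order-service", "request_id": "req-123"},
    )
    return json.loads(stream.getvalue())
```

In a real deployment the stream would typically be stdout or a file that a log shipper forwards to the aggregation cluster. Searching for a single `request_id` value would then return the entries from every service that handled the request.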
Metrics
Application metrics include dashboards with overview or detailed information about system traffic volume, overall API call success rate, average and slowest API response times, the number of currently available pods at the instance and system level, current service health, system status (operational, degraded, or unavailable), and instance/hardware monitoring.
Tool options: Prometheus and/or Grafana (dashboard monitoring), one of the most popular metrics stacks in the industry. GitLab also supports Prometheus cluster integration for performance monitoring, custom metric monitoring, performance comparison across merge requests (MRs), and so on.
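To make a few of these metrics concrete, here is a small standard-library sketch that derives success rate, average response time, and slowest response time from recorded API calls, then renders counters in a shape loosely modeled on the Prometheus text exposition format. The class name `ApiMetrics`, the metric names, and the rule that any status code of 400 or above counts as a failure are all assumptions for illustration. A real service would more likely use an official Prometheus client library.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ApiMetrics:
    """In-memory request statistics for one service (illustrative only)."""

    durations: List[float] = field(default_factory=list)
    successes: int = 0
    failures: int = 0

    def observe(self, status_code: int, seconds: float) -> None:
        """Record one finished API call."""
        self.durations.append(seconds)
        # Assumption: any 4xx or 5xx response counts as a failure.
        if status_code < 400:
            self.successes += 1
        else:
            self.failures += 1

    def success_rate(self) -> float:
        total = self.successes + self.failures
        return self.successes / total if total else 1.0

    def average_seconds(self) -> float:
        return sum(self.durations) / len(self.durations) if self.durations else 0.0

    def slowest_seconds(self) -> float:
        return max(self.durations, default=0.0)

    def render(self) -> str:
        """Render the metrics as plain text, one sample per line."""
        lines = [
            "# TYPE api_requests_total counter",
            f'api_requests_total{{outcome="success"}} {self.successes}',
            f'api_requests_total{{outcome="failure"}} {self.failures}',
            "# TYPE api_response_seconds_max gauge",
            f"api_response_seconds_max {self.slowest_seconds()}",
        ]
        return "\n".join(lines) + "\n"
```

A metrics server would scrape text like this at a fixed interval, and a dashboard would plot the values over time. Keeping raw durations in memory is only workable for a demo; production clients use histograms with fixed buckets instead.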
Tracing
Distributed tracing exports the details of how API calls behave across the system (a single request spans multiple services, just as with logging). It provides the response time, HTTP status code, and performance of each operation (database queries, message publishing, waterfall associations, etc.) in the entire span pipeline.
Tool options: Jaeger. It can also export to and store traces in ELK for fast lookup and dashboard integration, and it supports GitLab integration.
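The idea of a span pipeline can be sketched without any tracing library. In the toy tracer below, every operation in a request shares one trace ID, each span records its parent span, and durations are measured per operation, which is the information a waterfall view needs. The `Tracer` class, the dictionary layout, and the operation names are hypothetical and are not Jaeger's API.

```python
import time
import uuid
from contextlib import contextmanager


class Tracer:
    """Toy tracer: collects finished spans for a single request."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex  # shared by every span in the request
        self.finished = []
        self._stack = []  # currently open spans, innermost last

    @contextmanager
    def span(self, operation):
        record = {
            "trace_id": self.trace_id,
            "span_id": uuid.uuid4().hex[:16],
            # The innermost open span, if any, becomes the parent.
            "parent_id": self._stack[-1]["span_id"] if self._stack else None,
            "operation": operation,
        }
        self._stack.append(record)
        start = time.perf_counter()
        try:
            yield record
        finally:
            record["duration_ms"] = (time.perf_counter() - start) * 1000
            self._stack.pop()
            self.finished.append(record)


def handle_request():
    """Simulate one API call that queries a database and publishes a message."""
    tracer = Tracer()
    with tracer.span("GET /orders") as root:
        with tracer.span("db.query"):
            pass  # a database call would go here
        with tracer.span("publish.message"):
            pass  # a message broker call would go here
        root["http_status"] = 200
    return tracer.finished
```

Spans finish innermost first, so the root span appears last in the returned list. In a real multi-service setup the trace ID and parent span ID would also travel in request headers so that downstream services attach their spans to the same trace.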
Error Tracking
Error tracking captures the exceptions that happen anywhere in the system. An exception may be raised while handling a request or inside an internal or background task.
Tool options: Sentry, which supports GitLab integration.
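One thing an error tracker adds over raw logs is grouping: repeated occurrences of the same failure collapse into a single issue with a count. The sketch below shows that idea with the standard library only, grouping by exception type and the function and line where the exception was raised. The `ErrorTracker` class, the fingerprint rule, and the `charge_card` function are assumptions for illustration and do not reflect how Sentry computes its groups.

```python
import traceback
from collections import Counter


class ErrorTracker:
    """Toy error tracker that groups exceptions by a simple fingerprint."""

    def __init__(self):
        self.counts = Counter()
        self.samples = {}

    def capture(self, exc, source):
        """Record an exception raised in a request or a background task."""
        frames = traceback.extract_tb(exc.__traceback__)
        # Fingerprint: exception type plus the innermost function and line.
        location = f"{frames[-1].name}:{frames[-1].lineno}" if frames else "unknown"
        fingerprint = (type(exc).__name__, location)
        self.counts[fingerprint] += 1
        # Keep the first occurrence as the representative sample.
        self.samples.setdefault(
            fingerprint,
            {
                "source": source,
                "message": str(exc),
                "stack": "".join(
                    traceback.format_exception(type(exc), exc, exc.__traceback__)
                ),
            },
        )
        return fingerprint


def charge_card():
    """Hypothetical operation that always fails, for demonstration."""
    raise ValueError("card declined")


def run_demo():
    """Capture the same failure from a request and from a background task."""
    tracker = ErrorTracker()
    fingerprint = None
    for source in ("request", "background-task"):
        try:
            charge_card()
        except ValueError as exc:
            fingerprint = tracker.capture(exc, source)
    return tracker, fingerprint
```

Both failures originate from the same line in `charge_card`, so they fall into one group with a count of two, and the stored sample keeps the stack trace of the first occurrence.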
Alert
Alerts are triggered based on metrics, error tracking, or custom settings. An alert can be pushed to the developer together with the related monitoring threshold, error message, or other operational details.
Tool options: integrate with Prometheus and/or Sentry for alert triggering and notification.
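Threshold-based alerting can be illustrated with a short rule evaluator: each rule names a metric, a threshold, and a comparison, and a notification carrying the observed value and threshold goes out when the rule is breached. The rule names, metric names, and the `notify` callback below are hypothetical. In practice these rules would usually live in the monitoring system's own configuration rather than in application code.

```python
import operator
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AlertRule:
    name: str
    metric: str
    threshold: float
    # Returns True when (value, threshold) should trigger the alert.
    breached: Callable[[float, float], bool]


def evaluate(
    rules: List[AlertRule],
    metrics: Dict[str, float],
    notify: Callable[[str], None],
) -> List[str]:
    """Check each rule against current metrics and notify on breaches."""
    fired = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is None:
            continue  # metric not reported in this cycle; skip the rule
        if rule.breached(value, rule.threshold):
            notify(
                f"[ALERT] {rule.name}: {rule.metric}={value} "
                f"(threshold {rule.threshold})"
            )
            fired.append(rule.name)
    return fired


# Hypothetical rules: alert on a low success rate or a slow response.
RULES = [
    AlertRule("LowSuccessRate", "api_success_rate", 0.99, operator.lt),
    AlertRule("SlowResponse", "api_slowest_seconds", 0.5, operator.gt),
]
```

Calling `evaluate(RULES, current_metrics, send_to_chat)` on each monitoring cycle would push a message for every breached rule, where `send_to_chat` stands in for whatever delivery channel the team uses (email, chat, or paging).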
P.S. This summary is based on the resources available on the day it was posted. Any of the tools mentioned above could be updated, deprecated, or replaced as the technology evolves in the future.
P.S. Special thanks to Chris Richardson at microservices.io for providing the microservice resources I studied and summarized, drawing on his knowledge and experience. If this post helped you in some way, I highly recommend visiting his site to study microservice architecture further.