Monitoring Distributed Services

Filippo Berto

SESAR Lab - Università degli Studi di Milano

20/05/2025

Link to the slides

Agenda

Why monitoring?
Means of Monitoring
Logging Monitoring Events
Handling scalability
Extensions
Demo
Question time

Why monitoring?

Quality of Service

Systems’ health status and history
Alerts and prevention
Resource usage

Security

Implementation errors
User misbehavior
Incident analysis

Means of Monitoring

Metrics

Time-series measurements of system aspects

Numerical: e.g. CPU and RAM usage, available storage
Categorical: e.g. system on/off, pipeline execution stages

Metrics provide objective and structured measurements of a system’s behavior

Common formats: Prometheus metrics, JSON
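
As an illustration of the Prometheus text format, here is a minimal sketch that renders one numerical and one categorical metric by hand. The function `render_gauge` and the metric and label names are hypothetical; a real service would normally use a Prometheus client library instead, and label-value escaping is left out for brevity.

```python
def render_gauge(name, help_text, samples):
    """Render one gauge in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs.
    Note: label values are not escaped in this sketch.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        suffix = "{" + label_str + "}" if label_str else ""
        lines.append(f"{name}{suffix} {value}")
    return "\n".join(lines) + "\n"


# Numerical metric: RAM usage in bytes per host (made-up values).
print(render_gauge(
    "demo_memory_used_bytes",
    "Resident memory in use.",
    [({"host": "web-1"}, 512000000), ({"host": "web-2"}, 734000000)],
))

# Categorical metric encoded as 0/1: system on/off.
print(render_gauge(
    "demo_system_up",
    "1 if the system is on, 0 if it is off.",
    [({"host": "web-1"}, 1), ({"host": "web-2"}, 0)],
))
```

Each sample becomes one line of the form `name{labels} value`, and the time series is built by the collector as it reads this text repeatedly over time.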

Logs

Text records describing services’ operation

Organized by level of detail:

ERROR, WARNING, INFO, DEBUG, TRACE

Structured logs also include contextual information

e.g., variable values, timestamps, source code

Common formats: Logfmt, JSON, GELF
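
A structured log record in JSON could be produced as in the sketch below, which uses only Python's standard `logging` module. The `JsonFormatter` class, the `context` key passed through `extra`, and the logger name are assumptions made for this example. Python has no built-in TRACE level, so only the standard levels are shown.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Format each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Source location of the logging call.
            "file": record.filename,
            "line": record.lineno,
        }
        # Contextual variables passed via extra={"context": {...}}.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")  # hypothetical service name
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)

logger.warning("payment retried", extra={"context": {"order_id": 42, "attempt": 2}})
```

Because every record is a flat JSON object, a log store can filter on fields such as `level` or `order_id` without parsing free text.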

Traces and Spans

Traces record the stack of execution

Spans record contexts of execution

Timestamps (begin, end)
Variables
Logging level
Traces

Spans can be nested to instrument sections of code and functions

Common formats: OpenTelemetry, Jaeger, Zipkin

Example: logging and spans

Example: tracing

Metadata

All monitoring events can have associated metadata

Which application
Which host
Request ID
Environment
…
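
One way to attach such metadata is to enrich every event just before it is emitted. The sketch below is an assumption-laden illustration: the helper names, the `APP_ENV` environment variable, and the field names are all hypothetical.

```python
import os
import socket
import uuid
from typing import Optional


def base_metadata(application: str) -> dict:
    """Metadata that stays constant for the lifetime of the process."""
    return {
        "application": application,
        "host": socket.gethostname(),
        # APP_ENV is a made-up variable name for this example.
        "environment": os.environ.get("APP_ENV", "development"),
    }


def with_metadata(event: dict, metadata: dict, request_id: Optional[str] = None) -> dict:
    """Return a copy of `event` with metadata and a request ID attached."""
    enriched = dict(event)
    enriched["metadata"] = {**metadata, "request_id": request_id or uuid.uuid4().hex}
    return enriched


meta = base_metadata("checkout")
event = with_metadata({"type": "log", "message": "order created"}, meta)
print(event)
```

Reusing the same request ID across the metrics, logs, and spans of one request is what allows them to be correlated later.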

Monitoring pipelines

Logging Monitoring Events

For scalability we store monitoring events in append-only logs

Event type	Service
Metrics	Mimir
Logs	Loki
Traces	Tempo

Events are:

produced by monitoring agents (e.g., Alloy)
submitted by the application
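
For the second case, an application can submit log events to Loki directly over HTTP. The sketch below only builds the JSON body expected by Loki's push endpoint (`POST /loki/api/v1/push`); the function name, labels, and log lines are made up, and the actual HTTP request and server address are left out.

```python
import json
import time


def loki_push_payload(labels: dict, lines: list) -> str:
    """Build a JSON body for Loki's push API.

    Each entry is a [timestamp, line] pair, where the timestamp is a
    string holding Unix time in nanoseconds.
    """
    now_ns = time.time_ns()
    # Offset each line by 1 ns so entries keep their order.
    values = [[str(now_ns + i), line] for i, line in enumerate(lines)]
    return json.dumps({"streams": [{"stream": labels, "values": values}]})


body = loki_push_payload(
    {"application": "checkout", "environment": "development"},
    ["order created", "payment accepted"],
)
print(body)
```

Entries are only ever appended to the stream identified by the label set, which matches the append-only storage model described above.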

Handling scalability

Grafana Mimir’s architecture

Extensions

OpenTelemetry

Alerts management

ML-based predictions

Microservice distributed tracing

Demo

Metrics in Grafana

Question time
