Monitoring Distributed Services

Filippo Berto

SESAR Lab - Università degli Studi di Milano

20/05/2025

Link to the slides

Agenda

  • Why monitoring?
  • Means of Monitoring
  • Logging Monitoring Events
  • Handling scalability
  • Extensions
  • Demo
  • Question time

Why monitoring?

Quality of Service

  • Systems’ health status and history
  • Alerts and prevention
  • Resource usage

Security

  • Implementation errors
  • User misbehavior
  • Incident analysis

Means of Monitoring

Metrics

Timeseries-based measurements of system aspects

  • Numerical: e.g. CPU and RAM usage, available storage

  • Categorical: e.g. system on/off, pipeline execution stages

Metrics provide objective and structured measurements of a system’s behavior

Common formats: Prometheus metrics, JSON
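
As an illustration of the Prometheus format, here is a minimal sketch that renders one gauge sample in the Prometheus text exposition format. The `Gauge` type, the `render_gauge` and `escape_label` helpers, and the metric and label names are hypothetical and use only the Rust standard library; a real service would normally rely on a Prometheus client library instead.

```rust
use std::fmt::Write;

/// A single gauge sample with its labels (hypothetical helper type).
struct Gauge<'a> {
    name: &'a str,
    help: &'a str,
    labels: &'a [(&'a str, &'a str)],
    value: f64,
}

/// Escape backslashes, double quotes and newlines inside a label value.
fn escape_label(v: &str) -> String {
    v.replace('\\', "\\\\").replace('"', "\\\"").replace('\n', "\\n")
}

/// Render the gauge as HELP and TYPE comment lines followed by one sample line.
fn render_gauge(g: &Gauge) -> String {
    let mut out = String::new();
    writeln!(out, "# HELP {} {}", g.name, g.help).unwrap();
    writeln!(out, "# TYPE {} gauge", g.name).unwrap();
    let labels: Vec<String> = g
        .labels
        .iter()
        .map(|(k, v)| format!("{}=\"{}\"", k, escape_label(v)))
        .collect();
    if labels.is_empty() {
        writeln!(out, "{} {}", g.name, g.value).unwrap();
    } else {
        writeln!(out, "{}{{{}}} {}", g.name, labels.join(","), g.value).unwrap();
    }
    out
}
```

A numerical metric such as CPU usage is passed as its measured value; a categorical metric such as system on/off can be encoded as a gauge that takes the values 0 and 1.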

Logs

Text records describing services’ operation

Organized by level of detail:

ERROR, WARNING, INFO, DEBUG, TRACE

Structured logs also include contextual information

e.g., variable values, timestamps, source code locations

Common formats: Logfmt, JSON, GELF
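
The levels and structured fields above can be sketched as a small logfmt formatter. This is an illustrative, standard-library-only sketch: `Level`, `log_line` and `quote` are hypothetical names, and the quoting rules are a simplified assumption about logfmt rather than a full implementation.

```rust
/// Log levels in the order listed above: each level is more verbose than the previous one.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Level {
    Error,
    Warning,
    Info,
    Debug,
    Trace,
}

impl Level {
    fn as_str(self) -> &'static str {
        match self {
            Level::Error => "ERROR",
            Level::Warning => "WARNING",
            Level::Info => "INFO",
            Level::Debug => "DEBUG",
            Level::Trace => "TRACE",
        }
    }
}

/// Quote a value when it is empty or contains a space, a quote or an equals sign.
fn quote(v: &str) -> String {
    if v.is_empty() || v.contains(|c: char| c == ' ' || c == '"' || c == '=') {
        format!("\"{}\"", v.replace('\\', "\\\\").replace('"', "\\\""))
    } else {
        v.to_string()
    }
}

/// Format one record as a logfmt line.
/// Returns None when the record is more verbose than `max_level`.
fn log_line(max_level: Level, level: Level, msg: &str, fields: &[(&str, &str)]) -> Option<String> {
    if level > max_level {
        return None;
    }
    let mut line = format!("level={} msg={}", level.as_str(), quote(msg));
    // Contextual information travels as extra key=value pairs.
    for (k, v) in fields {
        line.push_str(&format!(" {}={}", k, quote(v)));
    }
    Some(line)
}
```

With `max_level` set to `Level::Info`, ERROR, WARNING and INFO records are emitted while DEBUG and TRACE records are dropped.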

Traces and Spans

Traces record the stack of execution

Spans record contexts of execution

  • Timestamps (begin, end)
  • Variables
  • Logging level
  • Traces

Spans can be nested to instrument sections of code and functions

Common formats: OpenTelemetry, Jaeger, Zipkin
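
To make span nesting concrete, here is a minimal sketch of a span recorder built only on the Rust standard library. `Tracer`, `OpenSpan`, `SpanRecord` and `render` are hypothetical names chosen for illustration; this is not the API of any tracing library, and it stores only a duration where a real span would carry begin and end timestamps.

```rust
use std::time::{Duration, Instant};

/// A finished span: a named context of execution with its duration,
/// recorded variables and nested child spans.
struct SpanRecord {
    name: String,
    elapsed: Duration,
    fields: Vec<(String, String)>,
    children: Vec<SpanRecord>,
}

/// A span that has been entered but not yet exited.
struct OpenSpan {
    name: String,
    start: Instant,
    fields: Vec<(String, String)>,
    children: Vec<SpanRecord>,
}

/// Keeps a stack of open spans; the top of the stack is the current span.
struct Tracer {
    open: Vec<OpenSpan>,
    finished: Vec<SpanRecord>,
}

impl Tracer {
    fn new() -> Self {
        Tracer { open: Vec::new(), finished: Vec::new() }
    }

    /// Open a new span nested inside the current one, if any.
    fn enter(&mut self, name: &str) {
        self.open.push(OpenSpan {
            name: name.to_string(),
            start: Instant::now(),
            fields: Vec::new(),
            children: Vec::new(),
        });
    }

    /// Attach a variable to the current span.
    fn record(&mut self, key: &str, value: &str) {
        if let Some(span) = self.open.last_mut() {
            span.fields.push((key.to_string(), value.to_string()));
        }
    }

    /// Close the current span and attach it to its parent, or to the
    /// list of finished root spans when it has no parent.
    fn exit(&mut self) {
        if let Some(span) = self.open.pop() {
            let record = SpanRecord {
                name: span.name,
                elapsed: span.start.elapsed(),
                fields: span.fields,
                children: span.children,
            };
            match self.open.last_mut() {
                Some(parent) => parent.children.push(record),
                None => self.finished.push(record),
            }
        }
    }
}

/// Render a span tree as indented text, one span per line.
fn render(span: &SpanRecord, depth: usize, out: &mut String) {
    out.push_str(&"  ".repeat(depth));
    out.push_str(&span.name);
    for (k, v) in &span.fields {
        out.push_str(&format!(" {}={}", k, v));
    }
    out.push('\n');
    for child in &span.children {
        render(child, depth + 1, out);
    }
}
```

Entering `handle_request`, then entering and exiting `query_db` before exiting `handle_request`, yields one root span with `query_db` as its child, and the parent's duration always covers the child's.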

Example: logging and spans

Rust code
Rust execution

Example: tracing

Metadata

All monitoring events can have associated metadata

  • Which application
  • Which host
  • Request ID
  • Environment
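
One way to picture this: the metadata is merged into the fields of every event before it is shipped. The sketch below assumes a hypothetical `Metadata` struct and `with_metadata` helper, and lets metadata keys overwrite event fields with the same name; that precedence is a design choice made for the example, not a rule of any monitoring tool.

```rust
use std::collections::BTreeMap;

/// Metadata shared by every event emitted by one process.
struct Metadata {
    application: String,
    host: String,
    environment: String,
}

/// Merge process-wide metadata and an optional request ID into an event's fields.
/// Metadata is inserted last, so it overwrites event fields that reuse its keys.
fn with_metadata(
    meta: &Metadata,
    request_id: Option<&str>,
    fields: &[(&str, &str)],
) -> BTreeMap<String, String> {
    let mut out = BTreeMap::new();
    for (k, v) in fields {
        out.insert(k.to_string(), v.to_string());
    }
    out.insert("application".to_string(), meta.application.clone());
    out.insert("host".to_string(), meta.host.clone());
    out.insert("environment".to_string(), meta.environment.clone());
    if let Some(id) = request_id {
        out.insert("request_id".to_string(), id.to_string());
    }
    out
}
```

The same helper applies to metrics, logs and spans alike, which is what makes it possible to correlate the three kinds of events for a single request.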

Monitoring pipelines

Logging Monitoring Events

For scalability we store monitoring events in append-only logs

Event type    Service
Metrics       Mimir
Logs          Loki
Traces        Tempo
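
The append-only idea can be sketched in a few lines: writers only add events at the end, and each reader remembers the offset it has reached. `AppendOnlyLog` is a hypothetical in-memory type for illustration only; it says nothing about how Mimir, Loki or Tempo store data internally.

```rust
/// An in-memory append-only log: entries are added and read, never modified or removed.
struct AppendOnlyLog {
    entries: Vec<String>,
}

impl AppendOnlyLog {
    fn new() -> Self {
        AppendOnlyLog { entries: Vec::new() }
    }

    /// Append an event and return the offset at which it was stored.
    fn append(&mut self, event: &str) -> usize {
        self.entries.push(event.to_string());
        self.entries.len() - 1
    }

    /// Return every event stored at `offset` or later.
    /// An offset past the end yields an empty slice.
    fn read_from(&self, offset: usize) -> &[String] {
        self.entries.get(offset..).unwrap_or(&[])
    }
}
```

Because existing entries never change, a reader that has processed everything up to offset n only needs to call `read_from(n + 1)` to catch up.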

Events are:

  • produced by monitoring agents (e.g., Alloy)
  • submitted by the application

Handling scalability

Mimir architecture

Grafana Mimir’s architecture

Extensions

OpenTelemetry

Alerts management

ML-based predictions

Microservice distributed tracing

Demo

Metrics in Grafana

QR Code DCGM

Question time
