How much of each resource is used by each service, and how much should each user/service have access to?
Enforce strong security and QoS guarantees on the infrastructure and the services running on it
Events are tuples
$$\langle{}t, v, m\rangle{}$$ where $t$ is the timestamp, $v$ is the recorded value, and $m$ is the metadata attached to the value
Time series are sets of events whose values are related, e.g., temperature over time or the logs of a service
$$ \{\langle{}t_1,v_1,m_1\rangle{},\ldots,\langle{}t_n,v_n,m_n\rangle{}\} $$
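A minimal sketch of the event and time series definitions above. It assumes that $t$ is a numeric timestamp, $v$ is the value, and $m$ is a set of key/value metadata pairs; the names `Event`, `make_series`, and `query_range` are hypothetical and not taken from any particular time series database.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Event:
    t: float        # timestamp (assumed: seconds)
    v: Any          # value: a number for metrics, text for logs
    m: tuple = ()   # metadata as (key, value) pairs (assumed layout)


def make_series(events):
    """Return the events ordered by timestamp."""
    return sorted(events, key=lambda e: e.t)


def query_range(series, start, end):
    """Select the events with start <= t < end."""
    return [e for e in series if start <= e.t < end]


# A temperature series whose events arrive out of order.
temperature = make_series([
    Event(10.0, 21.7, (("sensor", "room-a"),)),
    Event(0.0, 21.5, (("sensor", "room-a"),)),
    Event(5.0, 21.6, (("sensor", "room-a"),)),
])
window = query_range(temperature, 0.0, 10.0)
```

Here `window` holds the two events at $t = 0$ and $t = 5$; a real time series database would express the same range selection through its query language.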
Events are stored in time series databases, which are optimized for storage compression and for data retrieval through queries written in a domain-specific language (DSL)
Logging records textual events in time series
Monitoring records data events in time series
Values are generally sampled at fixed intervals (e.g., every 5 s)
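Fixed-interval sampling can be sketched as a loop that reads a value, timestamps it, and waits. This is an illustration only: the `sample` function and the `FakeClock` helper are hypothetical, and the clock is injected so that the example finishes without actually waiting.

```python
import time


def sample(read_value, interval_s=5.0, count=3,
           clock=time.time, sleep=time.sleep):
    """Collect `count` (timestamp, value) samples spaced `interval_s` apart."""
    samples = []
    for i in range(count):
        samples.append((clock(), read_value()))
        if i < count - 1:
            sleep(interval_s)
    return samples


class FakeClock:
    """Simulated clock: sleeping advances time immediately."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


fake = FakeClock()
readings = iter([0.42, 0.40, 0.47])  # made-up CPU load values
cpu_load = sample(lambda: next(readings), interval_s=5.0, count=3,
                  clock=fake.time, sleep=fake.sleep)
```

With the simulated clock, the three samples carry the timestamps 0, 5, and 10 seconds.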
Traces record execution traces
Spans record execution contexts
Each request is associated with an ID. Traces with that ID can be tracked across multiple microservices
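A sketch of how a request ID can tie spans from several microservices into one trace. Everything here is a simplifying assumption: the `Span` fields, the `trace-id` and `parent-span-id` header names, and the in-memory `collected` list standing in for a tracing backend. Real systems typically propagate this context through a standard such as the W3C Trace Context `traceparent` header.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str              # shared by every span of the same request
    span_id: str               # unique per span
    parent_id: Optional[str]   # None for the root span
    service: str
    name: str


collected = []  # stands in for a tracing backend


def start_span(service, name, trace_id=None, parent_id=None):
    """Record a span, creating a new trace ID when none is passed in."""
    span = Span(trace_id or uuid.uuid4().hex, uuid.uuid4().hex[:16],
                parent_id, service, name)
    collected.append(span)
    return span


def inventory_service(headers):
    # The callee reuses the caller's trace ID taken from the request headers.
    start_span("inventory", "reserve_item",
               headers["trace-id"], headers["parent-span-id"])


def checkout_service():
    root = start_span("checkout", "POST /checkout")
    inventory_service({"trace-id": root.trace_id,
                       "parent-span-id": root.span_id})
    return root.trace_id


trace_id = checkout_service()
trace = [s for s in collected if s.trace_id == trace_id]
```

Filtering the collected spans by `trace_id` recovers both the `checkout` and the `inventory` span, and the `parent_id` link preserves the call hierarchy.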
Each record in a time series can carry several metadata fields that provide additional context to the value
Especially useful when considering complex and dynamic environments
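To illustrate why metadata helps in dynamic environments, the sketch below selects events by their metadata, in the spirit of label selectors in time series query languages. The label names (`service`, `pod`, `region`), the sample values, and the `select` helper are made up for the example.

```python
def matches(metadata, selector):
    """True when every key/value pair of the selector is in the metadata."""
    return all(metadata.get(k) == v for k, v in selector.items())


def select(series, selector):
    """Return the (t, v) pairs of the events whose metadata matches."""
    return [(t, v) for (t, v, m) in series if matches(m, selector)]


# Request counts reported by several pods; each event is (t, v, m).
requests = [
    (0, 120, {"service": "api", "pod": "api-1", "region": "eu"}),
    (0, 95, {"service": "api", "pod": "api-2", "region": "eu"}),
    (5, 130, {"service": "api", "pod": "api-1", "region": "eu"}),
    (5, 40, {"service": "auth", "pod": "auth-1", "region": "us"}),
]

api_1 = select(requests, {"service": "api", "pod": "api-1"})
eu = select(requests, {"region": "eu"})
```

The same raw events can be sliced per pod, per service, or per region without changing how they are recorded, which matters when pods are created and destroyed frequently.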
Prometheus: Scrapes metrics from monitoring-enabled services
Tempo: Stores traces pushed by monitoring-enabled services
Loki: Collects logs pushed by services and agents and acts as a long-term log store
Grafana: Visualizes queries over metrics, logs, and traces
Alloy: Creates monitoring data pipelines
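As a rough illustration of the scrape model, the sketch below renders samples in the Prometheus text exposition format, which a monitoring-enabled service would typically serve over HTTP for Prometheus to pull. The `render_metrics` helper and the sample values are hypothetical, the output omits the optional `# HELP` and `# TYPE` lines, and label values are not escaped; in practice an official client library would produce this output.

```python
def render_metrics(samples):
    """Render (name, labels, value) samples as text exposition lines."""
    lines = []
    for name, labels, value in samples:
        # Labels are sorted only to make the output deterministic.
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


body = render_metrics([
    ("http_requests_total", {"method": "GET", "code": "200"}, 1027),
    ("process_open_fds", {}, 12),
])
```

Each line pairs a metric name and its labels with the current value; Prometheus attaches the scrape time when no timestamp is given.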
Current monitoring techniques are not enough
Peers share monitoring data to identify issues in shared resources
Centralized knowledge of metric definitions