MUSA Cluster monitoring and Moon Cloud integration

Filippo Berto

SESAR Lab logo

Why monitor a system?

Resource allocation

How much of each resource each service uses, and how much each user or service should have access to

Development & Debugging

  • Execution traces
  • Errors/warnings
  • Resource usage
  • Behavior analysis
  • Network traffic

Security & QoS Assurance

Enforce strong security and QoS guarantees on the infrastructure and on the services running on it

Events and Time series

Events are tuples

$$\langle{}t, v, m\rangle{}$$ where:

  • $t$ is a timestamp
  • $v$ is a value of some kind
  • $m$ are metadata

Time series are sets of events whose values are related, e.g., temperature over time or service logs

$$ \{\langle{}t_1,v_1,m_1\rangle{},\ldots,\langle{}t_n,v_n,m_n\rangle{}\} $$

Events are stored in time series databases, which are optimized for storage compression and for data retrieval through a query DSL
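A minimal sketch of the definitions above, assuming an in-memory list in place of a real time series database. The `Event` class, the `query` helper and the sample temperature readings are illustrative names and values, not part of the MUSA setup.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Event:
    t: float                                # timestamp, here in seconds
    v: Any                                  # value: a number, a log line, ...
    m: dict = field(default_factory=dict)   # metadata


def query(series, start, end):
    """Return the events with start <= t < end, ordered by timestamp."""
    return sorted((e for e in series if start <= e.t < end), key=lambda e: e.t)


# A small time series: temperature readings from one host (made-up values).
temperature = [
    Event(t=10.0, v=21.5, m={"host": "node-1"}),
    Event(t=0.0, v=21.0, m={"host": "node-1"}),
    Event(t=5.0, v=21.2, m={"host": "node-1"}),
]

recent = query(temperature, 5.0, 15.0)
```

A real time series database adds compression and indexing on top of this model, but the query shape (a time range plus optional metadata filters) stays the same.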

Logs

Logging records textual events in time series

Metrics

Monitoring records data events in time series

Values are generally sampled at fixed intervals (e.g., every 5 s)

Traces & Spans

Traces record the execution path of a request

Spans record execution contexts

Distributed request tracing

Each request is associated with an ID. Traces with that ID can be tracked across multiple microservices
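The ID propagation can be sketched as follows. This is an illustration under assumptions: the header names `x-trace-id` and `x-parent-span-id` and the service names are hypothetical, and the `collected` list stands in for a tracing backend. Real deployments typically rely on a standard such as the W3C Trace Context `traceparent` header instead.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str             # shared by every span of the same request
    span_id: str              # unique per span
    parent_id: Optional[str]  # span that caused this one, None for the root
    service: str
    name: str


collected = []  # stands in for a tracing backend


def start_span(service, name, headers):
    """Open a span, continuing the trace found in the incoming headers, if any."""
    trace_id = headers.get("x-trace-id") or uuid.uuid4().hex
    span = Span(trace_id, uuid.uuid4().hex[:16],
                headers.get("x-parent-span-id"), service, name)
    collected.append(span)
    return span


def outgoing_headers(span):
    """Headers a service attaches to its calls to downstream services."""
    return {"x-trace-id": span.trace_id, "x-parent-span-id": span.span_id}


# One request crossing three hypothetical services.
root = start_span("frontend", "GET /checkout", {})
child = start_span("orders", "create_order", outgoing_headers(root))
leaf = start_span("payments", "charge", outgoing_headers(child))

# Everything belonging to the request can be retrieved by its trace ID.
trace = [s for s in collected if s.trace_id == root.trace_id]
```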

Service Net

Metadata

Each record in a time series can carry several metadata fields that provide additional context for the value

  • Application
  • Version
  • Host
  • Deployment identifier

Especially useful when considering complex and dynamic environments
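As a sketch of why metadata matters in dynamic environments, the snippet below filters records by metadata labels. The `select` helper, the label keys and the sample values are assumptions made for illustration.

```python
def select(records, **labels):
    """Keep the records whose metadata matches every given label."""
    return [
        r for r in records
        if all(r["m"].get(key) == value for key, value in labels.items())
    ]


# Made-up records: same timestamp, different applications, versions and hosts.
records = [
    {"t": 0, "v": 120, "m": {"application": "api", "version": "1.4", "host": "node-1"}},
    {"t": 0, "v": 340, "m": {"application": "api", "version": "1.5", "host": "node-2"}},
    {"t": 0, "v": 95, "m": {"application": "db", "version": "16", "host": "node-1"}},
]

api_v15 = select(records, application="api", version="1.5")
on_node1 = select(records, host="node-1")
```

The same values can be sliced per version (to compare a rollout) or per host (to isolate a faulty node) without changing what the services emit.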

Current monitoring infrastructure

Monitoring Alloy

Prometheus: Scrapes metrics from monitoring-enabled services

Tempo: Receives and stores traces pushed by monitoring-enabled services

Loki: Aggregates logs pushed by monitoring-enabled services and acts as a long-term storage database

Grafana: Visualizes queries over metrics, logs and traces

Alloy: Creates monitoring data pipelines
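To make "monitoring-enabled" concrete for the Prometheus case: a service exposes its current values as plain text, and Prometheus scrapes that text periodically. The sketch below renders one metric in the Prometheus text exposition format. The `render_metric` helper and the `queue_depth` metric are hypothetical, and label values are not escaped here, which a production client library would handle.

```python
def render_metric(name, help_text, samples, metric_type="gauge"):
    """Render one metric family as Prometheus text exposition lines.

    samples is a list of (labels, value) pairs, where labels is a dict.
    Label values are assumed to need no escaping in this sketch.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


body = render_metric(
    "queue_depth",
    "Jobs waiting in the queue.",
    [({"host": "node-1", "application": "worker"}, 7)],
)
```

In practice a service would serve this text over HTTP, usually through an official Prometheus client library rather than hand-written formatting.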

Moon Cloud

Current monitoring techniques are not enough

  • Complex queries
  • Multiple data-sources
  • Custom scripts

Probes targeting monitoring

  • Query monitoring infra. for data
  • Detect issues in the target
    • Behavior
    • Configuration
    • Security events
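A probe of this kind could be sketched as below. The function evaluates a response shaped like the result of a Prometheus instant query (`/api/v1/query`, vector result type) and reports the targets that exceed a threshold. The `failing_targets` name, the threshold and the sample payload are assumptions; the actual Moon Cloud probe interface is not shown here.

```python
import json


def failing_targets(response, threshold):
    """Return the label sets of the series whose value exceeds threshold."""
    if response.get("status") != "success":
        raise ValueError("monitoring query failed")
    failing = []
    for series in response["data"]["result"]:
        _timestamp, raw_value = series["value"]  # the value is encoded as a string
        if float(raw_value) > threshold:
            failing.append(series["metric"])
    return failing


# Made-up payload in the shape of an instant-query response.
sample_response = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"job": "api", "instance": "node-1"}, "value": [1700000000, "0.02"]},
      {"metric": {"job": "api", "instance": "node-2"}, "value": [1700000000, "0.31"]}
    ]
  }
}
""")

# Hypothetical check: flag instances whose error ratio is above 10%.
failing = failing_targets(sample_response, threshold=0.10)
```

Keeping the evaluation separate from the HTTP call makes the probe logic testable without a running monitoring stack.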

Demo

Future work

Probe integration into monitoring

  • Security-specific probes in monitoring pipeline
  • Fits with the current Moon Cloud paradigm
  • Integrates standard monitoring infrastructure

Real-time anomaly detection

  • Feed monitoring data to anomaly detection models
    • Also from different sources
  • Correlate
    • Metrics
    • Events
    • Spans metadata
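As a stand-in for an anomaly detection model, the sketch below uses a rolling z-score over a stream of metric samples. This is a deliberately simple technique chosen for illustration, not the model planned for the project. The window size, threshold, warm-up length and latency values are all assumptions.

```python
from collections import deque
from statistics import mean, pstdev


class RollingZScore:
    """Flag values that deviate strongly from a sliding window of recent samples."""

    def __init__(self, window=20, threshold=3.0, warmup=5):
        self.values = deque(maxlen=window)
        self.threshold = threshold
        self.warmup = warmup  # minimum samples before any value is flagged

    def observe(self, value):
        """Return True if value is more than threshold sigmas from the window mean."""
        anomalous = False
        if len(self.values) >= self.warmup:
            mu, sigma = mean(self.values), pstdev(self.values)
            anomalous = sigma > 0 and abs(value - mu) / sigma > self.threshold
        self.values.append(value)
        return anomalous


detector = RollingZScore(window=10, threshold=3.0)
latencies_ms = [100, 102, 98, 101, 99, 100, 103, 97, 400, 101]  # made-up samples
flags = [detector.observe(x) for x in latencies_ms]
```

Note that a flagged value still enters the window and inflates the next estimates. A more careful detector would exclude or down-weight it, and would correlate several sources rather than a single metric.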

Real-Time Assurance & Certification

  • Integrate real-time monitoring into certification process
  • Define certification contracts on metrics and behavior
  • Can be extended with
    • Metrics prediction (e.g., satellites)
    • Distributional forecasts (validity likelihood)
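One possible shape for a certification contract on a metric is sketched below: an upper bound plus the minimum fraction of samples that must respect it. The `Contract` fields, the `evaluate` helper and the latency figures are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class Contract:
    metric: str
    upper_bound: float     # a sample complies when it is <= this bound
    min_compliance: float  # fraction of samples that must comply


def evaluate(contract, samples):
    """Return (holds, compliance) for the samples of contract.metric."""
    if not samples:
        return False, 0.0  # no evidence: the contract cannot be certified
    compliant = sum(1 for value in samples if value <= contract.upper_bound)
    compliance = compliant / len(samples)
    return compliance >= contract.min_compliance, compliance


contract = Contract(metric="http_latency_ms", upper_bound=200, min_compliance=0.9)
observed = [120, 150, 180, 250, 130, 140, 160, 170, 110, 190]  # made-up samples
holds, compliance = evaluate(contract, observed)
```

Evaluating the same contract on predicted samples instead of observed ones would be one way to approach the metrics prediction extension.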

Federated monitoring

Peers share monitoring data to identify issues in shared resources

  • multi-tenant environments
  • edge-cloud continuum

Centralized knowledge on metrics definitions

Questions?