Monitoring and Anomaly Detection in Distributed Systems: A Scalable Real-Time Solution
Distributed systems have become one of the pillars of modern computing, forming the backbone of technologies such as cloud computing and the Internet of Things and supporting many of the services that run on the Internet. However, these systems raise a unique set of challenges, particularly in ensuring scalability, performance, and reliability, and in detecting issues that could lead to system failures. Effective monitoring of such systems in complex environments is essential to maintaining these properties.
This thesis presents a comprehensive approach to real-time monitoring and anomaly detection in distributed systems through a scalable framework designed to be versatile, modular, and capable of handling the complexity typical of such systems. The framework is built around four key components: Monitored Agents, Data Collection Agents, Detection Agents, and Visualization Agents. Monitored Agents continuously gather measurements from various parts of the system, while Data Collection Agents collect, aggregate, and standardize these measurements for further analysis. Detection Agents then use historical and real-time data to identify anomalies. Visualization Agents present the results of the collection and anomaly detection processes, providing real-time insights through intuitive visualizations and raising alerts for any detected anomalies.
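The flow of data between the four components can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than the thesis implementation: the `Measurement` record, the function names, and in particular the `outside_past_range` rule, which is a deliberately simple placeholder for whatever detector a Detection Agent actually runs.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Measurement:
    """Hypothetical standardized record shared by all agents."""
    source: str
    metric: str
    timestamp: float
    value: float


def monitored_agent(source: str, metric: str,
                    readings: Iterable[Tuple[float, float]]) -> List[Measurement]:
    """Turn raw (timestamp, value) readings from one node into measurements."""
    return [Measurement(source, metric, float(t), float(v)) for t, v in readings]


def data_collection_agent(batches: Iterable[List[Measurement]]) -> List[Measurement]:
    """Merge batches from several monitored agents and order them by time."""
    merged = [m for batch in batches for m in batch]
    return sorted(merged, key=lambda m: (m.timestamp, m.source))


def detection_agent(history: List[Measurement], live: List[Measurement],
                    is_anomalous: Callable[[List[float], float], bool]) -> List[Measurement]:
    """Compare live measurements against historical values with a pluggable rule."""
    past = [m.value for m in history]
    return [m for m in live if is_anomalous(past, m.value)]


def visualization_agent(anomalies: List[Measurement]) -> List[str]:
    """Render detected anomalies as alert lines (stand-in for a dashboard)."""
    return [f"ALERT {m.source} {m.metric}={m.value} at t={m.timestamp}"
            for m in anomalies]


def outside_past_range(past: List[float], value: float, margin: float = 0.5) -> bool:
    """Placeholder rule: anomalous if outside the observed range widened by `margin`."""
    lo, hi = min(past), max(past)
    pad = margin * (hi - lo)
    return value < lo - pad or value > hi + pad
```

Keeping the detection rule as a parameter of `detection_agent` reflects the modularity the framework aims for: the collection and visualization stages do not need to change when the detector does.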
A significant contribution of the experimental part of this thesis is a scalable monitoring solution, based on an implementation of the proposed framework, that can be easily integrated into existing distributed infrastructures. The proposed system builds on widely used monitoring platforms, namely Prometheus and Grafana, and extends them with real-time anomaly detection based on the Prophet time series forecasting model. This extension enables proactive monitoring by predicting expected system behavior and flagging an anomaly whenever the observed behavior deviates significantly from the prediction. A custom anomaly detection module that interfaces with Prometheus was implemented; it efficiently processes large volumes of measurements in real time to find anomalous behavior. The system was evaluated in a distributed environment, demonstrating its effectiveness at identifying anomalies while maintaining low resource consumption and high responsiveness.
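The idea of flagging deviations from a forecast can be sketched as follows. This is not the thesis module and it calls neither Prophet nor Prometheus: the per-slot seasonal mean used here is a simple stand-in forecaster, and the `period` and `width` parameters are assumptions. Only the names `yhat`, `yhat_lower`, and `yhat_upper` are borrowed from the columns of Prophet's forecast output, to show where a real forecast would plug in.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

Band = Tuple[float, float, float]  # (yhat, yhat_lower, yhat_upper)


def seasonal_forecast(history: Sequence[float], period: int,
                      horizon: int, width: float = 3.0) -> List[Band]:
    """Stand-in forecaster: predict each future point as the mean of past
    values at the same position within the period, with a band of
    `width` standard deviations of the historical residuals."""
    if len(history) < period:
        raise ValueError("history must cover at least one full period")
    slot_means = [mean(history[s::period]) for s in range(period)]
    residuals = [v - slot_means[i % period] for i, v in enumerate(history)]
    spread = width * pstdev(residuals)
    start = len(history)
    bands = []
    for step in range(horizon):
        yhat = slot_means[(start + step) % period]
        bands.append((yhat, yhat - spread, yhat + spread))
    return bands


def flag_anomalies(forecast: Sequence[Band], observed: Sequence[float]) -> List[int]:
    """Return indices of observations that fall outside the predicted band."""
    return [i for i, ((_, lower, upper), value) in enumerate(zip(forecast, observed))
            if not lower <= value <= upper]
```

With a history that repeats every three samples, a new observation far above the value expected for its slot is flagged even if it would look ordinary against the global range of the series, which is the practical benefit of comparing against a forecast rather than a fixed threshold.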
This solution is applicable to various types of distributed architectures, providing a robust and adaptable framework for real-time monitoring and anomaly detection that can meet the specific demands of distributed infrastructures and system operators. By combining artificial intelligence-based time series forecasting, real-time data processing, and anomaly detection, this work paves the way for future advances in distributed systems monitoring.