Gianluca Lo Vecchio
The idea behind this thesis originates from a need perceived across various sectors of the Italian economy: a data platform that can manage, organize, and connect a company's different information assets, making them available to internal or external specialists under contract. What is required is a ``single point of truth'' capable of connecting the Proofs of Concept (PoCs) developed over the years, together with pre-existing on-premise source systems and cloud production architectures, into a single, comprehensive, and expandable structure. Given this ambitious goal, it was decided to demonstrate feasibility by building an architecture: first analyzing the available technology stack and the synergies between its components, then strategically selecting components so as to keep the structure both streamlined and comprehensive.
Challenge
The challenge is to deploy a functional, modular, and scalable architecture on Kubernetes using only open-source technologies. The solution has to be cloud-compatible, so that any new SaaS offering from the provider can be exploited without the core system being bound to that provider.
Objective
The objective is to build a complete lakehouse architecture on Kubernetes, from the storage layer to the visualization layer, including RLS\footnote{Row-Level Security.} and role management.
Contribution
During the preliminary phase, the use of a container orchestrator was identified as crucial to simplify the management of the architecture's components, given the advantages of containerization and the requirements for scalability and high availability. Other orchestrators, such as Docker Swarm, were found insufficient for deploying more complex architectures because of their limitations in scaling, container management, and failover. Kubernetes was chosen as the orchestrator: it is well suited to complex deployments and advanced architectures that require a diverse technological stack while abstracting away the underlying hardware.
Once the orchestrator was selected, an analysis was conducted to choose the storage layer component in light of Kubernetes and the thesis objectives. Likewise, the computational layer was chosen strategically to achieve synergy between Kubernetes and the storage layer. Several criteria were considered: integration with Kubernetes, compatibility with distributed and existing cloud systems, scalability, ease of operation and management, simplicity of development and configuration, replication capabilities, and security.
For the storage layer, two components were evaluated: MinIO and GlusterFS, both open-source and designed for distributed, scalable, large-scale systems. The choice fell on MinIO because it requires no plugins to work with Kubernetes and no additional adjustments to communicate with the cloud. Moreover, its S3-compatible object storage simplifies development and configuration.
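As an illustration of this S3 compatibility, the following minimal sketch accesses MinIO through the standard boto3 S3 client; the endpoint, credentials, and bucket name are hypothetical placeholders, not values from the thesis deployment.
\begin{verbatim}
import boto3

# MinIO exposes the standard S3 API, so the ordinary boto3 client works
# unchanged; endpoint, credentials, and bucket below are placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio.storage.svc.cluster.local:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)

# Objects are written and listed exactly as they would be against AWS S3.
s3.put_object(Bucket="lakehouse", Key="raw/events/part-0.parquet",
              Body=b"...")
for obj in s3.list_objects_v2(Bucket="lakehouse").get("Contents", []):
    print(obj["Key"])
\end{verbatim}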
In the case of the compute layer, Apache Hive and Trino were considered, evaluated on integration with Kubernetes and MinIO, performance, integration with the Apache suite, and ease of development. The choice fell on Trino because it proved more performant: the thesis architecture is a lakehouse whose workloads are interactive, dashboard-driven analytical queries rather than long-running batch jobs, a profile that suits Trino's distributed query engine. Furthermore, both Trino and Hive rely on the same connector to expose MinIO data as tables (the Apache Hive Metastore), and both integrate with the Apache suite, natively in the case of Hive and through straightforward deployment configuration in the case of Trino.
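To give an idea of how this compute layer is consumed, the sketch below queries Trino through its official Python client against a Hive Metastore-backed catalog; host, user, and table names are illustrative assumptions.
\begin{verbatim}
import trino

# Connect to the Trino coordinator; host, user, catalog, and schema are
# placeholders for the cluster-specific values.
conn = trino.dbapi.connect(
    host="trino-coordinator.analytics.svc.cluster.local",
    port=8080,
    user="analyst",
    catalog="hive",   # catalog resolved through the Hive Metastore
    schema="default",
)

cur = conn.cursor()
cur.execute("SELECT status, count(*) FROM events GROUP BY status")
for row in cur.fetchall():
    print(row)
\end{verbatim}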
Finally, Apache Superset was chosen as the visualization layer because it natively features RLS, role management, analysis tools, and dashboards, and it connects easily to Trino through a dedicated driver. Once the architecture was implemented, overcoming the obstacles posed by incomplete or incorrect documentation, it was tested with the Locust framework: Python scripts simulated between 10 and 50 users who, at increasing request rates, programmatically executed queries on Trino identical in nature to those an Apache Superset dashboard would generate. The tests reported no query-handling failures, but significant stress on the Trino query engine was observed at the peak request rate of the test set.
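A minimal sketch of such a load test, assuming Trino's REST endpoint and an illustrative dashboard-style query (the actual thesis scripts are not reproduced here):
\begin{verbatim}
from locust import HttpUser, between, task

# Each simulated user posts a SQL statement to Trino's REST API,
# mimicking the queries a Superset dashboard would generate.
class DashboardUser(HttpUser):
    host = "http://trino-coordinator.analytics.svc.cluster.local:8080"
    wait_time = between(1, 5)  # seconds of think time per user

    @task
    def run_dashboard_query(self):
        # A full client would follow the returned nextUri to fetch the
        # results; a single POST is enough to exercise the coordinator.
        self.client.post(
            "/v1/statement",
            data="SELECT status, count(*) FROM events GROUP BY status",
            headers={"X-Trino-User": "locust"},
        )
\end{verbatim}
Such a file can be run, for example, with \texttt{locust -f locustfile.py --users 50 --spawn-rate 5} to reproduce the upper end of the tested range.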
Results obtained
The architecture responded to the tests without errors, but at the same time the importance of Trino worker scalability became evident. Given the time Kubernetes needs to schedule new pods, a predictive algorithm based on requests per second and available resources would be necessary to adjust the number of workers ahead of load. This hypothesis is analyzed in some of the papers referenced in the thesis, along with possible developments to maximize cloud hybridization from both an economic and a security perspective. The goal is to expand the architecture in a way that preserves the single-point-of-truth vision without sacrificing speed or the capabilities provided by cloud providers.
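A minimal sketch of how such an adjustment could be actuated on the cluster, assuming a Deployment named trino-worker and an illustrative per-worker capacity; the predictive model itself would replace the simple ratio below.
\begin{verbatim}
import math
from kubernetes import client, config

# Illustrative assumption: requests per second one Trino worker can
# absorb before latency degrades.
RPS_PER_WORKER = 5

def scale_trino_workers(observed_rps: float,
                        namespace: str = "analytics") -> None:
    """Resize the (hypothetical) trino-worker Deployment to match load."""
    config.load_incluster_config()  # load_kube_config() outside the cluster
    apps = client.AppsV1Api()
    replicas = max(1, math.ceil(observed_rps / RPS_PER_WORKER))
    apps.patch_namespaced_deployment_scale(
        name="trino-worker",
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )
\end{verbatim}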