RTilience: Fault-Tolerant Time-Critical Kubernetes

Harald Gustafsson, Raquel A. F. Mini, Luca Abeni, Remo Andreoli, Tommaso Cucinotta

Year
2025
Citations
2

Abstract

This paper tackles the problem of optimal configuration and deployment of fault-tolerant time-critical service chains with arbitrary DAG-like topologies. We propose RTilience, designed according to a scalable cloud microservice paradigm and prototyped on top of the well-known Kubernetes cloud orchestrator. It features real-time reservation scheduling of containers to guarantee temporal isolation of time-critical tasks, giving fine-grained control of compute latencies while still allowing containers to share physical CPUs. A distributed routing library, ReqRoute, is configured with a timeout and with primary and secondary routes, enabling autonomous and decentralized handling of failing requests. The routes are configured by a centralized controller that performs admission control, resource management of microservice instances, task placement, and fault detection and recovery, extending the features available in Kubernetes. Admission control is based on a theoretical framework comprising a worst-case performance model for the end-to-end response time under various fault-handling options, and an optimization framework that computes the optimal resource allocation for admitted services. Extensive experiments with synthetic examples and an autonomous transport robot use case verify that end-to-end deadlines are respected, even in the presence of high fault rates of individual microservice instances, in line with the theoretical expectations. RTilience is available as open-source software, released under an MIT license.
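
The timeout-plus-fallback idea behind the routing library can be illustrated with a minimal Python sketch. This is not the ReqRoute API: the function `call_with_fallback`, the route callables `slow_primary` and `healthy_secondary`, and the timeout values are all hypothetical names and numbers chosen for illustration. The sketch only shows a caller that tries a primary route, gives up after a timeout or an error, and then retries on a secondary route without consulting a central component.

```python
import concurrent.futures as cf
import time


def call_with_fallback(routes, request, timeout_s):
    """Try each route in order; fall through on error or timeout.

    `routes` is an ordered list of callables (primary first). This is an
    illustrative stand-in, not the actual ReqRoute interface.
    """
    last_error = None
    for route in routes:
        pool = cf.ThreadPoolExecutor(max_workers=1)
        future = pool.submit(route, request)
        try:
            # Wait at most timeout_s for this route to answer.
            return future.result(timeout=timeout_s)
        except Exception as exc:  # covers timeouts and route failures
            last_error = exc
        finally:
            # Do not block on a route that is still running.
            pool.shutdown(wait=False)
    raise RuntimeError("all routes failed") from last_error


def slow_primary(request):
    """Hypothetical primary instance that answers too late."""
    time.sleep(0.3)
    return "primary:" + request


def healthy_secondary(request):
    """Hypothetical secondary instance that answers immediately."""
    return "secondary:" + request


if __name__ == "__main__":
    print(call_with_fallback([slow_primary, healthy_secondary], "job-1", 0.05))
```

In this sketch a caller waits roughly `len(routes) * timeout_s` at most before either receiving a reply or an error, which is why the timeout has to be chosen together with the end-to-end deadline. The paper's worst-case model is more detailed than this simple bound.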

Keywords

Reservation; Scalability; Cloud computing; Timeout; Software deployment; Scheduling (production processes); Fault tolerance; Resource allocation; Task (project management); Routing (electronic design automation)
