Toward Fault Tolerance in Multi-Agent Reinforcement Learning

Yuchen Shi, Huaxin Pei, Liang Feng, Yi Zhang, Danya Yao

Year: 2025
Citations: 3

Abstract

Agent faults pose a significant threat to the performance of multi-agent reinforcement learning (MARL) algorithms, introducing two key challenges. First, agents often struggle to extract critical information from the chaotic state space created by unexpected faults. Second, transitions recorded before and after faults in the replay buffer affect training unevenly, leading to a sample imbalance problem. To overcome these challenges, this paper enhances the fault tolerance of MARL by combining an optimized model architecture with a tailored training data sampling strategy. Specifically, an attention mechanism is incorporated into the actor and critic networks to automatically detect fault information and dynamically regulate the attention given to faulty agents. Additionally, a prioritization mechanism is introduced to selectively sample transitions critical to current training needs. To further support research in this area, we design and open-source a highly decoupled code platform for fault-tolerant MARL, aimed at improving the efficiency of studying related problems. Experimental results demonstrate the effectiveness of our method in handling various types of faults, faults occurring in any agent, and faults arising at random times.

Note to Practitioners: Multi-agent systems based on MARL outperform those using traditional control methods but remain highly vulnerable to unexpected faults. To improve fault tolerance in such systems, we introduce an attention mechanism that enables the neural network to dynamically adjust its focus on fault-related information. Additionally, a prioritized sampling strategy is employed to select, from the collected experiences, the samples most relevant to current training needs. Experimental results across various fault types demonstrate significant improvements in fault tolerance, validating the robustness of our approach. These findings suggest that the proposed method has the potential to be applied to real-world scenarios, such as multi-robot systems and autonomous vehicle fleets.
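
The abstract does not specify the form of the attention or prioritization mechanisms, so the following is only an illustrative sketch of the two general ideas, not the authors' implementation. It assumes generic scaled dot-product attention over per-agent feature vectors and proportional prioritization in the style of prioritized experience replay; the function names `attention_weights` and `sample_prioritized` and the exponent `alpha` are hypothetical choices made for this example.

```python
import math
import random


def attention_weights(query, keys):
    """Softmax of scaled dot products between one agent's query and other agents' keys.

    Assumption: a generic scaled dot-product form, not the paper's architecture.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_prioritized(priorities, batch_size, alpha=0.6, rng=random):
    """Draw replay-buffer indices with probability proportional to priority**alpha.

    Assumption: proportional prioritization; at least one priority must be positive.
    """
    weights = [p ** alpha for p in priorities]
    return rng.choices(range(len(priorities)), weights=weights, k=batch_size)
```

In such a sketch, an agent whose features carry no useful signal (for example, a zeroed-out vector after a fault) receives a lower attention weight than an agent aligned with the query, and transitions assigned higher priority are drawn more often than the rest of the buffer.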

Keywords

Fault tolerance, Reinforcement learning, Reinforcement, Computer science, Multi-agent system, Distributed computing, Artificial intelligence, Engineering, Structural engineering
