🏗️ Principles of Federated MARL
Literature Review
Federated Multi-Agent Reinforcement Learning (FMARL) [1] adapts the federated learning paradigm to multi-agent reinforcement learning, enabling distributed training across agents under privacy and communication constraints.
Each agent trains its model locally on-device and shares only model parameters, never raw data, for aggregation. This fosters collaborative learning while preserving data privacy.
The decision-making environment of each agent can be modelled as either:
- a Markov Decision Process (MDP), or
- a Decentralised Partially Observable Markov Decision Process (Dec-POMDP).
For the HFMARL architecture proposed in this work, the Dec-POMDP formulation is adopted at the cluster head level.
FMARL Problem Formulation and Reward Structures
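For a system with $N$ agents, the FMARL decision-making environment is the tuple [1]:
$$\langle \mathcal{N}, \{\mathcal{S}_i\}_{i \in \mathcal{N}}, \{\mathcal{A}_i\}_{i \in \mathcal{N}}, \{r_i^t\}_{i \in \mathcal{N}}, \{P_i\}_{i \in \mathcal{N}}, \gamma \rangle$$
where
- $\mathcal{N} = \{1, \dots, N\}$ is the set of agents,
- $\mathcal{S}_i$ is the private state space of agent $i$,
- $\mathcal{A}_i$ is its action set,
- $r_i^t$ is the reward received at time step $t$,
- $P_i(s' \mid s, a)$ is the transition probability from state $s$ to $s'$ after action $a$, and
- $\gamma$ is the discount factor weighting long-term rewards.
At each time step $t$, agent $i$ observes its local state $s_i^t$, selects action $a_i^t$, receives reward $r_i^t$, and transitions to state $s_i^{t+1}$. Over $T$ time steps, a local dataset is formed:
$$\mathcal{D}_i = \{(s_i^t, a_i^t, r_i^t, s_i^{t+1})\}_{t=0}^{T-1}$$
Each agent learns a local policy mapping states to actions, parameterised by $\theta_i$:
$$\pi_{\theta_i} : \mathcal{S}_i \to \mathcal{A}_i$$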
Reward structures in FMARL fall into three categories:
- local independent objectives,
- global collaborative objectives, and
- hybrid objectives.
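The hybrid formulation expresses the reward of agent $i$ as a weighted combination:
$$R_i = \alpha R_i^{\text{ind}} + (1 - \alpha) R^{\text{col}}$$
where
- $R_i^{\text{ind}}$ is the expected cumulative individual reward,
- $R^{\text{col}}$ is the collective reward, and
- $\alpha \in [0, 1]$ controls the balance between individual and collective objectives.
The individual expected cumulative reward is:
$$R_i^{\text{ind}} = \mathbb{E}_{\pi_{\theta_i}}\left[\sum_{t=0}^{T-1} \gamma^t r_i^t\right]$$
The collective reward is:
$$R^{\text{col}} = \mathbb{E}\left[\sum_{t=0}^{T-1} \gamma^t r^t(\mathbf{s}^t, \mathbf{a}^t)\right]$$
where $\mathbf{s}^t$ and $\mathbf{a}^t$ represent the joint states and actions. Each agent therefore seeks to optimise:
$$\theta_i^* = \arg\max_{\theta_i} \left[\alpha R_i^{\text{ind}} + (1 - \alpha) R^{\text{col}}\right]$$
The choice of $\alpha$ determines the learning paradigm.
- For independent learners such as IPPO ($\alpha = 1$), each agent optimises only $R_i^{\text{ind}}$, treating other agents as part of the environment dynamics.
- For centralised training with decentralised execution (CTDE) methods such as MASAC, the hybrid objective is implemented through shared reward signals ($\alpha < 1$), with critics accessing global information during training to enable coordination.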
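The weighted combination of individual and collective objectives can be illustrated with a minimal sketch. It assumes a hypothetical weight `alpha` in [0, 1] that multiplies the individual term (so `alpha = 1` corresponds to a purely independent learner), and it uses a single sampled trajectory as a stand-in for the expectation. The function names are illustrative and do not come from the cited survey.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma**t * r_t over one sampled trajectory."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


def hybrid_objective(
    individual_rewards: Sequence[float],
    collective_rewards: Sequence[float],
    gamma: float,
    alpha: float,
) -> float:
    """Single-trajectory estimate of alpha * R_ind + (1 - alpha) * R_col.

    `alpha` is an assumed convention: 1.0 means individual reward only,
    0.0 means collective reward only.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    r_ind = discounted_return(individual_rewards, gamma)
    r_col = discounted_return(collective_rewards, gamma)
    return alpha * r_ind + (1.0 - alpha) * r_col
```

Setting `alpha=1.0` recovers the independent-learner case, while intermediate values blend in the shared reward signal.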
Model Aggregation in FMARL
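Each agent adapts its local policy to its environment and transmits its parameters (or, in some cases, gradients) for aggregation. With $\theta_i$ denoting the policy parameters of agent $i$ after local updates and $\mathcal{F}$ the aggregation operator, federated aggregation integrates the local models as:
$$\theta_{\text{global}} = \mathcal{F}(\theta_1, \theta_2, \dots, \theta_N)$$
The standard aggregation algorithm is FedAvg [2]:
$$\theta_{\text{global}} = \sum_{i=1}^{N} w_i \theta_i$$
with weights proportional to local dataset sizes:
$$w_i = \frac{|\mathcal{D}_i|}{\sum_{j=1}^{N} |\mathcal{D}_j|}$$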
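A minimal sketch of dataset-size-weighted averaging in the spirit of FedAvg is given below. It assumes each agent's parameters are stored as a dictionary of flat float lists with identical keys and lengths. The `Params` alias and the `fedavg` function name are illustrative choices, not part of any library.

```python
from typing import Dict, List

Params = Dict[str, List[float]]  # hypothetical container: tensor name -> flat values


def fedavg(local_params: List[Params], dataset_sizes: List[int]) -> Params:
    """Average local parameters with weights proportional to dataset size."""
    if not local_params or len(local_params) != len(dataset_sizes):
        raise ValueError("need one dataset size per agent")
    total = sum(dataset_sizes)
    if total <= 0:
        raise ValueError("total dataset size must be positive")
    weights = [n / total for n in dataset_sizes]

    aggregated: Params = {}
    for name in local_params[0]:
        length = len(local_params[0][name])
        aggregated[name] = [
            sum(w * p[name][k] for w, p in zip(weights, local_params))
            for k in range(length)
        ]
    return aggregated
```

For example, two agents holding `{"w": [0.0, 2.0]}` and `{"w": [4.0, 6.0]}` with dataset sizes 1 and 3 receive weights 0.25 and 0.75, so the agent with more data pulls the average towards its own parameters.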
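The aggregation is subject to a set of constraints $\mathcal{C}$:
$$\mathcal{C} = \{\mathcal{C}_{\text{comm}}, \mathcal{C}_{\text{priv}}, \mathcal{C}_{\text{res}}\}$$
The communication constraint encompasses communication interval $T_c$, bandwidth usage $B$, and synchronisation scheme $\sigma_s$:
$$\mathcal{C}_{\text{comm}} = \{T_c, B, \sigma_s\}, \quad \sigma_s \in \Sigma$$
where $\Sigma$ is the set of allowable synchronisation protocols.
The privacy constraint defines bounds on the privacy budget $\epsilon$ and perturbation mechanism $\mathcal{M}$:
$$\mathcal{C}_{\text{priv}} = \{\epsilon \le \epsilon_{\max}, \mathcal{M}\}$$
The resource constraint encompasses memory $\rho_{\text{mem}}$, energy $\rho_{\text{en}}$, and processing power $\rho_{\text{proc}}$:
$$\mathcal{C}_{\text{res}} = \{\rho_{\text{mem}}, \rho_{\text{en}}, \rho_{\text{proc}}\}$$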
FMARL Architecture Designs
The HFMARL architecture employs both centralised and decentralised aggregation in a hierarchical structure.
Within each cluster, a centralised client-server model is used:
- the cluster head aggregates updates from follower agents.
Between clusters, a decentralised peer-to-peer model is used:
- cluster heads exchange parameters directly with neighbouring cluster heads without a central entity.
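The two-level structure described above can be sketched as follows. This is an illustration under stated assumptions rather than the HFMARL implementation: parameters are flat float lists, the cluster head uses an unweighted mean over its followers, and the peer-to-peer step is modelled as each head averaging its own model with those of its listed neighbours. The source does not specify the inter-cluster mixing rule, and all function names are hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def average(vectors: List[Vector]) -> Vector:
    """Element-wise unweighted mean of equally sized vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]


def intra_cluster_aggregate(follower_params: List[Vector]) -> Vector:
    """Client-server step: the cluster head averages its followers' updates."""
    return average(follower_params)


def inter_cluster_exchange(
    head_params: Dict[str, Vector],
    neighbours: Dict[str, List[str]],
) -> Dict[str, Vector]:
    """Peer-to-peer step: each head mixes with its neighbouring heads.

    All heads read the pre-exchange models, so the result does not depend
    on iteration order. Neighbour averaging is an assumed mixing rule.
    """
    updated: Dict[str, Vector] = {}
    for head, own in head_params.items():
        peers = [head_params[n] for n in neighbours.get(head, [])]
        updated[head] = average([own] + peers)
    return updated
```

A round would first call `intra_cluster_aggregate` once per cluster, then pass the resulting head models to `inter_cluster_exchange` together with the neighbourhood graph. No central entity takes part in the second step.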
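Horizontal federated learning applies when agents share the same feature space, i.e. identical state and action spaces, but operate in distinct environments.
Formally, if $\mathcal{E}$ is the set of all environments with environments $E_i, E_j \in \mathcal{E}$ for agents $i$ and $j$:
$$\mathcal{S}_i = \mathcal{S}_j, \quad \mathcal{A}_i = \mathcal{A}_j, \quad E_i \neq E_j, \quad \forall\, i \neq j$$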
Since all tasks are similar in nature, transfer learning opportunities arise across clusters. FMARL addresses data privacy concerns by avoiding the direct sharing of raw data and supporting different encryption methods within the system.
In terms of synchronisation, a semi-synchronous approach is adopted.
- Intra-cluster communication from follower agents to the cluster head operates asynchronously, accommodating heterogeneous agent speeds and intermittent connectivity.
- Inter-cluster communication between cluster heads operates synchronously when possible, providing coordination advantages for global model updates.
This scheme accounts for varying communication conditions across agents while maintaining coherent high-level coordination.
RL Algorithm Selection
For the heterogeneous HFMARL architecture, cluster heads were initially designed to run MASAC and follower agents to run IPPO, both actor-critic methods suited to continuous control in complex environments. Actor-critic algorithms provide the exploration-exploitation balance necessary for tasks requiring safe path planning while adapting to dynamic conditions, and support the continuous action spaces needed for fine-grained locomotion control.
Privacy Protection — Differential Privacy
Differential privacy (DP) [1] provides a flexible privacy mechanism suitable for large-scale distributed systems.
- DP works by injecting calibrated noise, typically drawn from the Laplace or Gaussian distribution, into model gradients before they are transmitted for aggregation.
- This preserves the statistical properties of the data while mitigating the risk of reverse engineering.
The privacy budget $\epsilon$ controls the trade-off between privacy and model utility:
- a smaller $\epsilon$ provides stronger privacy guarantees at the cost of noisier gradients, potentially degrading learning performance.
In HFMARL, the privacy budget can be dynamically adjusted based on the stage of learning, sensitivity of the data, or the specific privacy requirements of each aggregation round.
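The noise-injection step can be sketched with the classical Gaussian mechanism. The sketch rests on several assumptions that are not stated in the source: gradients are clipped to an L2 norm `clip_norm`, that norm is treated as the L2 sensitivity, and the noise scale follows the standard calibration `sensitivity * sqrt(2 ln(1.25 / delta)) / epsilon`, which holds for `epsilon` in (0, 1). The function names and the `delta` parameter are illustrative.

```python
import math
import random
from typing import List, Optional


def clip_l2(grad: List[float], clip_norm: float) -> List[float]:
    """Scale the gradient down so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= clip_norm or norm == 0.0:
        return list(grad)
    scale = clip_norm / norm
    return [g * scale for g in grad]


def gaussian_sigma(epsilon: float, delta: float, sensitivity: float) -> float:
    """Noise standard deviation for the classical Gaussian mechanism."""
    if not 0.0 < epsilon < 1.0:
        raise ValueError("this calibration assumes 0 < epsilon < 1")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def privatise_gradient(
    grad: List[float],
    epsilon: float,
    delta: float,
    clip_norm: float,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Clip a gradient, then add Gaussian noise before transmission."""
    rng = rng or random.Random()
    clipped = clip_l2(grad, clip_norm)
    # Assumption: the clipping norm is used as the L2 sensitivity.
    sigma = gaussian_sigma(epsilon, delta, sensitivity=clip_norm)
    return [g + rng.gauss(0.0, sigma) for g in clipped]
```

Because the noise scale is inversely proportional to `epsilon`, halving the budget doubles the standard deviation of the injected noise, which is the privacy-utility trade-off described above. A round-dependent budget would simply pass a different `epsilon` on each call.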
Distributed Optimisation
For the HFMARL framework, parameter sharing in segments is adopted as the communication method, following the approach of [3]. This reduces bandwidth requirements compared to transmitting full model parameters or gradients, which is particularly important in communication-constrained environments.
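To illustrate why sharing parameters in segments reduces per-round traffic, the sketch below splits a flat parameter vector into contiguous segments and mixes only one received segment into the local model. It is a deliberately simplified illustration and not the regulated segment mixture of [3]: the contiguous split, the fixed mixing weight `mix`, and all function names are assumptions.

```python
from typing import List, Tuple


def split_segments(num_params: int, num_segments: int) -> List[Tuple[int, int]]:
    """Return (start, end) index bounds of near-equal contiguous segments."""
    if num_segments <= 0 or num_segments > num_params:
        raise ValueError("need 1 <= num_segments <= num_params")
    base, extra = divmod(num_params, num_segments)
    bounds = []
    start = 0
    for s in range(num_segments):
        end = start + base + (1 if s < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def mix_segment(
    local: List[float],
    received: List[float],
    bounds: Tuple[int, int],
    mix: float = 0.5,
) -> List[float]:
    """Blend one received segment into the matching slice of the local model."""
    start, end = bounds
    if len(received) != end - start:
        raise ValueError("received segment has the wrong length")
    out = list(local)
    for k in range(start, end):
        out[k] = (1.0 - mix) * local[k] + mix * received[k - start]
    return out
```

With `num_segments` segments, each message carries roughly `1 / num_segments` of the full parameter vector. A simple schedule would send segment `round_index % num_segments` in each round.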
Convergence Considerations
Convergence in FMARL can be challenging due to non-IID data distributions, delayed synchronisation, noisy gradient updates, and the inherently non-stationary dynamics of multi-agent systems. Xu et al. [4] derive a gradient convergence bound for federated multi-agent reinforcement learning. For a system of $N$ agents, where each agent $i$ has local policy parameters $\theta_i$, performs $\tau$ local update steps between aggregation rounds, and $\bar{\theta}^k$ denotes the aggregated parameters from round $k$, the expected gradient norm over $K$ rounds is upper-bounded (with proper step size $\eta$; the explicit expression is given in [4])
under the following assumptions:
- the global objective $F$ is $L$-smooth, i.e. $\|\nabla F(\theta) - \nabla F(\theta')\| \le L \|\theta - \theta'\|$,
- local gradient noise is bounded as $\mathbb{E}\|g_i(\theta) - \nabla F_i(\theta)\|^2 \le \sigma^2$, where $g_i$ is the stochastic gradient computed by agent $i$, and
- gradient heterogeneity is bounded as $\|\nabla F_i(\theta) - \nabla F(\theta)\|^2 \le \zeta^2$ for all agents.
This bound reveals a fundamental trade-off: increasing the number of local updates $\tau$ reduces communication frequency but increases the divergence term, which grows with the heterogeneity of local environments.
Balancing communication efficiency against model divergence, particularly in systems with limited bandwidth and latency sensitivity, motivates the adaptive federated aggregation strategies adopted in the HFMARL framework.
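The qualitative trade-off between local updates and model divergence can be reproduced on a toy problem. The sketch below does not implement the bound of [4]. It assumes each agent minimises a scalar quadratic with its own optimum, a stand-in for heterogeneous local environments, and it measures how far the local models spread apart after a given number of noiseless local gradient steps from a shared starting point. All names are illustrative.

```python
from typing import List


def local_steps(theta: float, optimum: float, num_steps: int, step_size: float) -> float:
    """Run gradient descent on f(theta) = 0.5 * (theta - optimum)**2."""
    for _ in range(num_steps):
        theta -= step_size * (theta - optimum)
    return theta


def client_drift(
    theta0: float,
    optima: List[float],
    num_steps: int,
    step_size: float,
) -> float:
    """Mean squared distance of the local models from their average."""
    local = [local_steps(theta0, c, num_steps, step_size) for c in optima]
    mean = sum(local) / len(local)
    return sum((x - mean) ** 2 for x in local) / len(local)
```

With optima `[-1.0, 1.0]` and step size 0.1, the drift after ten local steps is larger than after one, and it vanishes when all optima coincide. This mirrors the statement that divergence grows with both the number of local updates and the heterogeneity of local environments.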
References
[1] Y. Jing, B. Guo, N. Li, R. Xu, and Z. Yu, "Federated multi-agent reinforcement learning: A comprehensive survey of methods, applications and challenges," Expert Systems with Applications, 2025. doi:10.1016/j.eswa.2025.128729.
[2] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in Proc. Int. Conf. Artif. Intell. Statist. (AISTATS), 2017, pp. 1273–1282.
[3] X. Yu, R. Li, C. Liang, and Z. Zhao, "Communication-efficient soft actor-critic policy collaboration via regulated segment mixture," arXiv:2312.10123, 2024.
[4] X. Xu, R. Li, Z. Zhao, and H. Zhang, "The gradient convergence bound of federated multi-agent reinforcement learning with efficient communication," IEEE Trans. Wireless Commun., vol. 23, no. 1, pp. 507–528, Jan. 2024.