🏗️ Principles of Federated MARL

Literature Review

Federated Multi-Agent Reinforcement Learning (FMARL) [1] adapts the federated learning paradigm to multi-agent reinforcement learning, enabling distributed training across agents under privacy and communication constraints.

Each agent trains its model locally on-device and shares only model parameters, never raw data, for aggregation. This fosters collaborative learning while preserving data privacy.

The decision-making environment of each agent can be modelled as either:

  • a Markov Decision Process (MDP), or
  • a Decentralised Partially Observable Markov Decision Process (Dec-POMDP).

For the HFMARL architecture proposed in this work, the Dec-POMDP formulation is adopted at the cluster head level.

FMARL Problem Formulation and Reward Structures

For a system with $n$ agents, the FMARL decision-making environment is the tuple [1]:

$$\Gamma = (I, \{S_i\}_{i\in I}, \{A_i\}_{i\in I}, \{P_i\}_{i\in I}, \{R_i\}_{i\in I}, \gamma) \qquad (1)$$

where

  • $I$ is the set of $n$ agents,
  • $S_i$ is the private state space of agent $i$,
  • $A_i$ is its action set,
  • $r_i^t = r_i(s_i^t, a_i^t)$ is the reward received at time step $t$,
  • $P_i$ is the transition probability from state $s_i^t$ to $s_i^{t+1}$ after action $a_i^t$, and
  • $\gamma$ is the discount factor weighting long-term rewards.

At each time step $t$, agent $i$ observes its local state $s_i^t$, selects action $a_i^t$, receives reward $r_i^t$, and transitions to state $s_i^{t+1}$. Over $T$ time steps, a local dataset is formed:

$$D_i = \{s_i^{t}, a_i^{t}, r_i^{t}, s_i^{t+1}\}_{t=1}^{T} \qquad (2)$$

Each agent learns a local policy $\pi_i$ mapping states to actions, parameterised by $\theta_i$:

$$\pi_i(a_i^{t} \mid s_i^{t}; \theta_i) \qquad (3)$$

Reward structures in FMARL fall into three categories:

  • local independent objectives,
  • global collaborative objectives, and
  • hybrid objectives.

The hybrid formulation expresses the reward of agent $i$ as a weighted combination:

$$J(\theta_i) = \lambda J_i(\theta_i) + (1 - \lambda) J_{\text{global}}(\theta_1, \ldots, \theta_n) \qquad (4)$$

where

  • $J_i(\theta_i)$ is the expected cumulative individual reward,
  • $J_{\text{global}}(\theta_1, \ldots, \theta_n)$ is the collective reward, and
  • $\lambda \in [0, 1]$ controls the balance between individual and collective objectives.

The individual expected cumulative reward is:

$$J_i(\theta_i) = \mathbb{E}_{\pi_i} \left[ \sum_{t=0}^{\infty} \gamma^t r_i(s_i^t, a_i^t) \right] \qquad (5)$$
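As a rough illustration of Equation (5), the expectation can be approximated by averaging discounted returns over sampled trajectories. The sketch below computes the discounted return of a single finite reward sequence; the function name `discounted_return` and the example rewards are hypothetical and not part of the cited formulation.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma^t * r_t for one finite trajectory of rewards."""
    total = 0.0
    # Accumulate backwards so each step is: r_t + gamma * (return from t+1).
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical three-step trajectory for one agent.
rewards = [1.0, 0.0, 2.0]
print(discounted_return(rewards, gamma=0.5))  # 1.0 + 0.5*0.0 + 0.25*2.0 = 1.5
```

Averaging this quantity over many trajectories sampled under $\pi_i$ gives a Monte Carlo estimate of $J_i(\theta_i)$, truncated to the trajectory length.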

The collective reward is:

$$J_{\text{global}}(\theta_1, \ldots, \theta_n) = \mathbb{E}_{\pi} \left[ \sum_{t=0}^{\infty} \gamma^t r_{\text{global}}(s^t, a^t) \right] \qquad (6)$$

where $s^t = (s_1^t, \ldots, s_n^t)$ and $a^t = (a_1^t, \ldots, a_n^t)$ represent the joint states and actions. Each agent therefore seeks to optimise:

$$\theta_i^* = \arg \max_{\theta_i} \left( \lambda J_i(\theta_i) + (1 - \lambda) J_{\text{global}}(\theta_1, \ldots, \theta_n) \right) \qquad (7)$$

The choice of $\lambda$ determines the learning paradigm.

  • For independent learners such as IPPO ($\lambda = 1$), each agent optimises only $J_i(\theta_i)$, treating other agents as part of the environment dynamics.
  • For centralised training with decentralised execution (CTDE) methods such as MASAC, the hybrid objective is implemented through shared reward signals $r_i^t = \lambda r_i(s_i^t, a_i^t) + (1-\lambda) r_{\text{global}}(s^t, a^t)$, with critics accessing global information during training to enable coordination.
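The shared reward signal in the CTDE case can be sketched as a simple convex combination. This is a minimal illustration, assuming scalar rewards; the function name `hybrid_reward` and the numeric values are hypothetical.

```python
def hybrid_reward(local_reward, global_reward, lam):
    """Mix an agent's local reward with the global reward using weight lam."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * local_reward + (1.0 - lam) * global_reward


# lam = 1 recovers the purely local signal used by independent learners.
print(hybrid_reward(2.0, 4.0, lam=1.0))   # 2.0
# An intermediate lam blends individual and collective objectives.
print(hybrid_reward(2.0, 4.0, lam=0.25))  # 0.25*2.0 + 0.75*4.0 = 3.5
```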

Model Aggregation in FMARL

Each agent adapts its local policy to its environment and transmits its parameters (or, in some cases, gradients) for aggregation. With $\boldsymbol{\theta}_i^k$ denoting the policy parameters of agent $i$ after $k$ local updates and $F_{\text{agg}}$ the aggregation operator, federated aggregation integrates the local models as:

$$\boldsymbol{\theta}_{\text{agg}}^{k+1} = F_{\text{agg}}\left(\{\boldsymbol{\theta}_i^k\}_{i=1}^n \mid \mathcal{C}\right) \qquad (8)$$

The standard aggregation algorithm is FedAvg [2]:

$$\boldsymbol{\theta}_{\text{agg}}^{k+1} = \sum_{i=1}^n \alpha_i \boldsymbol{\theta}_i^k \qquad (9)$$

with weights proportional to local dataset sizes:

$$\alpha_i = \frac{|D_i|}{\sum_{j=1}^n |D_j|} \qquad (10)$$
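Equations (9) and (10) can be sketched in a few lines. The example below assumes each agent's parameters are flattened into a plain list of floats of equal length; `fedavg` and the sample values are hypothetical names chosen for illustration.

```python
def fedavg(local_params, dataset_sizes):
    """Weighted average of parameter vectors, weights proportional to |D_i|."""
    if len(local_params) != len(dataset_sizes):
        raise ValueError("need one dataset size per agent")
    total = sum(dataset_sizes)
    if total <= 0:
        raise ValueError("total dataset size must be positive")
    # alpha_i = |D_i| / sum_j |D_j|   (Equation 10)
    weights = [size / total for size in dataset_sizes]
    dim = len(local_params[0])
    # theta_agg[j] = sum_i alpha_i * theta_i[j]   (Equation 9)
    return [
        sum(w * params[j] for w, params in zip(weights, local_params))
        for j in range(dim)
    ]


# Two agents; the second holds three times as much local data.
print(fedavg([[1.0, 2.0], [3.0, 4.0]], dataset_sizes=[1, 3]))  # [2.5, 3.5]
```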

The aggregation is subject to a set of constraints $\mathcal{C}$:

$$\mathcal{C} = \{\mathcal{C}_{\text{comm}}, \mathcal{C}_{\text{privacy}}, \mathcal{C}_{\text{resource}}\} \qquad (11)$$

The communication constraint encompasses communication interval $\tau$, bandwidth usage $b$, and synchronisation scheme $\phi$:

$$\mathcal{C}_{\text{comm}} := \{(\tau, b, \phi) \mid \tau \leq \tau_{\max},\ b \leq b_{\max},\ \phi \in \Phi\} \qquad (12)$$

where $\Phi$ is the set of allowable synchronisation protocols.

The privacy constraint defines bounds on the privacy budget $\epsilon$ and perturbation mechanism $\mathcal{M}$:

$$\mathcal{C}_{\text{privacy}} := \{(\epsilon, \mathcal{M}) \mid \epsilon \leq \epsilon_{\max},\ \mathcal{M} \in \mathbb{M}\} \qquad (13)$$

The resource constraint encompasses memory $m$, energy $e$, and processing power $c$:

$$\mathcal{C}_{\text{resource}} := \{(m, e, c) \mid m \leq m_{\max},\ e \leq e_{\max},\ c \leq c_{\max}\} \qquad (14)$$
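One way to read Equations (11) to (14) is as a feasibility check applied before each aggregation round. The sketch below is only an illustration under that assumption: the class names `ConstraintLimits` and `RoundSettings`, the function `violated_constraints`, and all numeric values and protocol labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConstraintLimits:
    tau_max: float                 # upper bound on communication interval
    b_max: float                   # upper bound on bandwidth usage
    allowed_sync: frozenset        # Phi: allowable synchronisation protocols
    epsilon_max: float             # upper bound on privacy budget
    allowed_mechanisms: frozenset  # admissible perturbation mechanisms
    m_max: float                   # memory limit
    e_max: float                   # energy limit
    c_max: float                   # processing-power limit


@dataclass(frozen=True)
class RoundSettings:
    tau: float
    b: float
    phi: str
    epsilon: float
    mechanism: str
    m: float
    e: float
    c: float


def violated_constraints(s, lim):
    """Return the names of the constraint groups that the settings violate."""
    violated = []
    if not (s.tau <= lim.tau_max and s.b <= lim.b_max and s.phi in lim.allowed_sync):
        violated.append("comm")
    if not (s.epsilon <= lim.epsilon_max and s.mechanism in lim.allowed_mechanisms):
        violated.append("privacy")
    if not (s.m <= lim.m_max and s.e <= lim.e_max and s.c <= lim.c_max):
        violated.append("resource")
    return violated


limits = ConstraintLimits(10.0, 5.0, frozenset({"sync", "async"}),
                          1.0, frozenset({"gaussian", "laplace"}),
                          512.0, 100.0, 4.0)
settings = RoundSettings(tau=5.0, b=2.0, phi="async", epsilon=2.0,
                         mechanism="gaussian", m=256.0, e=50.0, c=2.0)
print(violated_constraints(settings, limits))  # ['privacy'] because epsilon > epsilon_max
```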

FMARL Architecture Designs

The HFMARL architecture employs both centralised and decentralised aggregation in a hierarchical structure.

Within each cluster, a centralised client-server model is used:

  • the cluster head aggregates updates from follower agents.

Between clusters, a decentralised peer-to-peer model is used:

  • cluster heads exchange parameters directly with neighbouring cluster heads without a central entity.
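The two aggregation levels can be sketched as follows. This is a simplified illustration, not the HFMARL implementation: it assumes unweighted averaging inside each cluster and uniform averaging between a cluster head and its neighbours, neither of which is specified in the text. All function names and values are hypothetical.

```python
def average(vectors):
    """Element-wise mean of equal-length parameter vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def intra_cluster_aggregate(follower_params):
    """Client-server step: the cluster head averages its followers' parameters."""
    return average(follower_params)


def inter_cluster_exchange(head_params, neighbours):
    """Peer-to-peer step: each head averages itself with its neighbouring heads."""
    return {
        head: average([head_params[head]] + [head_params[nb] for nb in neighbours[head]])
        for head in head_params
    }


# Two clusters with two followers each, using one-dimensional parameters.
heads = {
    "A": intra_cluster_aggregate([[0.0], [2.0]]),  # [1.0]
    "B": intra_cluster_aggregate([[4.0], [6.0]]),  # [5.0]
}
neighbours = {"A": ["B"], "B": ["A"]}
print(inter_cluster_exchange(heads, neighbours))  # {'A': [3.0], 'B': [3.0]}
```

All heads are updated from the same snapshot of `head_params`, so the result does not depend on the order in which heads are processed.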

Horizontal federated learning applies when agents share the same feature space, that is, identical state and action spaces, but operate in distinct environments.

Formally, let $G$ be the set of all environments, containing $n$ environments $\{E_i\}_{i=1}^n$ for the $n$ agents. Then:

$$S_i = S_j,\ A_i = A_j,\ E_i \ne E_j, \quad \forall i, j \in \{1, 2, \ldots, n\},\ E_i, E_j \in G,\ i \ne j \qquad (15)$$

Since all tasks are similar in nature, transfer learning opportunities arise across clusters. FMARL addresses data privacy concerns by avoiding the direct sharing of raw data and supporting different encryption methods within the system.

In terms of synchronisation, a semi-synchronous approach is adopted.

  • Intra-cluster communication from follower agents to the cluster head operates asynchronously, accommodating heterogeneous agent speeds and intermittent connectivity.
  • Inter-cluster communication between cluster heads operates synchronously when possible, providing coordination advantages for global model updates.
    • This accounts for varying communication conditions across agents while maintaining coherent high-level coordination.

RL Algorithm Selection

For the heterogeneous HFMARL architecture, cluster heads were initially designed to run MASAC and follower agents to run IPPO, both of which are actor-critic methods suited to continuous control in complex environments. Actor-critic algorithms provide the exploration-exploitation balance necessary for tasks requiring safe path planning while adapting to dynamic conditions, and support the continuous action spaces needed for fine-grained locomotion control.

Privacy Protection — Differential Privacy

Differential privacy (DP) [1] provides a flexible privacy mechanism suitable for large-scale distributed systems.

  • DP works by injecting calibrated noise, typically drawn from the Laplace or Gaussian distribution, into model gradients before they are transmitted for aggregation.
  • This preserves the statistical properties of the data while mitigating the risk of reverse engineering.

The privacy budget $\epsilon$ controls the trade-off between privacy and model utility:

  • smaller $\epsilon$ provides stronger privacy guarantees at the cost of noisier gradients, potentially degrading learning performance.

In HFMARL, the privacy budget can be dynamically adjusted based on the stage of learning, sensitivity of the data, or the specific privacy requirements of each aggregation round.
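The noise-injection step can be illustrated with the classical Gaussian mechanism, which scales the noise standard deviation as $\sigma = \Delta \sqrt{2 \ln(1.25/\delta)} / \epsilon$ (a calibration usually stated for $\epsilon < 1$). The sketch below makes several assumptions that are not in the text: gradients are clipped to an L2 norm `clip_norm`, that norm is used as the sensitivity $\Delta$, and a failure probability `delta` is introduced. All function names are hypothetical.

```python
import math
import random


def clip_l2(grad, clip_norm):
    """Scale the gradient down so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def gaussian_sigma(epsilon, delta, sensitivity):
    """Noise scale of the classical Gaussian mechanism."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def privatise(grad, epsilon, delta, clip_norm, rng):
    """Clip the gradient, then add independent Gaussian noise to each entry."""
    clipped = clip_l2(grad, clip_norm)
    sigma = gaussian_sigma(epsilon, delta, sensitivity=clip_norm)  # assumed sensitivity
    return [g + rng.gauss(0.0, sigma) for g in clipped]


rng = random.Random(0)  # seeded only to make the illustration repeatable
noisy = privatise([3.0, 4.0], epsilon=0.5, delta=1e-5, clip_norm=1.0, rng=rng)
print(noisy)

# A smaller epsilon yields a larger noise scale, i.e. stronger privacy, noisier gradients.
print(gaussian_sigma(0.5, 1e-5, 1.0) > gaussian_sigma(0.9, 1e-5, 1.0))  # True
```

Adjusting the privacy budget per round, as described above, would correspond to passing a different `epsilon` to `privatise` at each aggregation round.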

Distributed Optimisation

For the HFMARL framework, parameter sharing in segments is adopted as the communication method, following the approach of [3]. This reduces bandwidth requirements compared to transmitting full model parameters or gradients, which is particularly important in communication-constrained environments.
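To give a feel for why sharing segments saves bandwidth, the sketch below splits a flattened parameter vector into contiguous segments, transmits one segment per round in round-robin order, and mixes only that segment into the receiver's parameters. This is a simplified illustration and not the regulated segment mixture method of [3]; the round-robin schedule, the fixed mixing weight, and all function names are assumptions.

```python
import math


def split_segments(params, num_segments):
    """Split params into contiguous (start_index, values) segments."""
    if not params or num_segments <= 0:
        raise ValueError("need non-empty params and a positive segment count")
    size = math.ceil(len(params) / num_segments)
    return [(start, params[start:start + size]) for start in range(0, len(params), size)]


def segment_for_round(params, num_segments, round_idx):
    """Pick the segment to transmit this round (round-robin, an assumption)."""
    segments = split_segments(params, num_segments)
    return segments[round_idx % len(segments)]


def mix_segment(local, start, segment, weight=0.5):
    """Blend a received segment into the matching slice of the local parameters."""
    mixed = list(local)
    for offset, value in enumerate(segment):
        idx = start + offset
        mixed[idx] = (1.0 - weight) * mixed[idx] + weight * value
    return mixed


sender = [1.0, 2.0, 3.0, 4.0]
start, segment = segment_for_round(sender, num_segments=2, round_idx=1)
print(start, segment)                                # 2 [3.0, 4.0]
print(mix_segment([0.0, 0.0, 0.0, 0.0], start, segment))  # [0.0, 0.0, 1.5, 2.0]
```

With `num_segments` segments, each round transmits roughly `1 / num_segments` of the full parameter vector.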

Convergence Considerations

Convergence in FMARL can be challenging due to non-IID data distributions, delayed synchronisation, noisy gradient updates, and the inherently non-stationary dynamics of multi-agent systems. Xu et al. [4] derive a gradient convergence bound for federated multi-agent reinforcement learning. Consider a system of $n$ agents, where each agent $i$ has local policy parameters $\theta_i \in \mathbb{R}^d$ and performs $T$ local update steps between aggregation rounds, and let $\theta^{(r)}$ denote the aggregated parameters from round $r$. With a suitably chosen step size $\eta$, the expected gradient norm over $R$ rounds satisfies:

$$\frac{1}{R} \sum_{r=0}^{R-1} \mathbb{E}\left[\|\nabla J(\theta^{(r)})\|^2\right] \le \mathcal{O}\left(\frac{1}{\sqrt{nRT}} + \frac{T \zeta^2}{R}\right) \qquad (16)$$

under the following assumptions:

  • the global objective $J(\theta)$ is $L$-smooth, i.e. $\|\nabla J(\theta) - \nabla J(\theta')\| \le L\|\theta - \theta'\|$,
  • local gradient noise is bounded as $\mathbb{E}[\|\nabla J_i(\theta) - g_i(\theta)\|^2] \le \sigma^2$, where $g_i(\theta)$ is the stochastic gradient computed by agent $i$, and
  • gradient heterogeneity is bounded as $\|\nabla J_i(\theta) - \nabla J(\theta)\| \le \zeta$ for all agents.

This bound reveals a fundamental trade-off: increasing the number of local updates $T$ reduces communication frequency but increases the divergence term $T\zeta^2/R$, which grows with the heterogeneity of local environments.

  • Balancing communication efficiency against model divergence, particularly in systems with limited bandwidth and latency sensitivity, motivates the adaptive federated aggregation strategies adopted in the HFMARL framework.
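The trade-off in Equation (16) can be made concrete by evaluating its two terms for a fixed number of rounds $R$ and increasing $T$. The sketch below ignores the constants hidden by the $\mathcal{O}(\cdot)$ notation, so the numbers indicate trends only; the function name `bound_terms` and the chosen values of $n$, $R$, and $\zeta$ are hypothetical.

```python
import math


def bound_terms(n, R, T, zeta):
    """Return the two terms of the bound: 1/sqrt(nRT) and T*zeta^2/R."""
    statistical = 1.0 / math.sqrt(n * R * T)
    divergence = T * zeta ** 2 / R
    return statistical, divergence


# Hypothetical setting: 4 agents, 100 rounds, heterogeneity zeta = 1.
for T in (1, 4, 16):
    statistical, divergence = bound_terms(n=4, R=100, T=T, zeta=1.0)
    print(f"T={T:2d}  1/sqrt(nRT)={statistical:.4f}  T*zeta^2/R={divergence:.4f}")
```

In this setting the first term shrinks as $T$ grows while the second grows linearly in $T$, which is the tension that an adaptive choice of local update count would aim to balance.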

References

[1] Y. Jing, B. Guo, N. Li, R. Xu, and Z. Yu, "Federated multi-agent reinforcement learning: A comprehensive survey of methods, applications and challenges," Expert Systems with Applications, 2025. doi:10.1016/j.eswa.2025.128729.

[2] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in Proc. Int. Conf. Artif. Intell. Statist. (AISTATS), 2017, pp. 1273–1282.

[3] X. Yu, R. Li, C. Liang, and Z. Zhao, "Communication-efficient soft actor-critic policy collaboration via regulated segment mixture," arXiv:2312.10123, 2024.

[4] X. Xu, R. Li, Z. Zhao, and H. Zhang, "The gradient convergence bound of federated multi-agent reinforcement learning with efficient communication," IEEE Trans. Wireless Commun., vol. 23, no. 1, pp. 507–528, Jan. 2024.