🪛 RSM-MASAC — Adaptive Decentralised Federated MARL

Literature Review

Yu et al. [1] proposed RSM-MASAC, a decentralised federated MARL framework that extends SAC to multi-agent settings with peer-to-peer communication. The framework was validated in mixed-autonomy traffic control scenarios, where it approached the converged performance of centralised FMARL while eliminating the single point of failure inherent to centralised architectures.

In RSM-MASAC, $N$ agents each run a local SAC instance. Agents train independently and periodically exchange only their policy parameters $\theta$ with neighbours. A communication round is initiated every $U$ policy updates, at which point agent $i$ receives policy parameters from its neighbour set $\Omega_i$ and aggregates them using an adaptive function $f(\cdot)$. The communication phase is formulated as a constrained optimisation that minimises transmission cost subject to a cumulative reward threshold, with agents mixing their local parameters with an aggregated referential policy via a regulated mixing metric $\zeta$ (see [1] for the full formulation).

The central insight of [1] is that naive parameter averaging, the standard approach in decentralised federated learning:

$$\theta_{\text{new}}^{(i)} = \frac{1}{|\Omega_i| + 1} \left( \theta^{(i)} + \sum_{j \in \Omega_i} \theta^{(j)} \right) \qquad (1)$$

provides no guarantee of policy improvement, since not all neighbours necessarily have superior policies. RSM-MASAC replaces this with an adaptive mixing approach. The mixed policy distribution is defined as:

$$\pi_{\text{mix}}(a \mid s) = (1 - \beta)\pi(a \mid s) + \beta \tilde{\pi}(a \mid s) \qquad (2)$$

where agent $i$ retains $(1-\beta)$ weight on its own policy $\pi$ and borrows $\beta$ from a referential policy $\tilde{\pi}$ constructed from neighbours' parameters.
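The contrast between Eq. (1) and Eq. (2) can be illustrated with a minimal sketch. It assumes parameters are flat lists of floats and, purely for illustration, a discrete action distribution for a single state (SAC policies are typically continuous). The function names `average_parameters` and `mix_policies` are hypothetical and do not come from [1].

```python
def average_parameters(theta_i, neighbour_thetas):
    """Eq. (1): uniform average of an agent's parameters and its neighbours'."""
    all_thetas = [theta_i] + list(neighbour_thetas)
    count = len(all_thetas)  # |Omega_i| + 1
    return [sum(values) / count for values in zip(*all_thetas)]


def mix_policies(pi, pi_ref, beta):
    """Eq. (2): convex combination of two action distributions for one state."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return [(1.0 - beta) * p + beta * q for p, q in zip(pi, pi_ref)]


# Illustrative values only (assumed, not taken from [1]).
theta_new = average_parameters([1.0, 2.0], [[3.0, 4.0], [5.0, 6.0]])
pi_mix = mix_policies([0.5, 0.5, 0.0], [0.0, 0.5, 0.5], beta=0.5)
```

Note that Eq. (1) weights every neighbour equally regardless of its quality, whereas Eq. (2) exposes a single weight $\beta$ that can be regulated.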

To ensure that this mixing actually improves performance, [1] establishes two theoretical results under the maximum-entropy RL (MERL) framework. Theorem 1 provides a lower bound on the performance gain from mixing, showing that improvement is guaranteed when the referential policy has positive advantage under MERL, after accounting for a distribution shift penalty quadratic in $\beta$ and an entropy bonus from the Jensen-Shannon divergence between policies. The key condition is that the MERL advantage of the referential policy must be positive:

$$A_{\pi}^{+}(\tilde{\pi}) := \mathbb{E}_{s \sim d^{\pi}, a \sim \tilde{\pi}}\left[A^{\pi}(s,a) + \alpha H(\tilde{\pi}(\cdot \mid s))\right] > 0 \qquad (3)$$

This means the referential policy must select better actions, maintain higher entropy (i.e. be more exploratory), or both.
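Condition (3) can be checked empirically with a Monte Carlo average. The sketch below assumes each sample already carries an advantage estimate $A^{\pi}(s,a)$ for a state drawn from $d^{\pi}$ and an action drawn from $\tilde{\pi}$, together with the referential policy's (discrete, for illustration) action probabilities at that state. The helper names and the sample values are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy H of a discrete distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def merl_advantage(samples, alpha):
    """Sample estimate of Eq. (3).

    samples: list of (advantage, ref_probs) pairs, where advantage estimates
    A^pi(s, a) with s ~ d^pi and a ~ pi_ref, and ref_probs is pi_ref(. | s).
    alpha: entropy temperature.
    """
    total = 0.0
    for advantage, ref_probs in samples:
        total += advantage + alpha * entropy(ref_probs)
    return total / len(samples)


# Illustrative samples only (assumed values).
samples = [(-0.1, [0.5, 0.5]), (0.05, [0.9, 0.1])]
a_plus = merl_advantage(samples, alpha=0.2)
should_mix = a_plus > 0.0  # mix only when condition (3) holds
```

With these values the plain advantage is negative on average, so the condition is met only because of the entropy term, which illustrates the "more exploratory" route to a positive MERL advantage.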

Theorem 2 converts this result to parameter space: given positive MERL advantage, an agent can guarantee improvement by updating to mixed parameters $\theta_{\text{mix}}$ provided the mixing metric satisfies $0 < \zeta < [2A_{\pi}^{+}(\tilde{\pi}) / C(\tilde{\theta}-\theta)^{\top}F(\theta)(\tilde{\theta}-\theta)]^{1/2}$, where $F(\theta)$ is the Fisher Information Matrix:

$$F(\theta) = \mathbb{E}_{s \sim d^{\pi}, a \sim \pi_{\theta}} \left[ \frac{\partial \log \pi_{\theta}(a \mid s)}{\partial \theta} \left( \frac{\partial \log \pi_{\theta}(a \mid s)}{\partial \theta} \right)^{\top} \right] \qquad (4)$$

The Fisher Information Matrix measures the local curvature of the policy space, i.e. how sensitive the policy distribution is to parameter changes, and serves to convert the distribution-space improvement bound into a parameter-space constraint on $\zeta$.
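A minimal sketch of how the bound on $\zeta$ might be evaluated is given below. It makes several assumptions that should be checked against [1]: the Fisher matrix is approximated by an average of outer products of sampled score vectors; the bound is read with $C(\tilde{\theta}-\theta)^{\top}F(\theta)(\tilde{\theta}-\theta)$ in the denominator; `c_const` stands in for the constant $C$, whose value is not given here; and the mixed parameters are taken to be the interpolation $\theta + \zeta(\tilde{\theta} - \theta)$. All function names are hypothetical.

```python
import math


def empirical_fisher(scores):
    """Average outer product of score vectors: a sample estimate of Eq. (4)."""
    dim = len(scores[0])
    fisher = [[0.0] * dim for _ in range(dim)]
    for g in scores:
        for r in range(dim):
            for c in range(dim):
                fisher[r][c] += g[r] * g[c] / len(scores)
    return fisher


def quadratic_form(delta, matrix):
    """Compute delta^T * matrix * delta."""
    n = len(delta)
    return sum(delta[r] * matrix[r][c] * delta[c]
               for r in range(n) for c in range(n))


def zeta_upper_bound(a_plus, c_const, theta, theta_ref, fisher):
    """Upper limit on zeta, assuming C multiplies the quadratic form."""
    if a_plus <= 0.0:
        return 0.0  # condition (3) fails, so no admissible zeta
    delta = [b - a for a, b in zip(theta, theta_ref)]
    q = quadratic_form(delta, fisher)
    if q <= 0.0:
        return math.inf  # no curvature along delta: bound is vacuous
    return math.sqrt(2.0 * a_plus / (c_const * q))


def mix_parameters(theta, theta_ref, zeta):
    """Assumed interpolation form of theta_mix (not confirmed by the text)."""
    return [a + zeta * (b - a) for a, b in zip(theta, theta_ref)]


# Illustrative values only (assumed).
fisher = empirical_fisher([[1.0, 0.0], [0.0, 2.0]])
bound = zeta_upper_bound(0.1, 2.0, [0.0, 0.0], [1.0, 1.0], fisher)
theta_mix = mix_parameters([0.0, 0.0], [1.0, 1.0], zeta=0.5 * bound)
```

The sketch builds the full matrix for clarity; for realistic network sizes one would presumably avoid materialising $F(\theta)$ and compute the quadratic form directly from the score vectors.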

A further practical contribution of [1] is that agents transmit parameter segments rather than full models, reducing communication overhead. The process repeats over multiple replicas, each time requesting different segments from different neighbours to ensure diversity in the aggregated referential policy.
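The segment-wise exchange can be sketched as follows, under loudly stated assumptions: parameters are flat lists split into contiguous segments, each segment of each replica is requested from a uniformly random neighbour, and the replicas are combined by a plain average. That average is only a placeholder for the adaptive aggregation $f(\cdot)$ of [1], and `segment_bounds` and `build_referential` are hypothetical names.

```python
import random


def segment_bounds(dim, num_segments):
    """Boundaries that split range(dim) into num_segments contiguous slices."""
    return [dim * k // num_segments for k in range(num_segments + 1)]


def build_referential(neighbour_thetas, num_segments, num_replicas, rng):
    """Assemble a referential parameter vector from neighbour segments.

    neighbour_thetas: dict mapping neighbour id -> flat parameter list.
    For every replica, each segment is taken from one randomly chosen
    neighbour; replicas are then averaged (placeholder for f in [1]).
    """
    neighbours = sorted(neighbour_thetas)
    dim = len(neighbour_thetas[neighbours[0]])
    bounds = segment_bounds(dim, num_segments)
    total = [0.0] * dim
    for _ in range(num_replicas):
        for start, stop in zip(bounds, bounds[1:]):
            donor = rng.choice(neighbours)  # one request per segment
            for idx in range(start, stop):
                total[idx] += neighbour_thetas[donor][idx]
    return [value / num_replicas for value in total]


# Illustrative values only (assumed).
rng = random.Random(0)
neighbour_thetas = {1: [1.0, 1.0, 1.0, 1.0], 2: [3.0, 3.0, 3.0, 3.0]}
theta_ref = build_referential(neighbour_thetas, num_segments=2,
                              num_replicas=3, rng=rng)
```

In this scheme each request carries roughly `1 / num_segments` of a full model, and `num_replicas` controls how many neighbours contribute to each segment on average.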

However, RSM-MASAC assumes synchronous communication rounds: all agents pause local training simultaneously for the mixing phase. This assumption limits applicability to systems with reliable, low-latency communication.

References

[1] X. Yu, R. Li, C. Liang, and Z. Zhao, "Communication-efficient soft actor-critic policy collaboration via regulated segment mixture," arXiv:2312.10123, 2024.