🪨 Asynchronous Federated Policy Gradient
Literature Review
Lan et al. [1] proposed AFedPG, the first asynchronous federated reinforcement learning framework with policy gradient updates and provable convergence guarantees.
- The framework addresses a fundamental limitation of synchronous federated RL:
- in heterogeneous multi-agent systems, synchronous aggregation forces the server to wait for the slowest agent at every global step, making the time complexity dependent on the worst-case agent .
- AFedPG eliminates this bottleneck by allowing the server to update the global policy as soon as any agent completes its local computation, while other agents continue sampling and computing in parallel.
- The key technical challenge this introduces is that agents compute gradients based on stale policies:
- the global model has been updated times since agent last received it,
- and in RL, unlike supervised FL, stale policies mean agents are collecting trajectories from a different distribution than the current global policy.
To handle this staleness, Lan et al. design a delay-adaptive lookahead technique specific to federated RL.
- At the -th global iteration with delay , instead of using the received model parameters directly, the agent computes a lookahead estimate , where is a delay-dependent learning rate.
- The agent then collects trajectories according to the policy and computes the updating direction.
- This mechanism is designed to cancel out second-order Hessian correction terms that arise uniquely in the RL setting due to policy-dependent data collection.
- On the server side, updates are normalised as to ensure controllable step sizes regardless of delay magnitude.
The convergence analysis establishes two results.
- For global convergence, AFedPG achieves sample complexity at each agent, matching the single-agent rate of with a linear speedup in the number of agents .
- The time complexity improves from synchronous FedPG's to
- ... the harmonic mean of agent computation times, which is always less than or equal to and significantly smaller when .
- The average delay is shown to be naturally bounded by the average concurrency without requiring any assumption on the maximum delay.
- The time complexity improves from synchronous FedPG's to
However, AFedPG assumes a centralised server, homogeneous agents running identical policy gradient algorithms, and identical environments across agents. Extending the delay-adaptive convergence analysis to decentralised aggregation with heterogeneous RL algorithms (e.g., SAC and PPO within the same federation), non-IID data across clusters, and off-policy learning remains an open problem .
References
[1] G. Lan, D.-J. Han, A. Hashemi, V. Aggarwal, and C. G. Brinton, "Asynchronous federated reinforcement learning with policy gradient updates: Algorithm design and convergence analysis," in Proc. Int. Conf. Learn. Representations (ICLR), 2025, arXiv:2404.08003.