What is a physics-informed neural network (PINN)?

A physics-informed neural network is a neural network whose loss function includes the residual of the governing differential equations (ODEs or PDEs), so that the model learns from both sensor data and the known physics of the system. Because the physics term constrains the solution between measurements, PINNs can generalize from sparse data where purely data-driven models tend to memorize.
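As a minimal sketch of such a loss (not Astraea's API), the snippet below combines a data misfit with the residual of an assumed toy ODE, $du/dt + k u = 0$. The network shape, the function names `mlp` and `pinn_loss`, the decay rate `k`, and the weighting `weight` are all illustrative assumptions; a real PINN would use automatic differentiation and a trained network.

```python
import numpy as np

def mlp(params, t):
    """One-hidden-layer tanh network u(t) and its exact derivative du/dt."""
    w1, b1, w2, b2 = params
    h = np.tanh(np.outer(t, w1) + b1)      # (N, H) hidden activations
    u = h @ w2 + b2                        # (N,) network output
    du_dt = ((1.0 - h**2) * w1) @ w2       # (N,) chain rule through tanh
    return u, du_dt

def pinn_loss(params, t_data, u_data, t_colloc, k=1.0, weight=1.0):
    """Data misfit plus mean squared residual of the assumed ODE du/dt + k*u = 0."""
    u_pred, _ = mlp(params, t_data)
    data_term = np.mean((u_pred - u_data) ** 2)
    u_c, du_c = mlp(params, t_colloc)
    physics_term = np.mean((du_c + k * u_c) ** 2)
    return data_term + weight * physics_term

rng = np.random.default_rng(0)
H = 8                                      # hidden width (arbitrary)
params = (rng.normal(size=H), rng.normal(size=H), rng.normal(size=H) / H, 0.0)
t_data = np.array([0.0, 0.5, 2.0])         # three sparse "sensor" samples
u_data = np.exp(-t_data)                   # exact solution for k = 1
t_colloc = np.linspace(0.0, 2.0, 50)       # points where only the physics is enforced
loss = pinn_loss(params, t_data, u_data, t_colloc)
print(f"untrained loss: {loss:.4f}")
```

The collocation points carry no measurements; they only penalize violations of the equation, which is what lets the physics fill the gaps between the sparse samples.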

What problems does Astraea Intelligence solve?

Astraea closes the sim-to-real gap in robotics. Simulators use approximated contact, linearized friction, and simplified dynamics, so policies trained in simulation fail on real hardware. Astraea builds physics-informed dynamics models from your governing equations and sparse sensor data, ready to drop into MPC, RL, or state-estimation pipelines.

Is Astraea open source?

Yes. The Astraea Core PINN engine and equation library are open source under the Apache 2.0 license. The Astraea Pro tier — automated architecture search, managed GPU training, and production deployment integrations — is commercial.

Which equations does Astraea support?

Astraea ships validated templates for Fossen 6-DOF underwater dynamics, Cosserat rods, Navier–Stokes, advection–diffusion, heat equation, linear elasticity, and reaction–diffusion. It also supports arbitrary custom ODE/PDE systems.

Which deployment targets are supported?

Astraea exports to CasADi for MPC, Drake, ROS 2, ONNX, and TorchScript. Trained surrogates run at 100+ Hz on GPU.

🏗️ Principles of Federated MARL

Literature Review

Federated Multi-Agent Reinforcement Learning (FMARL) [1] adapts the federated learning paradigm to multi-agent reinforcement learning, enabling distributed training across agents under privacy and communication constraints.

Each agent trains its model locally on-device and shares only model parameters, never raw data, for aggregation. This fosters collaborative learning while preserving data privacy.

The decision-making environment of each agent can be modeled as either:

a Markov Decision Process (MDP), or
a Decentralised Partially Observable Markov Decision Process (Dec-POMDP).

For the HFMARL architecture proposed in this work, the Dec-POMDP formulation is adopted at the cluster head level.

FMARL Problem Formulation and Reward Structures

For a system with $n$ agents, the FMARL decision-making environment is the tuple [1]:

\Gamma = (I, \{S_i\}_{i\in I}, \{A_i\}_{i\in I}, \{P_i\}_{i\in I}, \{R_i\}_{i\in I}, \gamma) \qquad (1)

where

$I$ is the set of $n$ agents,
$S_i$ is the private state space of agent $i$,
$A_i$ is its action set,
$P_i$ is the transition probability from state $s_i^t$ to $s_i^{t+1}$ after action $a_i^t$,
$R_i$ is its reward function, with $r_i^t = r_i(s_i^t, a_i^t)$ the reward received at time step $t$, and
$\gamma$ is the discount factor weighting long-term rewards.

At each time step $t$, agent $i$ observes its local state $s_i^t$, selects action $a_i^t$, receives reward $r_i^t$, and transitions to state $s_i^{t+1}$. Over $T$ time steps, a local dataset is formed:

D_i = \{s_i^{t}, a_i^{t}, r_i^{t}, s_i^{t+1}\}_{t=1}^{T} \qquad (2)

Each agent learns a local policy $\pi_i$ mapping states to actions, parameterised by $\theta_i$:

\pi_i(a_i^{t} \mid s_i^{t}; \theta_i) \qquad (3)

Reward structures in FMARL fall into three categories:

local independent objectives,
global collaborative objectives, and
hybrid objectives.

The hybrid formulation expresses the objective of agent $i$ as a weighted combination:

J(\theta_i) = \lambda J_i(\theta_i) + (1 - \lambda) J_{\text{global}}(\theta_1, \ldots, \theta_n) \qquad (4)

where

$J_i(\theta_i)$ is the expected cumulative individual reward,
$J_{\text{global}}(\theta_1, \ldots, \theta_n)$ is the collective reward, and
$\lambda \in [0, 1]$ controls the balance between individual and collective objectives.

The individual expected cumulative reward is:

J_i(\theta_i) = \mathbb{E}_{\pi_i} \left[ \sum_{t=0}^{\infty} \gamma^t r_i(s_i^t, a_i^t) \right] \qquad (5)

The collective reward is:

J_{\text{global}}(\theta_1, \ldots, \theta_n) = \mathbb{E}_{\pi} \left[ \sum_{t=0}^{\infty} \gamma^t r_{\text{global}}(s^t, a^t) \right] \qquad (6)

where $s^t = (s_1^t, \ldots, s_n^t)$ and $a^t = (a_1^t, \ldots, a_n^t)$ represent the joint states and actions. Each agent therefore seeks to optimise:

\theta_i^* = \arg \max_{\theta_i} \left( \lambda J_i(\theta_i) + (1 - \lambda) J_{\text{global}}(\theta_1, \ldots, \theta_n) \right) \qquad (7)

The choice of $\lambda$ determines the learning paradigm.

For independent learners such as IPPO ($\lambda = 1$), each agent optimises only $J_i(\theta_i)$, treating other agents as part of the environment dynamics.
For centralised training with decentralised execution (CTDE) methods such as MASAC, the hybrid objective is implemented through shared reward signals $r_i^t = \lambda r_i(s_i^t, a_i^t) + (1-\lambda) r_{\text{global}}(s^t, a^t)$, with critics accessing global information during training to enable coordination.
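The reward blending and the discounted sum in Eq. (5) can be sketched in a few lines. The function names `hybrid_rewards` and `discounted_return` and the numeric values are illustrative assumptions, and the return is computed over a finite episode rather than the infinite horizon of the formal definition.

```python
import numpy as np

def hybrid_rewards(local_rewards, global_reward, lam):
    """Blend each agent's own reward with a shared global reward."""
    local_rewards = np.asarray(local_rewards, dtype=float)
    return lam * local_rewards + (1.0 - lam) * global_reward

def discounted_return(rewards, gamma):
    """Finite-horizon estimate of sum_t gamma^t * r_t for one trajectory."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

local = [1.0, 0.0, -1.0]                        # one reward per agent at a single step
blended = hybrid_rewards(local, global_reward=2.0, lam=0.5)
print(blended)                                  # [1.5 1.  0.5]
independent = hybrid_rewards(local, 2.0, lam=1.0)   # lambda = 1 ignores the global term
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```

With `lam=1.0` the blended signal reduces to the agent's own reward, which matches the independent-learner case described above.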

Model Aggregation in FMARL

Each agent adapts its local policy to its environment and transmits its parameters (or, in some cases, gradients) for aggregation. With $\boldsymbol{\theta}_i^k$ denoting the policy parameters of agent $i$ after $k$ local updates and $F_{\text{agg}}$ the aggregation operator, federated aggregation integrates the local models as:

\boldsymbol{\theta}_{\text{agg}}^{k+1} = F_{\text{agg}}\left(\{\boldsymbol{\theta}_i^k\}_{i=1}^n \mid \mathcal{C}\right) \qquad (8)

The standard aggregation algorithm is FedAvg [2]:

\boldsymbol{\theta}_{\text{agg}}^{k+1} = \sum_{i=1}^n \alpha_i \boldsymbol{\theta}_i^k \qquad (9)

with weights proportional to local dataset sizes:

\alpha_i = \frac{|D_i|}{\sum_{j=1}^n |D_j|} \qquad (10)
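Equations (9) and (10) amount to a weighted average of the agents' parameter vectors. A minimal NumPy sketch follows, assuming each agent's parameters have been flattened into a vector of equal length; the function name `fedavg` and the sample values are illustrative.

```python
import numpy as np

def fedavg(local_params, dataset_sizes):
    """Weighted average of parameter vectors, weights proportional to |D_i|."""
    sizes = np.asarray(dataset_sizes, dtype=float)
    alphas = sizes / sizes.sum()                        # Eq. (10)
    stacked = np.stack([np.asarray(p, dtype=float) for p in local_params])
    return np.tensordot(alphas, stacked, axes=1)        # Eq. (9)

thetas = [[1.0, 2.0], [3.0, 4.0]]    # two agents, two parameters each
theta_agg = fedavg(thetas, dataset_sizes=[1, 3])
print(theta_agg)                     # [2.5 3.5]
```

The second agent holds three times as much data, so the aggregate sits three quarters of the way towards its parameters.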

The aggregation is subject to a set of constraints $\mathcal{C}$ :

\mathcal{C} = \{\mathcal{C}_{\text{comm}}, \mathcal{C}_{\text{privacy}}, \mathcal{C}_{\text{resource}}\} \qquad (11)

The communication constraint encompasses communication interval $\tau$, bandwidth usage $b$, and synchronisation scheme $\phi$:

\mathcal{C}_{\text{comm}} := \{(\tau, b, \phi) \mid \tau \leq \tau_{\max}, b \leq b_{\max}, \phi \in \Phi\} \qquad (12)

where $\Phi$ is the set of allowable synchronisation protocols.

The privacy constraint defines bounds on the privacy budget $\epsilon$ and perturbation mechanism $\mathcal{M}$:

\mathcal{C}_{\text{privacy}} := \{(\epsilon, \mathcal{M}) \mid \epsilon \leq \epsilon_{\max}, \mathcal{M} \in \mathbb{M}\} \qquad (13)

The resource constraint encompasses memory $m$, energy $e$, and processing power $c$:

\mathcal{C}_{\text{resource}} := \{(m, e, c) \mid m \leq m_{\max}, e \leq e_{\max}, c \leq c_{\max}\} \qquad (14)
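One way to make the constraint sets (12) to (14) concrete is a feasibility check run before each aggregation round. The sketch below is only an illustration: the class names, field names, protocol labels, and numeric limits are hypothetical, and units are left unspecified because the formulation does not fix them.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class AggregationLimits:
    tau_max: float              # maximum communication interval
    b_max: float                # maximum bandwidth usage
    sync_protocols: frozenset   # Phi, the allowable synchronisation schemes
    eps_max: float              # maximum privacy budget
    mechanisms: frozenset       # allowable perturbation mechanisms
    m_max: float                # memory limit
    e_max: float                # energy limit
    c_max: float                # processing limit

@dataclass(frozen=True)
class RoundConfig:
    tau: float
    b: float
    phi: str
    eps: float
    mechanism: str
    m: float
    e: float
    c: float

def check_constraints(cfg, lim):
    """Return which of the three constraint sets the round configuration satisfies."""
    return {
        "comm": cfg.tau <= lim.tau_max and cfg.b <= lim.b_max
                and cfg.phi in lim.sync_protocols,
        "privacy": cfg.eps <= lim.eps_max and cfg.mechanism in lim.mechanisms,
        "resource": cfg.m <= lim.m_max and cfg.e <= lim.e_max and cfg.c <= lim.c_max,
    }

limits = AggregationLimits(10.0, 5.0, frozenset({"sync", "async"}),
                           1.0, frozenset({"gaussian", "laplace"}),
                           512.0, 100.0, 4.0)
ok_cfg = RoundConfig(5.0, 2.0, "async", 0.5, "gaussian", 256.0, 40.0, 2.0)
bad_cfg = replace(ok_cfg, eps=2.0)      # exceeds the privacy budget
print(check_constraints(ok_cfg, limits))
print(check_constraints(bad_cfg, limits))
```

Returning one flag per constraint set, rather than a single boolean, makes it visible which budget blocked a round.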

FMARL Architecture Designs

The HFMARL architecture employs both centralised and decentralised aggregation in a hierarchical structure.

Within each cluster, a centralised client-server model is used:

the cluster head aggregates updates from follower agents.

Between clusters, a decentralised peer-to-peer model is used:

cluster heads exchange parameters directly with neighbouring cluster heads without a central entity.
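The two levels can be sketched as two averaging steps. This is an illustrative reading, not the HFMARL implementation: it assumes FedAvg-style weighting inside each cluster and uniform weights between neighbouring heads, and the names `cluster_average`, `peer_average`, and the cluster labels are hypothetical.

```python
import numpy as np

def cluster_average(follower_params, sizes):
    """Intra-cluster step: the head averages follower parameters, weighted by data size."""
    return np.average(np.stack(follower_params), axis=0,
                      weights=np.asarray(sizes, dtype=float))

def peer_average(head_params, neighbours):
    """Inter-cluster step: each head averages its model with its neighbours' models."""
    updated = {}
    for head, theta in head_params.items():
        group = [theta] + [head_params[j] for j in neighbours[head]]
        updated[head] = np.mean(group, axis=0)   # uniform weights (assumption)
    return updated

heads = {
    "A": cluster_average([np.array([0.0, 0.0]), np.array([2.0, 2.0])], sizes=[1, 1]),
    "B": cluster_average([np.array([4.0, 4.0])], sizes=[1]),
}
neighbours = {"A": ["B"], "B": ["A"]}            # peer-to-peer links, no central server
heads_after = peer_average(heads, neighbours)
print(heads_after["A"], heads_after["B"])        # [2.5 2.5] [2.5 2.5]
```

Every head reads from the pre-exchange models in `head_params`, so the result does not depend on the order in which heads are processed.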

Horizontal federated learning applies when agents share the same feature space, that is, identical state and action spaces, but operate in distinct environments.

Formally, let $G$ be the set of all environments, containing $n$ environments $\{E_i\}_{i=1}^n$, one per agent. Then:

S_i = S_j, A_i = A_j, E_i \ne E_j, \forall i, j \in \{1, 2, \ldots, n\}, E_i, E_j \in G, i \ne j \qquad (15)

Since all tasks are similar in nature, transfer learning opportunities arise across clusters. FMARL addresses data privacy concerns by avoiding the direct sharing of raw data and supporting different encryption methods within the system.

In terms of synchronisation, a semi-synchronous approach is adopted.

Intra-cluster communication from follower agents to the cluster head operates asynchronously, accommodating heterogeneous agent speeds and intermittent connectivity.
Inter-cluster communication between cluster heads operates synchronously when possible, providing coordination advantages for global model updates.
This accounts for varying communication conditions across agents while maintaining coherent high-level coordination.

RL Algorithm Selection

For the heterogeneous HFMARL architecture, cluster heads were initially designed to run MASAC and follower agents to run IPPO, both actor-critic methods suited to continuous control in complex environments. Actor-critic algorithms provide the exploration-exploitation balance necessary for tasks requiring safe path planning while adapting to dynamic conditions, and support the continuous action spaces needed for fine-grained locomotion control.

Privacy Protection — Differential Privacy

Differential privacy (DP) [1] provides a flexible privacy mechanism suitable for large-scale distributed systems.

DP works by injecting calibrated noise, typically drawn from the Laplace or Gaussian distribution, into model gradients before they are transmitted for aggregation.
This preserves the statistical properties of the data while mitigating the risk of reverse engineering.

The privacy budget $\epsilon$ controls the trade-off between privacy and model utility: a smaller $\epsilon$ provides stronger privacy guarantees at the cost of noisier gradients, potentially degrading learning performance.

In HFMARL, the privacy budget can be dynamically adjusted based on the stage of learning, sensitivity of the data, or the specific privacy requirements of each aggregation round.
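As an illustration of the noise-injection step, the sketch below clips a gradient and adds Gaussian noise using the classical calibration $\sigma = C\sqrt{2\ln(1.25/\delta)}/\epsilon$, which is valid for $\epsilon < 1$. The clipping norm $C$ (treated here as the L2 sensitivity), the failure probability $\delta$, and the function name `privatize_gradient` are assumptions; the text above does not specify which mechanism or calibration HFMARL uses.

```python
import numpy as np

def privatize_gradient(grad, clip_norm, epsilon, delta, rng):
    """Clip the gradient to L2 norm clip_norm, then add calibrated Gaussian noise."""
    grad = np.asarray(grad, dtype=float)
    norm = np.linalg.norm(grad)
    clipped = grad * min(1.0, clip_norm / max(norm, 1e-12))
    # Classical Gaussian-mechanism calibration (assumes epsilon < 1).
    sigma = clip_norm * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    noisy = clipped + rng.normal(0.0, sigma, size=grad.shape)
    return noisy, sigma

rng = np.random.default_rng(0)
grad = np.array([3.0, 4.0])                      # L2 norm 5, clipped down to 1
noisy_strict, sigma_strict = privatize_gradient(grad, 1.0, epsilon=0.25, delta=1e-5, rng=rng)
noisy_loose, sigma_loose = privatize_gradient(grad, 1.0, epsilon=0.5, delta=1e-5, rng=rng)
print(f"sigma at eps=0.25: {sigma_strict:.2f}, at eps=0.5: {sigma_loose:.2f}")
```

Halving $\epsilon$ doubles the noise standard deviation, which is the privacy-utility trade-off described above in numeric form.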

Distributed Optimisation

For the HFMARL framework, parameter sharing in segments is adopted as the communication method, following the approach of [3]. This reduces bandwidth requirements compared to transmitting full model parameters or gradients, which is particularly important in communication-constrained environments.
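To show why segment-wise sharing saves bandwidth, here is a generic sketch in which a flat parameter vector is split into equal segments and only one segment is exchanged per round. It is not the algorithm of [3]: the round-robin segment choice, the mixing weight `mix`, and the function names are all assumptions made for illustration.

```python
import numpy as np

def segment_slices(dim, num_segments):
    """Split indices 0..dim into contiguous, near-equal segments."""
    bounds = np.linspace(0, dim, num_segments + 1).astype(int)
    return [slice(bounds[k], bounds[k + 1]) for k in range(num_segments)]

def mix_segment(own, peer_segment, sl, mix=0.5):
    """Blend a received segment into the local parameters, leaving the rest untouched."""
    updated = own.copy()
    updated[sl] = (1.0 - mix) * own[sl] + mix * peer_segment
    return updated

own = np.zeros(6)
peer = np.ones(6)
slices = segment_slices(dim=6, num_segments=3)
round_idx = 1
sl = slices[round_idx % len(slices)]      # round-robin choice (assumption)
payload = peer[sl]                        # only this segment is transmitted
mixed = mix_segment(own, payload, sl)
print(mixed)                              # [0.  0.  0.5 0.5 0.  0. ]
print(payload.size / peer.size)           # fraction of the model sent this round
```

With three segments, each round transmits one third of the parameters instead of the full vector.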

Convergence Considerations

Convergence in FMARL can be challenging due to non-IID data distributions, delayed synchronisation, noisy gradient updates, and the inherently non-stationary dynamics of multi-agent systems. Xu et al. [4] derive a gradient convergence bound for federated multi-agent reinforcement learning. For a system of $n$ agents, where each agent $i$ has local policy parameters $\theta_i \in \mathbb{R}^d$ and performs $T$ local update steps between aggregation rounds, and where $\theta^{(r)}$ denotes the aggregated parameters at round $r$, the expected squared gradient norm averaged over $R$ rounds satisfies, for a suitably chosen step size $\eta$:

\frac{1}{R} \sum_{r=0}^{R-1} \mathbb{E}\left[\|\nabla J(\theta^{(r)})\|^2\right] \le \mathcal{O}\left(\frac{1}{\sqrt{nRT}} + \frac{T \zeta^2}{R}\right) \qquad (16)

under the following assumptions:

the global objective $J(\theta)$ is $L$-smooth, i.e. $\|\nabla J(\theta) - \nabla J(\theta')\| \le L\|\theta - \theta'\|$,
local gradient noise is bounded as $\mathbb{E}[\|\nabla J_i(\theta) - g_i(\theta)\|^2] \le \sigma^2$, where $g_i(\theta)$ is the stochastic gradient computed by agent $i$, and
gradient heterogeneity is bounded as $\|\nabla J_i(\theta) - \nabla J(\theta)\| \le \zeta$ for all agents.

This bound reveals a fundamental trade-off: increasing the number of local updates $T$ reduces communication frequency but increases the divergence term $T\zeta^2/R$, which grows with the heterogeneity of local environments.
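The trade-off can be made tangible by evaluating the two terms inside the $\mathcal{O}(\cdot)$ for a few values of $T$. The sketch below drops the hidden constants, so the numbers only indicate the shape of the trade-off; the values of $n$, $R$, and $\zeta$ and the name `bound_terms` are arbitrary illustrative choices.

```python
import numpy as np

def bound_terms(n, R, T, zeta):
    """Return the two terms inside the O(.) of the bound, constants dropped."""
    sampling = 1.0 / np.sqrt(n * R * T)   # shrinks as agents, rounds, or local steps grow
    drift = T * zeta**2 / R               # grows with local steps and heterogeneity
    return sampling, drift

n, R, zeta = 10, 1000, 0.5                # illustrative values only
candidates = [1, 2, 4, 8, 16, 32]
totals = {T: sum(bound_terms(n, R, T, zeta)) for T in candidates}
best_T = min(totals, key=totals.get)
for T in candidates:
    s, d = bound_terms(n, R, T, zeta)
    print(f"T={T:2d}  sampling={s:.5f}  drift={d:.5f}  total={s + d:.5f}")
print("smallest total among candidates at T =", best_T)
```

For small $T$ the first term dominates and more local steps help; beyond an intermediate value the drift term takes over, and that value moves lower as $\zeta$ grows.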

Balancing communication efficiency against model divergence, particularly in systems with limited bandwidth and latency sensitivity, motivates the adaptive federated aggregation strategies adopted in the HFMARL framework.

References

[1] Y. Jing, B. Guo, N. Li, R. Xu, and Z. Yu, "Federated multi-agent reinforcement learning: A comprehensive survey of methods, applications and challenges," Expert Systems with Applications, 2025. doi:10.1016/j.eswa.2025.128729.

[2] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in Proc. Int. Conf. Artif. Intell. Statist. (AISTATS), 2017, pp. 1273–1282.

[3] X. Yu, R. Li, C. Liang, and Z. Zhao, "Communication-efficient soft actor-critic policy collaboration via regulated segment mixture," arXiv:2312.10123, 2024.

[4] X. Xu, R. Li, Z. Zhao, and H. Zhang, "The gradient convergence bound of federated multi-agent reinforcement learning with efficient communication," IEEE Trans. Wireless Commun., vol. 23, no. 1, pp. 507–528, Jan. 2024.
