Buffer#

Single-Agent Buffer#

class safepo.common.buffer.VectorizedOnPolicyBuffer(obs_space, act_space, size: int, gamma: float = 0.99, lam: float = 0.95, lam_c: float = 0.95, standardized_adv_r: bool = True, standardized_adv_c: bool = True, device: device = 'cpu', num_envs: int = 1)#

Bases: object

A buffer for storing vectorized on-policy data for reinforcement learning.

Parameters:
  • obs_space (gymnasium.Space) – The observation space.

  • act_space (gymnasium.Space) – The action space.

  • size (int) – The maximum size of the buffer.

  • gamma (float, optional) – The discount factor for rewards. Defaults to 0.99.

  • lam (float, optional) – The lambda parameter for GAE computation. Defaults to 0.95.

  • lam_c (float, optional) – The lambda parameter for cost GAE computation. Defaults to 0.95.

  • standardized_adv_r (bool, optional) – Whether to standardize the reward advantages. Defaults to True.

  • standardized_adv_c (bool, optional) – Whether to standardize the cost advantages. Defaults to True.

  • device (torch.device, optional) – The device to store tensors on. Defaults to “cpu”.

  • num_envs (int, optional) – The number of parallel environments. Defaults to 1.
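
For example, the buffer might be constructed as in the sketch below (the gymnasium Box spaces and the sizes are illustrative assumptions; only the constructor signature above is taken from this page):

    import gymnasium as gym
    import torch

    from safepo.common.buffer import VectorizedOnPolicyBuffer

    # Hypothetical spaces and sizes; any gymnasium.Space pair from the
    # environment can be passed in the same way.
    obs_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(8,))
    act_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(2,))

    buffer = VectorizedOnPolicyBuffer(
        obs_space=obs_space,
        act_space=act_space,
        size=1000,                  # steps stored per environment
        gamma=0.99,
        lam=0.95,
        lam_c=0.95,
        device=torch.device("cpu"),
        num_envs=4,
    )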

finish_path(last_value_r: torch.Tensor | None = None, last_value_c: torch.Tensor | None = None, idx: int = 0) → None#

Finalize the trajectory path and compute advantages and value targets.

Parameters:
  • last_value_r (torch.Tensor, optional) – The last value estimate for rewards. Defaults to None.

  • last_value_c (torch.Tensor, optional) – The last value estimate for costs. Defaults to None.

  • idx (int, optional) – Index of the environment. Defaults to 0.
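
A hedged sketch of how finish_path might be called at the end of a rollout, once per environment; num_envs, terminated, obs, reward_critic, and cost_critic are placeholders from a hypothetical training loop, not part of the documented API:

    for env_idx in range(num_envs):  # one call per parallel environment
        if terminated[env_idx]:
            # Episode ended normally: no bootstrap value is needed.
            buffer.finish_path(idx=env_idx)
        else:
            # Rollout was cut off mid-episode: bootstrap with the critics'
            # value estimates for the last observation.
            buffer.finish_path(
                last_value_r=reward_critic(obs[env_idx]),
                last_value_c=cost_critic(obs[env_idx]),
                idx=env_idx,
            )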

get() → dict[str, torch.Tensor]#

Retrieve collected data from the buffer.

Returns:

dict[str, torch.Tensor] – A dictionary containing collected data tensors.
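
For example (the exact dictionary keys depend on what was stored and on the buffer's internals, so the sketch below only inspects them):

    data = buffer.get()
    for key, tensor in data.items():
        print(key, tuple(tensor.shape))
    # The returned tensors are then fed to the policy and critic updates.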

store(**data: Tensor) → None#

Store vectorized data into the buffer.

Parameters:

**data – Keyword arguments specifying data tensors to be stored.
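
A hedged sketch of storing one vectorized step; since store accepts arbitrary keyword tensors, the keyword names below (obs, act, reward, cost, value_r, value_c, log_prob) are illustrative assumptions:

    # One environment step for all parallel environments at once; each
    # tensor's leading dimension is num_envs.
    buffer.store(
        obs=obs,
        act=action,
        reward=reward,
        cost=cost,
        value_r=value_r,      # reward-critic estimate
        value_c=value_c,      # cost-critic estimate
        log_prob=log_prob,
    )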

Multi-Agent Buffer#

class safepo.common.buffer.SeparatedReplayBuffer(config, obs_space, share_obs_space, act_space)#

Bases: object

Buffer for storing and managing data collected during training.

Parameters:
  • config (dict) – Configuration parameters for the replay buffer.

  • obs_space – Observation space of the environment.

  • share_obs_space – Shared observation space of the environment (if applicable).

  • act_space – Action space of the environment.
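
A minimal construction sketch; the configuration keys shown are assumptions about what the buffer reads from the algorithm's config file, and the spaces are arbitrary gymnasium Box spaces:

    import gymnasium as gym

    from safepo.common.buffer import SeparatedReplayBuffer

    # Hypothetical configuration; the exact keys are defined by the
    # multi-agent algorithm's configuration files.
    config = {
        "episode_length": 200,
        "n_rollout_threads": 8,
        "hidden_size": 64,
        "recurrent_N": 1,
        "gamma": 0.99,
        "gae_lambda": 0.95,
        "use_gae": True,
        "use_popart": False,
        "use_valuenorm": True,
        "use_proper_time_limits": False,
    }

    obs_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(12,))
    share_obs_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(24,))
    act_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(4,))

    # One buffer is typically created per agent in the multi-agent setting.
    agent_buffer = SeparatedReplayBuffer(config, obs_space, share_obs_space, act_space)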

after_update()#

chooseafter_update()#

compute_cost_returns(next_cost, value_normalizer=None)#

compute_returns(next_value, value_normalizer=None)#

Computes the discounted cumulative returns for each time step.

Parameters:
  • next_value – Estimated value of the next time step.

  • value_normalizer – Normalizer for value predictions (optional).

Note

This method calculates the discounted cumulative returns for each time step, using GAE or plain discounted returns depending on the buffer configuration, and applies value normalization when a normalizer is provided.

Returns:

None
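
A hedged usage sketch; critic, cost_critic, last_share_obs, and the normalizers are placeholders created elsewhere in a hypothetical training loop:

    import torch

    # After the rollout, bootstrap the returns with the critics' estimates
    # of the final shared observation.
    with torch.no_grad():
        next_value = critic(last_share_obs)            # reward-critic estimate
        next_cost_value = cost_critic(last_share_obs)  # cost-critic estimate

    agent_buffer.compute_returns(next_value, value_normalizer=value_normalizer)
    agent_buffer.compute_cost_returns(next_cost_value, value_normalizer=cost_value_normalizer)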

feed_forward_generator(advantages, num_mini_batch=None, mini_batch_size=None, cost_adv=None)#

insert(share_obs, obs, rnn_states, rnn_states_critic, actions, action_log_probs, value_preds, rewards, masks, bad_masks=None, active_masks=None, available_actions=None, costs=None, cost_preds=None, rnn_states_cost=None, done_episodes_costs_aver=None, aver_episode_costs=0)#

Inserts data from a single time step into the replay buffer.

Parameters:
  • share_obs – Shared observations for the time step.

  • obs – Observations for the time step.

  • rnn_states – RNN states for the main network.

  • rnn_states_critic – RNN states for the critic network.

  • actions – Actions taken at the time step.

  • action_log_probs – Log probabilities of the actions.

  • value_preds – Value predictions at the time step.

  • rewards – Rewards received at the time step.

  • masks – Masks indicating whether the episode is done.

  • bad_masks – Masks indicating bad episodes (optional).

  • active_masks – Masks indicating active episodes (optional).

  • available_actions – Available actions for discrete action spaces (optional).

  • costs – Costs associated with the time step (optional).

  • cost_preds – Cost predictions at the time step (optional).

  • rnn_states_cost – RNN states for cost prediction (optional).

  • done_episodes_costs_aver – Average costs of done episodes (optional).

  • aver_episode_costs – Average episode costs (optional).

Note

This method inserts data for a single time step into the replay buffer and updates the internal step counter.
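
A hedged sketch of one collection step; every array below is a placeholder produced by a hypothetical rollout loop, and only a subset of the optional arguments from the signature above is shown:

    import numpy as np

    # masks is 0.0 where an episode ended at this step and 1.0 otherwise.
    masks = 1.0 - dones.astype(np.float32).reshape(-1, 1)

    agent_buffer.insert(
        share_obs,
        obs,
        rnn_states,
        rnn_states_critic,
        actions,
        action_log_probs,
        value_preds,
        rewards,
        masks,
        costs=costs,
        cost_preds=cost_preds,
        rnn_states_cost=rnn_states_cost,
    )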

return_aver_insert(aver_episode_costs)#

update_factor(factor)#