Buffer#
Single-Agent Buffer#
- class safepo.common.buffer.VectorizedOnPolicyBuffer(obs_space, act_space, size: int, gamma: float = 0.99, lam: float = 0.95, lam_c: float = 0.95, standardized_adv_r: bool = True, standardized_adv_c: bool = True, device: device = 'cpu', num_envs: int = 1)#
Bases: object
A buffer for storing vectorized on-policy data for reinforcement learning.
- Parameters:
obs_space (gymnasium.Space) – The observation space.
act_space (gymnasium.Space) – The action space.
size (int) – The maximum size of the buffer.
gamma (float, optional) – The discount factor for rewards. Defaults to 0.99.
lam (float, optional) – The lambda parameter for GAE computation. Defaults to 0.95.
lam_c (float, optional) – The lambda parameter for cost GAE computation. Defaults to 0.95.
standardized_adv_r (bool, optional) – Whether to standardize the reward advantages. Defaults to True.
standardized_adv_c (bool, optional) – Whether to standardize the cost advantages. Defaults to True.
device (torch.device, optional) – The device to store tensors on. Defaults to 'cpu'.
num_envs (int, optional) – The number of parallel environments. Defaults to 1.
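The standardization controlled by standardized_adv_r and standardized_adv_c can be illustrated with a minimal pure-Python sketch. The helper name standardize, the eps term, and the use of the population standard deviation are assumptions made for illustration; the buffer itself operates on torch tensors and may differ in these details.

```python
import math


def standardize(advantages, eps=1e-8):
    """Shift advantages to zero mean and scale them to unit standard deviation.

    Hypothetical helper: `eps` guards against division by zero, and the
    population standard deviation (dividing by n) is an assumption.
    """
    n = len(advantages)
    mean = sum(advantages) / n
    var = sum((a - mean) ** 2 for a in advantages) / n
    std = math.sqrt(var)
    return [(a - mean) / (std + eps) for a in advantages]


adv = [1.0, 2.0, 3.0, 4.0]
normed = standardize(adv)
```

After this transformation the advantages have approximately zero mean and unit variance, which keeps the scale of the policy-gradient signal comparable across batches.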
- finish_path(last_value_r: torch.Tensor | None = None, last_value_c: torch.Tensor | None = None, idx: int = 0) → None#
Finalize the trajectory path and compute advantages and value targets.
- Parameters:
last_value_r (torch.Tensor, optional) – The last value estimate for rewards. Defaults to None.
last_value_c (torch.Tensor, optional) – The last value estimate for costs. Defaults to None.
idx (int, optional) – Index of the parallel environment whose trajectory is being finalized. Defaults to 0.
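As a rough illustration of what computing advantages and value targets involves, the sketch below implements generalized advantage estimation (GAE) for a single trajectory in pure Python. The function name gae_advantages and the choice of advantage plus value as the value target are assumptions; the buffer works on torch tensors and may build its targets differently. The same recursion applies to costs when lam_c and the cost values are used instead.

```python
def gae_advantages(rewards, values, last_value=0.0, gamma=0.99, lam=0.95):
    """Compute GAE advantages and value targets for one trajectory.

    Hypothetical helper for illustration only.
    rewards[t] and values[t] belong to step t; last_value bootstraps the
    value of the state that follows the final stored step.
    """
    values_ext = list(values) + [last_value]
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the accumulated future term.
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values_ext[t + 1] - values_ext[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    # One common choice of value target: advantage plus the value estimate.
    targets = [a + v for a, v in zip(advantages, values)]
    return advantages, targets


adv, tgt = gae_advantages([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=1.0)
# adv == [3.0, 2.0, 1.0] and tgt == [3.0, 2.0, 1.0]
```

With gamma and lam both set to 1 and zero value estimates, the advantages reduce to the undiscounted reward-to-go, which makes the recursion easy to check by hand.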
- get() → dict[str, torch.Tensor]#
Retrieve collected data from the buffer.
- Returns:
dict[str, torch.Tensor] – A dictionary containing collected data tensors.
- store(**data: Tensor) → None#
Store vectorized data into the buffer.
- Parameters:
**data – Keyword arguments specifying data tensors to be stored.
Multi-Agent Buffer#
- class safepo.common.buffer.SeparatedReplayBuffer(config, obs_space, share_obs_space, act_space)#
Bases: object
Buffer for storing and managing data collected during training.
- Parameters:
config (dict) – Configuration parameters for the replay buffer.
obs_space – Observation space of the environment.
share_obs_space – Shared observation space of the environment (if applicable).
act_space – Action space of the environment.
- after_update()#
- chooseafter_update()#
- compute_cost_returns(next_cost, value_normalizer=None)#
- compute_returns(next_value, value_normalizer=None)#
Computes the discounted cumulative returns for each time step.
- Parameters:
next_value – Estimated value of the next time step.
value_normalizer – Normalizer for value predictions (optional).
Note
This method computes the discounted cumulative returns for each time step, using either GAE or plain discounting depending on the buffer settings, and applies value normalization when a value_normalizer is given.
- Returns:
None
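The plain (non-GAE) variant of the return computation can be sketched in pure Python as follows. The function name discounted_returns and the convention that masks has one more entry than rewards, with masks[t + 1] equal to 0 when the episode ended after step t, are assumptions made for illustration; the GAE variant and value normalization are not covered here.

```python
def discounted_returns(rewards, masks, next_value, gamma=0.99):
    """Compute masked discounted returns for one rollout.

    Hypothetical helper for illustration only.
    masks has len(rewards) + 1 entries; masks[t + 1] == 0 cuts the
    bootstrap so returns do not leak across episode boundaries.
    """
    returns = [0.0] * (len(rewards) + 1)
    returns[-1] = next_value
    for t in reversed(range(len(rewards))):
        returns[t] = rewards[t] + gamma * returns[t + 1] * masks[t + 1]
    return returns[:-1]


full = discounted_returns([1.0, 1.0, 1.0], [1, 1, 1, 1], next_value=10.0, gamma=0.5)
# full == [3.0, 4.0, 6.0]
cut = discounted_returns([1.0, 1.0, 1.0], [1, 1, 0, 1], next_value=10.0, gamma=0.5)
# cut == [1.5, 1.0, 6.0]
```

In the second call the zero mask stops the return at step 1 from including anything that happened after the episode boundary.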
- feed_forward_generator(advantages, num_mini_batch=None, mini_batch_size=None, cost_adv=None)#
- insert(share_obs, obs, rnn_states, rnn_states_critic, actions, action_log_probs, value_preds, rewards, masks, bad_masks=None, active_masks=None, available_actions=None, costs=None, cost_preds=None, rnn_states_cost=None, done_episodes_costs_aver=None, aver_episode_costs=0)#
Inserts data from a single time step into the replay buffer.
- Parameters:
share_obs – Shared observations for the time step.
obs – Observations for the time step.
rnn_states – RNN states for the main network.
rnn_states_critic – RNN states for the critic network.
actions – Actions taken at the time step.
action_log_probs – Log probabilities of the actions.
value_preds – Value predictions at the time step.
rewards – Rewards received at the time step.
masks – Masks indicating whether the episode is done.
bad_masks – Masks indicating bad episodes (optional).
active_masks – Masks indicating active episodes (optional).
available_actions – Available actions for discrete action spaces (optional).
costs – Costs associated with the time step (optional).
cost_preds – Cost predictions at the time step (optional).
rnn_states_cost – RNN states for cost prediction (optional).
done_episodes_costs_aver – Average costs of done episodes (optional).
aver_episode_costs – Average episode costs (optional).
Note
This method inserts data for a single time step into the replay buffer and updates the internal step counter.
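The pattern of writing one time step and then advancing an internal step counter can be sketched with a toy class. TinyRolloutStorage, its fields, and the wrap-around of the counter at episode_length are hypothetical and only illustrate the idea; the real buffer stores many more fields as arrays.

```python
class TinyRolloutStorage:
    """Toy storage that keeps rewards and masks for a fixed-length rollout.

    Hypothetical class for illustration only.
    """

    def __init__(self, episode_length):
        self.episode_length = episode_length
        self.rewards = [None] * episode_length
        # One extra slot: masks[t + 1] describes the transition out of step t.
        self.masks = [None] * (episode_length + 1)
        self.step = 0

    def insert(self, reward, mask):
        """Write one time step, then advance the step counter."""
        self.rewards[self.step] = reward
        self.masks[self.step + 1] = mask
        # Assumption: the counter wraps so the next rollout overwrites old data.
        self.step = (self.step + 1) % self.episode_length


storage = TinyRolloutStorage(episode_length=3)
for reward, mask in [(0.1, 1), (0.2, 1), (0.3, 0)]:
    storage.insert(reward, mask)
```

After three inserts the counter is back at 0, so the next rollout starts writing from the first slot again.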
- return_aver_insert(aver_episode_costs)#
- update_factor(factor)#