First Order Projection Methods#

Experiment Results#

Implementation Details#

Note

All experiments are run for a total of 1e7 steps, except for the Doggo agent, where 1e8 steps are used. This setting is the same as in Safety-Gym.

Environment Wrapper#

In our experiments, we found that the following hyperparameters have a noticeable influence on algorithm performance:

  • obs_normalize, which controls the normalization of observations.

  • reward_normalize, which controls the normalization of rewards.

  • cost_normalize, which controls the normalization of costs.

Across our trials, setting obs_normalize=True consistently yielded better results.

Note

The same does not hold uniformly for the reward_normalize parameter: reward_normalize=True does not always outperform reward_normalize=False, a trend that is particularly pronounced in the SafetyHopperVelocity-v1 and SafetyWalker2dVelocity-v1 environments.

Therefore, we provide an environment wrapper to control the normalization of observations, rewards, and costs:

import safety_gymnasium

# NOTE: the wrapper import path below is an assumption; see SafePO's source
# for the actual module that defines the Safe* wrappers.
from safepo.common.wrappers import (
    SafeAutoResetWrapper, SafeNormalizeObservation, SafeRescaleAction, SafeUnsqueeze
)

def make_env(env_id, seed):
    env = safety_gymnasium.make(env_id)
    env.reset(seed=seed)
    obs_space = env.observation_space
    act_space = env.action_space
    env = SafeAutoResetWrapper(env)
    env = SafeRescaleAction(env, -1.0, 1.0)
    env = SafeNormalizeObservation(env)
    env = SafeUnsqueeze(env)
    return env, obs_space, act_space
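
The normalization switches discussed above map naturally onto arguments of this factory. Below is a minimal sketch assuming a hypothetical obs_normalize flag and function name; only the observation-normalization wrapper appears in the snippet above, and reward or cost normalization would be toggled the same way with the corresponding wrappers:

def make_env_with_flags(env_id, seed, obs_normalize=True):
    # Hypothetical variant of make_env with an explicit normalization switch.
    env = safety_gymnasium.make(env_id)
    env.reset(seed=seed)
    obs_space = env.observation_space
    act_space = env.action_space
    env = SafeAutoResetWrapper(env)
    env = SafeRescaleAction(env, -1.0, 1.0)
    if obs_normalize:
        # obs_normalize=True performed best in our experiments (see above).
        env = SafeNormalizeObservation(env)
    env = SafeUnsqueeze(env)
    return env, obs_space, act_space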

Lagrangian Multiplier#

Lagrangian-based algorithms use a Lagrangian multiplier to control the safety constraint. The Lagrangian multiplier is an integrated part of SafePO.

Some key points:

  • The Lagrangian multiplier is updated with the Adam optimizer to keep its updates smooth.

  • The Lagrangian multiplier is updated every epoch based on the cost violation of the current episodes.

Key implementation:

from safepo.common.lagrange import Lagrange

# setup lagrangian multiplier
COST_LIMIT = 25.0
LAGRANGIAN_MULTIPLIER_INIT = 0.001
LAGRANGIAN_MULTIPLIER_LR = 0.035
lagrange = Lagrange(
    cost_limit=COST_LIMIT,
    lagrangian_multiplier_init=LAGRANGIAN_MULTIPLIER_INIT,
    lagrangian_multiplier_lr=LAGRANGIAN_MULTIPLIER_LR,
)

# update lagrangian multiplier
# suppose ep_cost is 50.0
ep_cost = 50.0
lagrange.update_lagrange_multiplier(ep_cost)

# use lagrangian multiplier to control the advantage
advantage = data["adv_r"] - lagrange.lagrangian_multiplier * data["adv_c"]
advantage /= (lagrange.lagrangian_multiplier + 1)
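
Conceptually, the multiplier update is a gradient step on the dual variable. Below is a minimal sketch of such an update, assuming a plain Adam ascent on the cost violation; it is illustrative only and not SafePO's exact Lagrange class, whose details may differ:

import torch

class SimpleLagrange:
    """Sketch of an Adam-based Lagrangian multiplier update (illustrative only)."""

    def __init__(self, cost_limit, lagrangian_multiplier_init=0.001, lagrangian_multiplier_lr=0.035):
        self.cost_limit = cost_limit
        # Learnable scalar multiplier; Adam smooths its updates.
        self._multiplier = torch.nn.Parameter(torch.tensor(lagrangian_multiplier_init))
        self._optimizer = torch.optim.Adam([self._multiplier], lr=lagrangian_multiplier_lr)

    @property
    def lagrangian_multiplier(self):
        # Keep the multiplier non-negative when it is used in the policy loss.
        return self._multiplier.clamp(min=0.0).item()

    def update_lagrange_multiplier(self, ep_cost):
        # Ascend on multiplier * (ep_cost - cost_limit): the negated loss below
        # makes Adam increase the multiplier whenever ep_cost exceeds the limit.
        loss = -self._multiplier * (ep_cost - self.cost_limit)
        self._optimizer.zero_grad()
        loss.backward()
        self._optimizer.step()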

Please refer to Lagrangian Multiplier for more details.

Projection Implementation#

The key idea of CUP and FOCOPS is to project the policy back onto the safe set. A more detailed theoretical analysis can be found here.

Below we show how SafePO implements the two-stage projection:

CUP first makes a PPO update to improve the policy's reward, then projects the policy back onto the safe set. We focus on the projection part.

  • Get the cost advantage from the buffer and prepare the training data.

from torch.utils.data import DataLoader, TensorDataset

# The cost advantage drives the projection step.
advantage = data["adv_c"]
dataloader = DataLoader(
    dataset=TensorDataset(
        data["obs"], data["act"], data["log_prob"], advantage, old_mean, old_std
    ),
    batch_size=64,
    shuffle=True,
)
  • Update the policy using the cost advantage and the KL divergence.

coef = (1 - args.cup_gamma * args.cup_lambda) / (1 - args.cup_gamma)
loss_pi_cost = (
    lagrange.lagrangian_multiplier * coef * ratio * adv_b + temp_kl
).mean()

Here args.cup_gamma is the GAE discount factor, args.cup_lambda is the cost GAE lambda, ratio is the importance-sampling ratio, adv_b is the cost advantage, and temp_kl is the KL divergence between the current and old policies.
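
Putting these pieces together, a minimal sketch of the projection loop might look as follows. The policy.actor network returning a distribution, the actor_optimizer, and the in-loop computation of ratio and temp_kl are assumptions for illustration, not SafePO's exact code:

import torch
from torch.distributions import Normal

for obs_b, act_b, log_prob_b, adv_b, old_mean_b, old_std_b in dataloader:
    # Re-evaluate the current policy on the batch.
    distribution = policy.actor(obs_b)
    log_prob = distribution.log_prob(act_b).sum(dim=-1)
    ratio = torch.exp(log_prob - log_prob_b)

    # KL divergence to the pre-update (old) policy.
    old_distribution_b = Normal(loc=old_mean_b, scale=old_std_b)
    temp_kl = torch.distributions.kl_divergence(distribution, old_distribution_b).sum(-1)

    # Projection loss: penalize the cost advantage while staying close to the old policy.
    coef = (1 - args.cup_gamma * args.cup_lambda) / (1 - args.cup_gamma)
    loss_pi_cost = (
        lagrange.lagrangian_multiplier * coef * ratio * adv_b + temp_kl
    ).mean()

    actor_optimizer.zero_grad()
    loss_pi_cost.backward()
    actor_optimizer.step()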

FOCOPS combines a Lagrangian multiplier with a projection step to bring the policy back to the safe set.

  • First, get the data from the buffer and finish the pre-computation.

import torch
from torch.distributions import Normal

# Distribution of the pre-update (old) policy.
old_distribution_b = Normal(loc=old_mean_b, scale=old_std_b)

# Importance-sampling ratio and KL divergence under the current policy.
distribution = policy.actor(obs_b)
log_prob = distribution.log_prob(act_b).sum(dim=-1)
ratio = torch.exp(log_prob - log_prob_b)
temp_kl = torch.distributions.kl_divergence(
    distribution, old_distribution_b
).sum(-1, keepdim=True)
  • Then, update the policy using the advantage and the KL divergence.

loss_pi = (temp_kl - (1 / args.focops_lam) * ratio * adv_b) * (
    temp_kl.detach() <= args.focops_eta
).type(torch.float32)

Here temp_kl is the KL divergence, ratio is the importance-sampling ratio, adv_b is the reward advantage, and args.focops_lam and args.focops_eta are the FOCOPS hyperparameters.
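
For completeness, a minimal sketch of how the pieces above could be turned into a single gradient step; the mean reduction and the actor_optimizer are assumptions, not necessarily SafePO's exact update:

# loss_pi is computed as in the snippet above; the indicator masks out samples
# whose KL already exceeds args.focops_eta, so only in-region samples contribute.
loss_pi = loss_pi.mean()  # assumed reduction before backpropagation

actor_optimizer.zero_grad()
loss_pi.backward()
actor_optimizer.step()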