First Order Projection Methods#
Experiment Results#
Implementation Details#
Note
All experiments are ran under total 1e7 steps, while in the Doggo agent, 1e8 steps are used. This setting is the same as Safety-Gym
Environment Wrapper#
In the course of our experimental investigations, we have discerned certain hyperparameters that wield a discernible influence upon the algorithm’s performance:
The parameter denoted as
obs_normalize, which pertains to the normalization of observations.reward_normalize, governing the normalization of rewards.cost_normalize, governing the normalization of costs.
Throughout the experimental trials, a consistent pattern emerged,
wherein the setting obs_normalize=True consistently yielded superior
results.
Note
Significantly, the outcome is not uniformly corroborated when it comes
to the reward_normalize parameter. Its affirmative setting
reward_normalize=True does not invariably outperform the negative
counterpart reward_normalize=False, a trend particularly pronounced
in the SafetyHopperVelocity-v1 and SafetyWalker2dVelocity-v1
environments.
Therefore, We make the environment wrapper to control the normalization of observations, rewards and costs:
env = safety_gymnasium.make(env_id)
env.reset(seed=seed)
obs_space = env.observation_space
act_space = env.action_space
env = SafeAutoResetWrapper(env)
env = SafeRescaleAction(env, -1.0, 1.0)
env = SafeNormalizeObservation(env)
env = SafeUnsqueeze(env)
return env, obs_space, act_space
Lagrangian Multiplier#
Lagrangian-based algorithms use Lagrangian Multiplier to control the safety
constraint. The Lagrangian Multiplier is an Integrated part of
SafePO.
Some key points:
The implementation of
Lagrangian Multiplieris based onAdamoptimizer for a smooth update.The
Lagrangian Multiplieris updated every epoch based on the total cost violation of current episodes.
Key implementation:
from safepo.common.lagrange import Lagrange
# setup lagrangian multiplier
COST_LIMIT = 25.0
LAGRANGIAN_MULTIPLIER_INIT = 0.001
LAGRANGIAN_MULTIPLIER_LR = 0.035
lagrange = Lagrange(
cost_limit=COST_LIMIT,
lagrangian_multiplier_init=LAGRANGIAN_MULTIPLIER_INIT,
lagrangian_multiplier_lr=LAGRANGIAN_MULTIPLIER_LR,
)
# update lagrangian multiplier
# suppose ep_cost is 50.0
ep_cost = 50.0
lagrange.update_lagrange_multiplier(ep_cost)
# use lagrangian multiplier to control the advanatge
advantage = data["adv_r"] - lagrange.lagrangian_multiplier * data["adv_c"]
advantage /= (lagrange.lagrangian_multiplier + 1)
Please refer to Lagrangian Multiplier for more details.
Projection Implementation#
The key idea of CUP and FOCOPS is projecting the policy back to the safe set.
A more detailed theoretical analysis can be found in here.
We provide how SafePO implements the two stage projection:
CUP first make a PPO update to improve the policy reward. Then it projects the policy back to the safe set. We will focus on the projection part.
Get the cost advantage from buffer and prepare training data.
advantage = data["adv_c"]
dataloader = DataLoader(
dataset=TensorDataset(
data["obs"], data["act"], data["log_prob"], advantage, old_mean, old_std
),
batch_size=64,
shuffle=True,
)
Update the policy by using cost advantage and kl divergence.
coef = (1 - args.cup_gamma * args.cup_lambda) / (1 - args.cup_gamma)
loss_pi_cost = (
lagrange.lagrangian_multiplier * coef * ratio * adv_b + temp_kl
).mean()
Where args.cup_gamma is the GAE gamma, args.cup_lambda is the cost GAE lambda, ratio is the importance sampling ratio, adv_b is the cost advantage, temp_kl is the kl divergence.
FOCOPS uses a lagrangian multiplier combined with projection to project the policy back to the safe set.
First, get the data from buffer and finish pre-computation.
old_distribution_b = Normal(loc=old_mean_b, scale=old_std_b)
distribution = policy.actor(obs_b)
log_prob = distribution.log_prob(act_b).sum(dim=-1)
ratio = torch.exp(log_prob - log_prob_b)
temp_kl = torch.distributions.kl_divergence(
distribution, old_distribution_b
).sum(-1, keepdim=True)
Then, update the policy by using cost advantage and kl divergence.
loss_pi = (temp_kl - (1 / args.focops_lam) * ratio * adv_b) * (
temp_kl.detach() <= args.focops_eta
).type(torch.float32)
Where temp_kl is the kl divergence, ratio is the importance sampling ratio, adv_b is the reward advantage, args.focops_lam and args.focops_eta are the hyperparameters of FOCOPS.