First Order Projection Methods#
Experiment Results#
Implementation Details#
Note
All experiments are run for a total of 1e7 steps, except for the Doggo agent, which uses 1e8 steps. This setting is the same as in Safety-Gym.
Environment Wrapper#
During our experiments, we found that the following hyperparameters have a noticeable influence on the algorithm's performance:

- obs_normalize, which controls the normalization of observations.
- reward_normalize, which controls the normalization of rewards.
- cost_normalize, which controls the normalization of costs.
Across the experimental trials, a consistent pattern emerged: setting obs_normalize=True consistently yielded better results.
Note
Notably, the same does not hold for the reward_normalize parameter. Setting reward_normalize=True does not always outperform reward_normalize=False, a trend particularly pronounced in the SafetyHopperVelocity-v1 and SafetyWalker2dVelocity-v1 environments.
Therefore, we provide an environment wrapper to control the normalization of observations, rewards, and costs:
import safety_gymnasium

# The Safe* wrapper classes are provided by SafePO; the function name here is illustrative.
def make_env(env_id, seed):
    env = safety_gymnasium.make(env_id)
    env.reset(seed=seed)
    obs_space = env.observation_space
    act_space = env.action_space
    env = SafeAutoResetWrapper(env)          # auto-reset episodes on termination
    env = SafeRescaleAction(env, -1.0, 1.0)  # rescale actions to [-1, 1]
    env = SafeNormalizeObservation(env)      # normalize observations
    env = SafeUnsqueeze(env)                 # add a batch dimension
    return env, obs_space, act_space
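A minimal usage sketch, assuming the helper above is named make_env and using an example Safety-Gymnasium task id:

# "SafetyPointGoal1-v0" is just an example task id from Safety-Gymnasium
env, obs_space, act_space = make_env("SafetyPointGoal1-v0", seed=0)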
Lagrangian Multiplier#
Lagrangian-based algorithms use a Lagrangian multiplier to control the safety constraint. The Lagrangian multiplier is an integral part of SafePO.
Some key points:
- The Lagrangian multiplier is updated with the Adam optimizer for smooth updates.
- The Lagrangian multiplier is updated every epoch, based on the total cost violation of the current episodes.
Key implementation:
from safepo.common.lagrange import Lagrange
# setup lagrangian multiplier
COST_LIMIT = 25.0
LAGRANGIAN_MULTIPLIER_INIT = 0.001
LAGRANGIAN_MULTIPLIER_LR = 0.035
lagrange = Lagrange(
cost_limit=COST_LIMIT,
lagrangian_multiplier_init=LAGRANGIAN_MULTIPLIER_INIT,
lagrangian_multiplier_lr=LAGRANGIAN_MULTIPLIER_LR,
)
# update lagrangian multiplier
# suppose ep_cost is 50.0
ep_cost = 50.0
lagrange.update_lagrange_multiplier(ep_cost)
# use the lagrangian multiplier to correct the advantage
advantage = data["adv_r"] - lagrange.lagrangian_multiplier * data["adv_c"]
advantage /= (lagrange.lagrangian_multiplier + 1)
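Conceptually, update_lagrange_multiplier performs gradient ascent on the constraint violation: the multiplier grows when the episode cost exceeds the cost limit and shrinks otherwise. A minimal sketch of this idea (illustrative only, not the exact SafePO Lagrange class):

import torch

class SimpleLagrange:
    def __init__(self, cost_limit, init, lr):
        self.cost_limit = cost_limit
        # learnable scalar multiplier, kept non-negative after every step
        self.multiplier = torch.nn.Parameter(torch.tensor(init), requires_grad=True)
        self.optimizer = torch.optim.Adam([self.multiplier], lr=lr)

    def update(self, mean_ep_cost):
        # minimizing -multiplier * (J_c - d) increases the multiplier when the
        # episode cost exceeds the cost limit d, and decreases it otherwise
        loss = -self.multiplier * (mean_ep_cost - self.cost_limit)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        self.multiplier.data.clamp_(min=0.0)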
Please refer to Lagrangian Multiplier for more details.
Projection Implementation#
The key idea of CUP and FOCOPS is to project the policy back into the safe set. A more detailed theoretical analysis can be found here. Below we show how SafePO implements the two-stage projection:
CUP first makes a PPO update to improve the reward, then projects the policy back into the safe set. We focus on the projection step; a sketch of the reward-improvement stage is given below for context.
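The first stage is a standard clipped PPO objective. A minimal sketch, where ratio, adv_r, and the clip range are assumed inputs rather than SafePO's exact code:

import torch

def ppo_clip_loss(ratio, adv_r, clip=0.2):
    # clipped surrogate objective: take the pessimistic (minimum) of the
    # unclipped and clipped terms, negated so it can be minimized
    surr1 = ratio * adv_r
    surr2 = torch.clamp(ratio, 1.0 - clip, 1.0 + clip) * adv_r
    return -torch.min(surr1, surr2).mean()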
Get the cost advantage from the buffer and prepare the training data.
from torch.utils.data import DataLoader, TensorDataset

# data, old_mean and old_std come from the rollout buffer
advantage = data["adv_c"]
dataloader = DataLoader(
    dataset=TensorDataset(
        data["obs"], data["act"], data["log_prob"], advantage, old_mean, old_std
    ),
    batch_size=64,
    shuffle=True,
)
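The projection loss below uses ratio and temp_kl, which are recomputed for every minibatch. A minimal sketch of that per-minibatch computation, assuming policy.actor returns a torch Normal distribution and the variable names match the dataloader above:

import torch
from torch.distributions import Normal

for obs_b, act_b, log_prob_b, adv_b, old_mean_b, old_std_b in dataloader:
    old_distribution_b = Normal(loc=old_mean_b, scale=old_std_b)
    distribution = policy.actor(obs_b)
    log_prob = distribution.log_prob(act_b).sum(dim=-1)
    ratio = torch.exp(log_prob - log_prob_b)
    temp_kl = torch.distributions.kl_divergence(
        distribution, old_distribution_b
    ).sum(-1, keepdim=True)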
Update the policy using the cost advantage and the KL divergence.
# projection loss: penalize the cost advantage (scaled by the Lagrangian
# multiplier) while staying close in KL to the policy from the reward update
coef = (1 - args.cup_gamma * args.cup_lambda) / (1 - args.cup_gamma)
loss_pi_cost = (
    lagrange.lagrangian_multiplier * coef * ratio * adv_b + temp_kl
).mean()
Here args.cup_gamma is the GAE gamma, args.cup_lambda is the cost GAE lambda, ratio is the importance sampling ratio, adv_b is the cost advantage, and temp_kl is the KL divergence.
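As a quick numeric check, with illustrative values (not necessarily SafePO's defaults) of cup_gamma = 0.99 and cup_lambda = 0.95, the coefficient evaluates to roughly 5.95:

cup_gamma, cup_lambda = 0.99, 0.95  # illustrative values
coef = (1 - cup_gamma * cup_lambda) / (1 - cup_gamma)
print(coef)  # approximately 5.95, since (1 - 0.9405) / 0.01 = 5.95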
FOCOPS combines a Lagrangian multiplier with a projection to bring the policy back into the safe set.
First, get the data from the buffer and perform the pre-computation.
import torch
from torch.distributions import Normal

old_distribution_b = Normal(loc=old_mean_b, scale=old_std_b)
distribution = policy.actor(obs_b)
log_prob = distribution.log_prob(act_b).sum(dim=-1)
ratio = torch.exp(log_prob - log_prob_b)
temp_kl = torch.distributions.kl_divergence(
    distribution, old_distribution_b
).sum(-1, keepdim=True)
Then, update the policy using the advantage and the KL divergence.
# the indicator (temp_kl <= focops_eta) masks out samples whose KL already
# exceeds the trust-region bound, so they do not contribute to the update
loss_pi = (temp_kl - (1 / args.focops_lam) * ratio * adv_b) * (
    temp_kl.detach() <= args.focops_eta
).type(torch.float32)
Here temp_kl is the KL divergence, ratio is the importance sampling ratio, adv_b is the reward advantage, and args.focops_lam and args.focops_eta are the FOCOPS hyperparameters.
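Finally, the per-sample loss is averaged over the minibatch and used for a standard gradient step. A minimal sketch, assuming an actor_optimizer over the policy parameters (the optimizer name is illustrative, not SafePO's exact variable):

# average the masked per-sample loss and take a gradient step on the actor
actor_optimizer.zero_grad()
loss_pi.mean().backward()
actor_optimizer.step()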