Lagrangian Methods#
Experiment Results#
Implementation Details#
Note
All experiments are run for a total of 1e7 steps, except for the Doggo agent, where 1e8 steps are used. This setting is the same as in Safety-Gym.
Environment Wrapper#
During our experiments, we identified several hyperparameters that have a noticeable influence on the algorithm's performance:

- obs_normalize: whether to normalize observations.
- reward_normalize: whether to normalize rewards.
- cost_normalize: whether to normalize costs.

Throughout the experimental trials, a consistent pattern emerged: the setting obs_normalize=True consistently yielded superior results.
Note
Notably, this finding does not carry over to the reward_normalize parameter: the setting reward_normalize=True does not invariably outperform reward_normalize=False, a trend particularly pronounced in the SafetyHopperVelocity-v1 and SafetyWalker2dVelocity-v1 environments.
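For intuition, observation normalization is typically implemented as a running mean and variance that standardize each incoming observation. The sketch below is illustrative only, assuming a Welford-style running update; the class name RunningObsNormalizer is not part of SafePO.

import numpy as np

# Illustrative running observation normalizer (not SafePO's actual class).
class RunningObsNormalizer:
    def __init__(self, shape, epsilon=1e-8):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.var = np.ones(shape, dtype=np.float64)
        self.count = epsilon  # avoid division by zero before the first update

    def update(self, obs):
        # Welford-style incremental update of mean and variance.
        self.count += 1
        delta = obs - self.mean
        self.mean += delta / self.count
        self.var += (delta * (obs - self.mean) - self.var) / self.count

    def normalize(self, obs):
        return (obs - self.mean) / np.sqrt(self.var + 1e-8)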
Therefore, we provide an environment wrapper to control the normalization of observations, rewards, and costs. A minimal setup function (named make_env here for illustration) looks like this:
import safety_gymnasium
# Wrapper import path assumed here; these classes are SafePO's environment wrappers.
from safepo.common.wrappers import SafeAutoResetWrapper, SafeNormalizeObservation, SafeRescaleAction, SafeUnsqueeze

def make_env(env_id, seed):
    # Create the base environment and record its spaces before wrapping.
    env = safety_gymnasium.make(env_id)
    env.reset(seed=seed)
    obs_space = env.observation_space
    act_space = env.action_space
    env = SafeAutoResetWrapper(env)          # auto-reset at episode end
    env = SafeRescaleAction(env, -1.0, 1.0)  # rescale actions to [-1, 1]
    env = SafeNormalizeObservation(env)      # running observation normalization
    env = SafeUnsqueeze(env)                 # add a batch dimension
    return env, obs_space, act_space
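With this in place, creating a training environment is a single call (make_env is just the illustrative name used above):

# Example usage of the wrapper function sketched above.
env, obs_space, act_space = make_env("SafetyHopperVelocity-v1", seed=0)
obs, info = env.reset(seed=0)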
Lagrangian Multiplier#
Lagrangian-based algorithms use a Lagrangian multiplier to control the safety constraint. The Lagrangian multiplier is an integrated part of SafePO.
Some key points:

- The implementation of the Lagrangian multiplier is based on the Adam optimizer for a smooth update.
- The Lagrangian multiplier is updated every epoch, based on the total cost violation of the current episodes.
Key implementation:
from safepo.common.lagrange import Lagrange
# setup lagrangian multiplier
COST_LIMIT = 25.0
LAGRANGIAN_MULTIPLIER_INIT = 0.001
LAGRANGIAN_MULTIPLIER_LR = 0.035
lagrange = Lagrange(
cost_limit=COST_LIMIT,
lagrangian_multiplier_init=LAGRANGIAN_MULTIPLIER_INIT,
lagrangian_multiplier_lr=LAGRANGIAN_MULTIPLIER_LR,
)
# update lagrangian multiplier
# suppose ep_cost is 50.0
ep_cost = 50.0
lagrange.update_lagrange_multiplier(ep_cost)
# use the lagrangian multiplier to trade off reward and cost advantages
advantage = data["adv_r"] - lagrange.lagrangian_multiplier * data["adv_c"]
advantage /= (lagrange.lagrangian_multiplier + 1)  # rescale to keep the magnitude stable
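Under the hood, the multiplier update is gradient ascent on the cost violation: the multiplier grows when the episode cost exceeds the cost limit and shrinks otherwise. The following is a minimal sketch of that idea, not SafePO's exact Lagrange class; the parameterization and clamping are assumptions.

import torch

# Minimal sketch of an Adam-based Lagrangian multiplier update
# (illustrative; not SafePO's actual Lagrange implementation).
class SimpleLagrange:
    def __init__(self, cost_limit, init=0.001, lr=0.035):
        self.cost_limit = cost_limit
        self.param = torch.nn.Parameter(torch.tensor(init))
        self.optimizer = torch.optim.Adam([self.param], lr=lr)

    @property
    def multiplier(self):
        # Project to keep the multiplier non-negative.
        return max(self.param.item(), 0.0)

    def update(self, mean_ep_cost):
        # Gradient ascent on lambda * (ep_cost - cost_limit):
        # lambda increases when episodes exceed the cost limit.
        loss = -self.param * (mean_ep_cost - self.cost_limit)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

In the example above, ep_cost = 50.0 exceeds the limit of 25.0, so the update raises the multiplier, which shifts the advantage toward penalizing cost.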
Please refer to Lagrangian Multiplier for more details.
Configuration Analysis#
The implementation of Lagrangian multiplier based algorithms is based on classic reinforcement learning algorithms, e.g. PPO
and TRPO
.
And the PPO
and TRPO
hyperparameters is basically the same as community version.
We listed the key hyperparameters as follows:
For the PPO-based algorithms:

- batch_size: 64
- gamma: 0.99
- lam: 0.95
- lam_c: 0.95
- clip: 0.2
- actor_lr: 3e-4
- critic_lr: 3e-4
- hidden_size: 64 for all agents, 256 for Doggo and Ant

For the TRPO-based algorithms:

- batch_size: 128
- gamma: 0.99
- lam: 0.95
- lam_c: 0.95
- critic_lr: 1e-3
- hidden_size: 64 for all agents, 256 for Doggo and Ant
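To show where gamma, lam, and lam_c enter, the sketch below computes Generalized Advantage Estimation from TD residuals; SafePO uses lam for reward advantages and lam_c for cost advantages. The function name and signature are illustrative, not SafePO's exact code.

import numpy as np

# Illustrative GAE over precomputed TD residuals:
# deltas[t] = r[t] + gamma * V(s[t+1]) - V(s[t]).
def gae(deltas, gamma=0.99, lam=0.95):
    adv = np.zeros_like(deltas, dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv

# Reward advantages use lam; cost advantages use lam_c (both 0.95 here).
# adv_r = gae(reward_deltas, gamma=0.99, lam=0.95)
# adv_c = gae(cost_deltas, gamma=0.99, lam=0.95)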