Lagrangian Methods#
Experiment Results#
Implementation Details#
Note
All experiments were run for a total of 1e7 steps, except for the Doggo agent, which used 1e8 steps. This setting is the same as in Safety-Gym.
Environment Wrapper#
In our experiments, we found that the following hyperparameters noticeably affect the algorithm's performance:
obs_normalize: controls the normalization of observations.
reward_normalize: controls the normalization of rewards.
cost_normalize: controls the normalization of costs.
Across our experiments, setting obs_normalize=True consistently yielded
better results.
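As a rough illustration of what observation normalization usually involves, the sketch below keeps a running mean and variance per observation dimension and uses them to standardize incoming observations. The class name RunningNormalizer and its methods are hypothetical and written only for this example; this is not the code behind SafeNormalizeObservation.

```python
import math


class RunningNormalizer:
    """Running mean/variance per observation dimension (Welford's algorithm)."""

    def __init__(self, dim, eps=1e-8):
        self.count = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, obs):
        """Fold one observation into the running statistics."""
        self.count += 1
        for i, x in enumerate(obs):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (x - self.mean[i])

    def normalize(self, obs):
        """Shift and scale an observation by the running statistics."""
        if self.count < 2:
            return list(obs)  # not enough data to estimate a variance yet
        out = []
        for i, x in enumerate(obs):
            var = self.m2[i] / self.count  # population variance
            out.append((x - self.mean[i]) / math.sqrt(var + self.eps))
        return out


normalizer = RunningNormalizer(dim=2)
for obs in ([0.0, 10.0], [2.0, 30.0], [4.0, 50.0]):
    normalizer.update(obs)

# Dimensions with very different scales end up on a comparable scale.
normalized = normalizer.normalize([2.0, 30.0])
```

Because the statistics are updated online, the same observation can map to different normalized values early and late in training, which is one plausible reason normalization settings interact with algorithm performance.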
Note
The same does not hold for the reward_normalize parameter. Setting
reward_normalize=True does not always outperform
reward_normalize=False, especially in the
SafetyHopperVelocity-v1 and SafetyWalker2dVelocity-v1
environments.
Therefore, we provide an environment wrapper to control the normalization of observations, rewards, and costs:
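import safety_gymnasium
# The import path below is assumed; adjust it to where the wrappers live in your install.
from safepo.common.wrappers import SafeAutoResetWrapper, SafeNormalizeObservation, SafeRescaleAction, SafeUnsqueeze


def make_env(env_id, seed):  # function name is illustrative
    env = safety_gymnasium.make(env_id)
    env.reset(seed=seed)
    # record the spaces before wrapping
    obs_space = env.observation_space
    act_space = env.action_space
    env = SafeAutoResetWrapper(env)
    env = SafeRescaleAction(env, -1.0, 1.0)
    env = SafeNormalizeObservation(env)
    env = SafeUnsqueeze(env)
    return env, obs_space, act_space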
Lagrangian Multiplier#
Lagrangian-based algorithms use a Lagrangian multiplier to enforce the safety
constraint. The Lagrangian multiplier is an integral part of
SafePO.
Some key points:
The implementation of the Lagrangian multiplier is based on the Adam optimizer for a smooth update.
The Lagrangian multiplier is updated every epoch based on the total cost violation of the current episodes.
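To make the two points above concrete, here is a minimal pure-Python sketch of a multiplier updated by a hand-written Adam step. The class SketchLagrange, the loss form -multiplier * (cost - limit), and the clamp at zero are assumptions made for illustration; they are not taken from SafePO's Lagrange class.

```python
import math


class SketchLagrange:
    """Illustrative multiplier with a hand-written Adam step (hypothetical class)."""

    def __init__(self, cost_limit, init, lr, beta1=0.9, beta2=0.999, eps=1e-8):
        self.cost_limit = cost_limit
        self.multiplier = init
        self.lr = lr
        self.beta1, self.beta2, self.eps = beta1, beta2, eps
        self.m = 0.0  # first-moment estimate
        self.v = 0.0  # second-moment estimate
        self.t = 0    # number of updates so far

    def update(self, mean_ep_cost):
        # Assumed loss: -multiplier * (cost - limit); its gradient w.r.t. the multiplier:
        grad = -(mean_ep_cost - self.cost_limit)
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad * grad
        m_hat = self.m / (1 - self.beta1 ** self.t)  # bias correction
        v_hat = self.v / (1 - self.beta2 ** self.t)
        self.multiplier -= self.lr * m_hat / (math.sqrt(v_hat) + self.eps)
        # Keep the multiplier non-negative (assumed projection step).
        self.multiplier = max(self.multiplier, 0.0)


lag = SketchLagrange(cost_limit=25.0, init=0.001, lr=0.035)
lag.update(50.0)  # cost above the limit, so the multiplier grows
```

On the first update Adam's step is close to the learning rate in magnitude, so the multiplier moves from 0.001 to about 0.036 regardless of how large the violation is. This scale-insensitivity is what makes the update smooth compared with plain gradient ascent on the violation.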
Key implementation:
from safepo.common.lagrange import Lagrange
# setup lagrangian multiplier
COST_LIMIT = 25.0
LAGRANGIAN_MULTIPLIER_INIT = 0.001
LAGRANGIAN_MULTIPLIER_LR = 0.035
lagrange = Lagrange(
cost_limit=COST_LIMIT,
lagrangian_multiplier_init=LAGRANGIAN_MULTIPLIER_INIT,
lagrangian_multiplier_lr=LAGRANGIAN_MULTIPLIER_LR,
)
# update lagrangian multiplier
# suppose ep_cost is 50.0
ep_cost = 50.0
lagrange.update_lagrange_multiplier(ep_cost)
# use lagrangian multiplier to control the advantage
advantage = data["adv_r"] - lagrange.lagrangian_multiplier * data["adv_c"]
advantage /= (lagrange.lagrangian_multiplier + 1)
Please refer to Lagrangian Multiplier for more details.
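The advantage mixing in the snippet above can be checked by hand on a few numbers. The sketch below applies the same formula to plain Python lists; the helper name combine_advantages and the sample values are made up for this example.

```python
def combine_advantages(adv_r, adv_c, multiplier):
    """Blend reward and cost advantages, then rescale by 1 / (multiplier + 1)."""
    return [(r - multiplier * c) / (multiplier + 1.0) for r, c in zip(adv_r, adv_c)]


adv_r = [1.0, 0.5, -0.2]  # reward advantages (made-up values)
adv_c = [0.0, 1.0, 2.0]   # cost advantages (made-up values)

# With a zero multiplier the cost advantages are ignored entirely.
no_penalty = combine_advantages(adv_r, adv_c, multiplier=0.0)

# With multiplier 1.0, reward and cost advantages are weighted equally.
strong_penalty = combine_advantages(adv_r, adv_c, multiplier=1.0)
```

Dividing by multiplier + 1 keeps the combined advantage on a scale comparable to the reward advantage even when the multiplier becomes large, so the effective policy step size does not grow with the penalty.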
Configuration Analysis#
The Lagrangian multiplier based algorithms are built on classic reinforcement learning algorithms, e.g. PPO and TRPO.
The PPO and TRPO hyperparameters are essentially the same as in the community versions.
The key hyperparameters are listed below:
batch_size: 64
gamma: 0.99
lam: 0.95
lam_c: 0.95
clip: 0.2
actor_lr: 3e-4
critic_lr: 3e-4
hidden_size: 64 for all agents while 256 for Doggo and Ant

batch_size: 128
gamma: 0.99
lam: 0.95
lam_c: 0.95
critic_lr: 1e-3
hidden_size: 64 for all agents while 256 for Doggo and Ant
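For readers unfamiliar with how gamma, lam, and lam_c are typically used, the sketch below computes generalized advantage estimates for one trajectory, once for rewards with lam and once for costs with lam_c. It assumes these hyperparameters play their usual GAE roles; the function compute_gae and the sample trajectory are hypothetical and not taken from SafePO.

```python
def compute_gae(signals, values, last_value, gamma, lam):
    """Generalized advantage estimation over one trajectory without early termination.

    signals: per-step rewards (or costs); values: critic estimates for each step;
    last_value: critic estimate for the state after the final step.
    """
    advantages = [0.0] * len(signals)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(signals))):
        # One-step temporal-difference error.
        delta = signals[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# Made-up trajectory data for illustration.
rewards = [1.0, 0.0, 2.0]
costs = [0.0, 1.0, 0.0]
values_r = [0.5, 0.4, 0.9]
values_c = [0.2, 0.6, 0.1]

adv_r = compute_gae(rewards, values_r, last_value=0.0, gamma=0.99, lam=0.95)
adv_c = compute_gae(costs, values_c, last_value=0.0, gamma=0.99, lam=0.95)  # lam_c
```

Keeping separate lam and lam_c values allows the bias-variance trade-off of the cost advantage to be tuned independently of the reward advantage, even though both are set to 0.95 in the lists above.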