Customization of Algorithms#
Trustworthy Classic RL Algorithms#
As Safe RL algorithms are built on top of classic RL algorithms, a trustworthy implementation of the classic RL algorithms is required. SafePO provides a set of classic RL algorithms: PPO, NaturalPG, and TRPO.
To verify the correctness of these classic RL algorithms, we report their performance in the MuJoCo Velocity environments.
Integrated Safe RL Pipeline#
SafePO’s classic RL algorithms are integrated with the Safe RL pipeline, though they make no use of the constraint, so you can build customized Safe RL algorithms on top of them.
Briefly, the PPO implementation in SafePO has the following components, which are also suitable for other customizations of Safe RL algorithms:
VectorizedOnPolicyBuffer: a vectorized buffer supporting cost advantage estimation.
ActorVCritic: an actor-critic network supporting cost value estimation.
Lagrange: a Lagrangian multiplier for constraint violation control.
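To make the role of the Lagrangian multiplier concrete, here is a minimal, self-contained sketch of a naive multiplier update (an illustration only, not SafePO's actual Lagrange class; the class and parameter names are invented):

```python
class SimpleLagrange:
    """Naive Lagrangian multiplier: gradient ascent on the dual variable."""

    def __init__(self, cost_limit: float, init: float = 0.001, lr: float = 0.05):
        self.cost_limit = cost_limit
        self.lr = lr
        self.lagrangian_multiplier = init

    def update(self, mean_ep_cost: float) -> None:
        # Ascend on lambda * (episode_cost - cost_limit): the multiplier
        # grows while the constraint is violated and shrinks once the
        # episode cost falls below the limit.
        self.lagrangian_multiplier += self.lr * (mean_ep_cost - self.cost_limit)
        # The multiplier must stay non-negative.
        self.lagrangian_multiplier = max(self.lagrangian_multiplier, 0.0)


lag = SimpleLagrange(cost_limit=25.0)
lag.update(mean_ep_cost=40.0)  # constraint violated, so lambda increases
```

Because this scheme reacts only to the current violation, it tends to overshoot and oscillate around the cost limit, which is what motivates the PID-based variant that SafePO also offers.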
Beyond the above components, the PPO implementation in SafePO also provides a training pipeline for data collection and training. You can customize new algorithms based on it.
Next, we provide a detailed example showing how to customize the PPO algorithm into the PPO-Lag algorithm.
Example: PPO-Lag#
The Lagrangian multiplier is a useful tool for controlling constraint violation in Safe RL algorithms. Classic RL algorithms combined with the Lagrangian multiplier serve as trustworthy baselines for Safe RL algorithms.
Note
SafePO provides both a naive Lagrangian multiplier and a PID-based Lagrangian multiplier. The former suffers from oscillation, while the latter is more stable.
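The PID-based variant adds proportional and derivative terms on top of the integral-style update, which damps the oscillation of the naive scheme. Here is a rough sketch (the class name and gain values are illustrative, not SafePO's defaults):

```python
class PIDLagrange:
    """Illustrative PID-based Lagrangian multiplier update."""

    def __init__(self, cost_limit: float, kp: float = 0.1,
                 ki: float = 0.01, kd: float = 0.01):
        self.cost_limit = cost_limit
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0     # accumulated constraint violation
        self.prev_cost = 0.0    # last observed episode cost

    def update(self, ep_cost: float) -> None:
        error = ep_cost - self.cost_limit
        # I term: accumulates violation, clipped to stay non-negative.
        self.integral = max(self.integral + error, 0.0)
        # D term: reacts to cost increases, damping overshoot.
        derivative = max(ep_cost - self.prev_cost, 0.0)
        self.prev_cost = ep_cost
        # The P and D terms respond quickly; the I term removes
        # steady-state constraint violation.
        self.lagrangian_multiplier = max(
            self.kp * error + self.ki * self.integral + self.kd * derivative,
            0.0,
        )


pid = PIDLagrange(cost_limit=25.0)
pid.update(ep_cost=40.0)  # violation raises the multiplier
```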
Here we provide an example of using the Lagrangian multiplier in the PPO algorithm.
First, import the Lagrange class.
from safepo.common.lagrange import Lagrange
Second, initialize the Lagrange class.
lagrange = Lagrange(
    cost_limit=args.cost_limit,
    lagrangian_multiplier_init=args.lagrangian_multiplier_init,
    lagrangian_multiplier_lr=args.lagrangian_multiplier_lr,
)
Third, update the Lagrange class with the mean episode cost.
ep_costs = logger.get_stats("Metrics/EpCost")
lagrange.update_lagrange_multiplier(ep_costs)
Finally, use the Lagrangian multiplier to compute the penalized advantage for updating the policy network.
advantage = data["adv_r"] - lagrange.lagrangian_multiplier * data["adv_c"]
advantage /= (lagrange.lagrangian_multiplier + 1)
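As a quick sanity check on the penalized advantage above, here is a small numeric sketch (all values are made up for illustration):

```python
# Hypothetical values for illustration only.
lam = 0.5      # current Lagrangian multiplier
adv_r = 1.0    # reward advantage
adv_c = 0.8    # cost advantage

# Penalize the reward advantage by the cost advantage, weighted by lambda.
advantage = adv_r - lam * adv_c
# Dividing by (1 + lambda) rescales the result so its magnitude stays
# comparable as the multiplier grows during training.
advantage /= lam + 1.0
print(advantage)
```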
Note
Within only 10 lines of code, you can use the Lagrangian multiplier in the PPO algorithm.
The PPO framework is also suitable for other customizations of Safe RL algorithms.