Welcome to SafePO’s documentation!

Safe Policy Optimization is a comprehensive algorithm benchmark for Safe Reinforcement Learning (Safe RL).


Overall Algorithm Performance Analysis

The figure compares the algorithms in SafePO across environments and tasks by showing the distribution of EpCost over the entire training process. The area of each distribution indicates how concentrated EpCost was during training. Several observations follow:


  • CPO is more stable than the Lagrangian approach, so its EpCost distribution is more concentrated; however, it violates the constraint more often.

  • The PID Lagrangian method CPPOPID is more stable than the conventional Lagrangian approach.

  • PPOLag oscillates strongly but adheres to the constraint better, as shown by its relatively lower overall EpCost.

  • PCPO behaves much like CPO, while FOCOPS and CUP fall between PPOLag and CPO.
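
To make the contrast between the conventional and PID Lagrangian approaches more concrete, the following is a minimal textbook-style sketch of the two multiplier updates. It is not SafePO's implementation: the class names, field names, and gain values (lr, kp, ki, kd) are hypothetical and chosen only for illustration. The conventional update integrates the constraint violation (EpCost minus the cost limit), while the PID variant adds proportional and derivative terms, which is the usual explanation for its damped oscillations.

```python
from dataclasses import dataclass


@dataclass
class LagrangeMultiplier:
    """Conventional (integral-only) update; hypothetical helper, not SafePO code."""
    cost_limit: float
    lr: float = 0.05      # illustrative step size
    value: float = 0.0

    def update(self, ep_cost: float) -> float:
        # Gradient ascent on the multiplier, projected onto [0, inf).
        self.value = max(0.0, self.value + self.lr * (ep_cost - self.cost_limit))
        return self.value


@dataclass
class PIDLagrangeMultiplier:
    """PID-controlled multiplier; gains are arbitrary illustrative values."""
    cost_limit: float
    kp: float = 0.1
    ki: float = 0.05
    kd: float = 0.1
    integral: float = 0.0
    prev_cost: float = 0.0
    value: float = 0.0

    def update(self, ep_cost: float) -> float:
        error = ep_cost - self.cost_limit
        # Integral term accumulates violation and is clipped at zero.
        self.integral = max(0.0, self.integral + error)
        # Derivative term reacts only to rising cost.
        derivative = max(0.0, ep_cost - self.prev_cost)
        self.prev_cost = ep_cost
        self.value = max(
            0.0, self.kp * error + self.ki * self.integral + self.kd * derivative
        )
        return self.value


if __name__ == "__main__":
    plain = LagrangeMultiplier(cost_limit=25.0)
    pid = PIDLagrangeMultiplier(cost_limit=25.0)
    for ep_cost in [40.0, 35.0, 20.0]:  # made-up EpCost values
        print(plain.update(ep_cost), pid.update(ep_cost))
```

With kp and kd set to zero and ki equal to lr, the PID update in this sketch reduces to the conventional one, so the extra terms can be read as the only difference between the two.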

Easy Start

Run the SafePO benchmark with a single command:

make benchmark

You can then inspect the runs in safepo/runs. After that, the results (evaluation outcomes and training curves) are available in safepo/results.