Welcome to SafePO’s documentation!
Safe Policy Optimization is a comprehensive algorithm benchmark for Safe Reinforcement Learning (Safe RL).
Overall Algorithm Performance Analysis
This figure compares the algorithms in the SafePO framework across different environments and tasks,
and shows the distribution of EpCost over the entire training process.
The shaded area indicates how concentrated EpCost is during training.
Several observations follow from the figure:
Hint
CPO is more stable than the Lagrangian approach, resulting in a more concentrated distribution of EpCost; however, it violates the constraint more frequently.
The PID Lagrangian method CPPOPID is more stable than the conventional Lagrangian approach.
PPOLag, though marked by pronounced oscillations, adheres to the constraints better, as shown by its relatively lower overall EpCost value.
PCPO closely parallels the characteristics of CPO, while FOCOPS and CUP can be seen as striking a balance between PPOLag and CPO.
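The observations above rest on two properties of an EpCost series: how concentrated it is and how often it exceeds the cost limit. The following is a minimal sketch of how those two properties could be quantified. It is not part of SafePO: the function name `summarize_ep_cost`, the use of the standard deviation as the measure of concentration, the cost limit of 25.0, and the toy cost values are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def summarize_ep_cost(ep_costs, cost_limit):
    """Summarize a sequence of episode costs (hypothetical helper, not a SafePO API).

    Returns the mean cost, the population standard deviation (used here as a
    simple proxy for how concentrated the distribution is), and the fraction
    of episodes whose cost is strictly above the cost limit.
    """
    if not ep_costs:
        raise ValueError("ep_costs must not be empty")
    violations = sum(1 for cost in ep_costs if cost > cost_limit)
    return {
        "mean": mean(ep_costs),
        "std": pstdev(ep_costs),
        "violation_rate": violations / len(ep_costs),
    }


# Toy values for illustration only; these are NOT SafePO results.
concentrated = [26.0, 27.0, 26.0, 27.0]  # tight spread, but above the limit
oscillating = [5.0, 40.0, 10.0, 25.0]    # wide spread, lower on average
cost_limit = 25.0                        # assumed limit for this sketch

for name, costs in [("concentrated", concentrated), ("oscillating", oscillating)]:
    print(name, summarize_ep_cost(costs, cost_limit))
```

In this toy setup, a series can be tightly concentrated and still violate the limit often, while a strongly oscillating series can have a lower mean cost. This is the kind of trade-off the observations describe.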
Easy Start
Run the SafePO benchmark with a single command:
make benchmark
You can then check the runs in safepo/runs and the
results (evaluation outcomes, training curves) in safepo/results.
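To see what the benchmark wrote, the two output directories can be listed with a few lines of Python. This is a minimal sketch that assumes only the directory names given above and that the script is started from the repository root. It makes no assumption about the file names or layout inside those directories, and `list_files` is a hypothetical helper, not a SafePO function.

```python
from pathlib import Path


def list_files(root):
    """Return the sorted relative paths of all files under root.

    Returns an empty list if root does not exist, so the script can be run
    before the benchmark has produced any output.
    """
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(
        str(path.relative_to(root)) for path in root.rglob("*") if path.is_file()
    )


# Directory names taken from the text above; paths are relative to the
# current working directory (assumed to be the repository root).
for directory in ("safepo/runs", "safepo/results"):
    files = list_files(directory)
    print(f"{directory}: {len(files)} file(s)")
    for name in files:
        print("  ", name)
```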