Welcome to SafePO’s documentation!
Safe Policy Optimization is a comprehensive algorithm benchmark for Safe Reinforcement Learning (Safe RL).
Overall Algorithms Performance Analysis
This figure compares the algorithms in the SafePO framework across a range of environments and tasks,
and shows the distribution of EpCost over the entire training process.
The area shown indicates how concentrated EpCost is during training.
The figure supports several observations:
Hint
CPO is more stable than the Lagrangian approach, which gives a more concentrated distribution of EpCost; however, it violates the constraint more often.
The PID Lagrangian method CPPOPID is more stable than the conventional Lagrangian approach.
PPOLag, though marked by pronounced oscillations, adheres to the constraints better, as shown by its relatively lower overall EpCost value.
PCPO closely parallels CPO, while FOCOPS and CUP can be viewed as striking a balance between PPOLag and CPO.
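The two properties discussed above, how concentrated EpCost is and how often the constraint is violated, can be put into numbers with a few lines of Python. The following is a minimal sketch and not part of SafePO: `summarize_ep_cost` is a hypothetical helper, the interquartile range is only one possible measure of concentration, and both the cost limit of 25.0 and the sample values are made up for illustration.

```python
import statistics


def summarize_ep_cost(ep_costs, cost_limit):
    """Summarize a sequence of EpCost values recorded during training.

    Returns the mean, the interquartile range (smaller means a more
    concentrated distribution), and the fraction of values above cost_limit.
    """
    if len(ep_costs) < 2:
        raise ValueError("need at least two EpCost values")
    q1, _, q3 = statistics.quantiles(ep_costs, n=4)
    violations = sum(1 for cost in ep_costs if cost > cost_limit)
    return {
        "mean": statistics.fmean(ep_costs),
        "iqr": q3 - q1,
        "violation_rate": violations / len(ep_costs),
    }


# Made-up numbers for illustration only; these are NOT SafePO results.
concentrated = [27.0, 28.0, 26.0, 29.0, 27.0, 28.0]
oscillating = [5.0, 40.0, 10.0, 35.0, 8.0, 20.0]

for name, costs in [("concentrated", concentrated), ("oscillating", oscillating)]:
    # cost_limit=25.0 is an assumed threshold, chosen only for this example.
    print(name, summarize_ep_cost(costs, cost_limit=25.0))
```

In this toy data the concentrated series has a much smaller interquartile range but exceeds the assumed limit in every entry, while the oscillating series has a wider spread, a lower mean, and fewer violations. That is the kind of trade-off the observations above describe.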
Easy Start
Run the SafePO benchmark with a single command:
make benchmark
You can then inspect the runs in safepo/runs. After that, the results (evaluation outcomes and training curves) are available in safepo/results.
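To see what a benchmark run produced, a small script can list the files under both directories. This is a minimal sketch: `list_files` is a hypothetical helper rather than part of SafePO, it assumes nothing about the layout or file names inside the two directories, and it assumes the script is started from the repository root.

```python
from pathlib import Path


def list_files(root):
    """Return the sorted relative paths of all files below root.

    Returns an empty list if root does not exist or is not a directory.
    """
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file()
    )


for directory in ("safepo/runs", "safepo/results"):
    files = list_files(directory)
    print(f"{directory}: {len(files)} file(s)")
    for rel_path in files[:10]:  # print only the first few entries
        print("  ", rel_path)
```

If either directory is still empty or missing, the script reports zero files for it, which is a quick way to confirm whether the benchmark has finished writing its output.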