Trained with the improved reward function, DISTANCE_PENALTY=4, MINOR_SAFETY_PENALTY=1 and MAJOR_SAFETY_PENALTY=5. No noise during training.
Same reward function. Trained with unbiased (zero-mean) noise of standard deviation 0.1.
Same reward function. Trained with unbiased (zero-mean) noise of standard deviation 1.0.
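
The three training setups above can be summarized in a small sketch. The penalty constants and the noise standard deviations come from the text. Everything else is an assumption made for illustration: the `TrainingConfig` container, the `add_noise` helper, the choice of a Gaussian distribution, and the idea that noise is added element-wise to a list of values. The text does not say what the noise is applied to or how it is distributed.

```python
import random
from dataclasses import dataclass

# Penalty constants as stated in the text.
DISTANCE_PENALTY = 4
MINOR_SAFETY_PENALTY = 1
MAJOR_SAFETY_PENALTY = 5


@dataclass(frozen=True)
class TrainingConfig:
    """Hypothetical container for one training run (not from the source)."""
    name: str
    noise_std: float  # standard deviation of the zero-mean training noise


# The three runs described above; the names are illustrative only.
CONFIGS = [
    TrainingConfig("no_noise", 0.0),
    TrainingConfig("noise_std_0.1", 0.1),
    TrainingConfig("noise_std_1.0", 1.0),
]


def add_noise(values, noise_std, rng):
    """Return a copy of values with zero-mean noise added.

    Assumption: the noise is Gaussian. The text only says it is unbiased
    and gives its standard deviation.
    """
    if noise_std == 0.0:
        return list(values)
    return [v + rng.gauss(0.0, noise_std) for v in values]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs match
    for cfg in CONFIGS:
        print(cfg.name, add_noise([1.0, 2.0, 3.0], cfg.noise_std, rng))
```

Keeping the noise level as the only field that differs between configurations mirrors the text, where all three runs share the same reward function and penalties.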