Calculate best episode using full episode return in cartpole_swingup.
Return is non-monotonic in this problem; updating `_best_episode` on every step cherry-picks the peak of the running return within an episode rather than the full episode return.

Also applied the same change to the base cartpole environment for consistency and efficiency; cartpole's return is monotonic, so the per-step update was not a bug there.

PiperOrigin-RevId: 308033113
Change-Id: I9add00d41f8e87d518e00c3fef9cd9ad7ad18d0b
DeepMind authored and copybara-github committed Apr 23, 2020
1 parent f9b74bf commit beb1630
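To see the bug concretely, here is a minimal sketch (illustrative only; the reward sequence is made up, not taken from the environment). In cartpole_swingup the reward includes a move cost, so per-step rewards can be negative and the running return is non-monotonic:

# Hypothetical per-step rewards for one episode. cartpole_swingup subtracts
# a move cost from the reward, so the running return can decrease.
rewards = [1.0, 1.0, -0.5, -0.5]

episode_return = 0.0
peak_running_return = float('-inf')  # what the old per-step max tracked
for r in rewards:
    episode_return += r
    peak_running_return = max(peak_running_return, episode_return)

print(peak_running_return)  # 2.0: within-episode peak (old, buggy value)
print(episode_return)       # 1.0: full episode return (what the fix records)

The diffs below move the `max` update into the end-of-episode branch, where `_episode_return` equals the full episode return.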
Showing 2 changed files with 2 additions and 2 deletions.
bsuite/environments/cartpole.py (1 addition, 1 deletion)
@@ -145,9 +145,9 @@ def step(self, action):
     reward = 1. if is_reward else 0.
     self._raw_return += reward
     self._episode_return += reward
-    self._best_episode = max(self._episode_return, self._best_episode)

     if self._state.time_elapsed > self._max_time or not is_reward:
+      self._best_episode = max(self._episode_return, self._best_episode)
       self._reset_next_step = True
       return dm_env.termination(reward=reward, observation=self.observation)
     return dm_env.transition(reward=reward, observation=self.observation)
bsuite/experiments/cartpole_swingup/cartpole_swingup.py (1 addition, 1 deletion)
@@ -111,11 +111,11 @@ def step(self, action):
       self._total_upright += 1
     self._raw_return += reward
     self._episode_return += reward
-    self._best_episode = max(self._episode_return, self._best_episode)

     is_end_of_episode = (self._state.time_elapsed > self._max_time
                          or np.abs(self._state.x) > self._x_threshold)
     if is_end_of_episode:
+      self._best_episode = max(self._episode_return, self._best_episode)
       self._reset_next_step = True
       return dm_env.termination(reward=reward, observation=self.observation)
     else:  # continuing transition.
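For reference, the same bookkeeping pattern in isolation: a sketch with the environment details stripped away. `reward` and `is_end_of_episode` stand in for the real reward computation and termination check, and the reset of `episode_return` is inlined here even though the real environments reset it elsewhere (these are assumptions for illustration, not code from this commit):

class ReturnTracker:
    """Post-fix bookkeeping: track the best *completed* episode return."""

    def __init__(self):
        self.raw_return = 0.0
        self.episode_return = 0.0
        self.best_episode = float('-inf')

    def step(self, reward, is_end_of_episode):
        self.raw_return += reward
        self.episode_return += reward
        if is_end_of_episode:
            # Update only at termination, when episode_return is the full
            # episode return rather than a within-episode running value.
            self.best_episode = max(self.episode_return, self.best_episode)
            self.episode_return = 0.0  # the real environments do this on reset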
