
[WIP] Improve training of DQN tutorial #2030


Closed
wants to merge 28 commits

Conversation

@SiftingSands (Contributor) commented Sep 7, 2022

Following up on the discussion from #2026.

I still need to do multiple runs to get a semblance of the statistics of # of episodes vs. duration for both the original and my changes. The slight increase in model capacity still only uses ~1.5 GB of VRAM, so it should be pretty accessible, and training is still relatively quick.

Here's the reward history for one run of these tweaks when I was doing a bunch of trial and error (spent an embarrassing amount of time tweaking hyperparameters and rewards).

[Figure: reward history for one run (duration_only_reward_03)]

@vmoens Feel free to change (or completely discard) anything based on your findings. I haven't tried tweaking anything else in the training pipeline.

Commit: …e penalty in the first few episodes to enable more exploration

netlify bot commented Sep 7, 2022: Deploy Preview for pytorch-tutorials-preview is ready at https://deploy-preview-2030--pytorch-tutorials-preview.netlify.app/intermediate/reinforcement_q_learning

@svekars requested a review from vmoens September 7, 2022 03:58
@vmoens (Contributor) left a comment

Looks like we're on the right track!

One thing I'm not super familiar or comfortable with is the batch norm; in general, many people don't use it in RL. In this case I would at least turn it off for the select_action function, i.e. for data collection (call policy.eval() before and policy.train() after).
For the target network I would call requires_grad_(False) rather than detaching, and I think we can avoid calling eval on it (the target net could stay in train mode, no?), so batch norm would still be "active" there.
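
A minimal sketch of those two suggestions, assuming the tutorial's policy_net / target_net names (the select_action signature here is illustrative, not the tutorial's exact one):

import random
import torch

def select_action(state, policy_net, eps_threshold, n_actions, device):
    # Put the policy in eval mode so batch norm uses its running statistics
    # during data collection, then restore train mode afterwards.
    policy_net.eval()
    with torch.no_grad():
        if random.random() > eps_threshold:
            action = policy_net(state).max(1)[1].view(1, 1)
        else:
            action = torch.tensor([[random.randrange(n_actions)]],
                                  device=device, dtype=torch.long)
    policy_net.train()
    return action

# For the target network, freeze its parameters instead of detaching outputs:
# target_net.requires_grad_(False)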

@SiftingSands (Contributor, Author) commented Sep 8, 2022

See if the last commit is what you had in mind. I decided to use torch.no_grad() instead of setting requires_grad=False on all of the target network's parameters. I agree it seems logical to have the batch norm behavior be the same for both the target and policy networks for the loss calculation; however, I wonder whether the target network only updating every 10 episodes makes that a moot point. I got it running in the background anyway.
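
For reference, wrapping the target computation in torch.no_grad() looks roughly like this (a sketch that assumes the tutorial's existing BATCH_SIZE, GAMMA, device, non_final_mask, non_final_next_states, and reward_batch variables):

import torch

with torch.no_grad():
    # Target network forward pass: nothing here is tracked by autograd,
    # so there is no need to detach or to toggle requires_grad on its parameters.
    next_state_values = torch.zeros(BATCH_SIZE, device=device)
    next_state_values[non_final_mask] = target_net(non_final_next_states).max(1)[0]
expected_state_action_values = (next_state_values * GAMMA) + reward_batch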

FYI I'm only doing 10 runs for each "config", so I'm sure the mean results (when I get around to making those graphs) will still be noisy. We could try removing batch norm altogether as well.

The only other thought I had right now was changing the replay memory sampling from uniform to something learned or a heuristic, but I'm on the fence due to the added complexity for a tutorial.

@svekars requested a review from vmoens September 8, 2022 17:57
@vmoens (Contributor) commented Sep 8, 2022

The only other thought I had right now was changing the replay memory sampling from uniform to something learned or a heuristic, but I'm on the fence due to the added complexity for a tutorial

I don't think that a prioritized RB will change much, and it would be a lot of trouble for a tutorial.
But we could update the target net more progressively with little effort (using a moving average, for instance, i.e. a soft update). A decay of 0.002 is typical, but we can go a bit higher than that depending on how easy the problem is to solve.

We could try removing batch norm altogether as well
I would be in favour of that!

IMO we should also check whether the data is properly normalized as-is; I don't think it is. We could compute the stats over, say, 100 images for each channel (i.e. a mean of 3 values and a std of 3 values) and normalize the data at every step with those to get something close to normally distributed.
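
A sketch of that idea, where collected_frames is a hypothetical list of ~100 screen/state tensors of shape (3, H, W) gathered before training:

import torch

frames = torch.stack(collected_frames)        # shape (N, 3, H, W)
channel_mean = frames.mean(dim=(0, 2, 3))     # one mean per channel (3 values)
channel_std = frames.std(dim=(0, 2, 3))       # one std per channel (3 values)

def normalize_state(state):
    # state: (1, 3, H, W); broadcast the per-channel stats over batch and space.
    return (state - channel_mean.view(1, 3, 1, 1)) / channel_std.view(1, 3, 1, 1)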

@SiftingSands (Contributor, Author) commented Sep 12, 2022

I decided to set up some experiment-tracking infrastructure (test-driving mlflow) so things are more manageable. I've run 5 configs so far, but with only 10 runs each the results are very noisy, as expected.

For example, here's a plot of mean ± 1 sigma and median ± 25%. cfg_01 uses the same settings as the first 4 commits here (the settings that produced the graph in the original post); baseline is the original tutorial's parameters.
[Figures: mean ± stdev and quartile plots (mean_stdev, quartiles)]

Instead of that mess, I decided to smooth the medians to make multiple runs readable. Here I'm using pandas.DataFrame.ewm (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.ewm.html) with alpha = 0.01 (heavy smoothing).
Here's an example comparing the raw median vs. the smoothed curve:
[Figure: median vs. smoothed median (smooth_example)]
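
For reference, that smoothing amounts to something like the following, where durations is an assumed list of per-episode durations (or the median across runs):

import pandas as pd

df = pd.DataFrame({"duration": durations})
smoothed = df["duration"].ewm(alpha=0.01).mean()   # heavy exponential smoothing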

So, here's the comparison between all of my runs so far. Gonna test out soft updates and input normalization afterwards.

  • cfg_01_vmoensBN is the most recent commit, with the changes suggested in #2030 (review)
  • cfg_01_BNdelete has all batch norm ops removed
  • I ran baseline_BNdelete after seeing how poor the performance was after removing batch norm, and I wanted a sanity check.

[Figure: comparison of all runs so far (initial_results)]

@vmoens (Contributor) commented Sep 12, 2022

Feels like I'm giving pretty bad suggestions, haha.
It would be nice to know how significant those differences are. From what I see in the plot above, it could be that the differences are not significant; am I right to assume that?
It could be nice to have a sense of what the loss value is, the value of the gradients, etc.
On my side I'm working on a torchrl tutorial based on the same example, so perhaps I'll have some insight to share about all this soon.

(Thanks for the plot! Do you think you could provide a version where we can see the scale of the returns more clearly? Here the log scale + small characters make it hard to read.)

@pseudo-rnd-thoughts (Contributor) commented Sep 12, 2022

@vmoens @SiftingSands Hey, I'm one of the maintainers of Gym and just saw this PR; I had noted we needed to update this tutorial for the latest changes in v26, so thanks for the PR.

Could it be updated to v26 instead of v25? We should specify which gym version this tutorial uses (see requirements.txt for the gym version).
If so, then the gym.make on line 77 will need to be updated to env = gym.make('CartPole-v0', render_mode='rgb_array').unwrapped

I also had a couple of questions about the current tutorial that you might know the answer to

  1. Why is the environment unwrapped from gym.make? This seems to be to remove the time limit wrapper.
  2. Is the cartpole environment being treated as an infinite-horizon problem with no time limit (this might be the answer to question 1)? If so, we should mention this assumption, as it is not true for the majority of problems; otherwise, we should add the termination variable to the maths equations.

Couple of issues

  1. "Install using pip install gym" on line 49 should be pip install gym[classic_control] to install pygame automatically which is necessary for the rendering
  2. "It makes rewards from the uncertain far future less important for our agent than the ones in the near future that it can be fairly confident about." should note that this also encourages agents to collect reward closer in time than equivalent rewards temporally future away
  3. \delta = Q(s, a) - (r + \gamma \max_a Q(s', a)) the max_a should probably be max_a' to avoid confusion with Q(s, a).
  4. "Below, num_episodes is set small. You should download the notebook and run lot more epsiodes, such as 300+ for meaningful duration improvements." on 449 is incorrect currently as num_episodes=1000

Suggested upgrades

  1. Add a docstring explaining the BATCH_SIZE, GAMMA and TARGET_UPDATE variables.
  2. Use env.action_space.sample() instead of random.randrange(n_actions) on line 350.
  3. I believe param.grad.data.clamp_(-1, 1) on line 437 is gradient clipping. Maybe add a note about this, as it is common in RL.
  4. Change done to terminated on line 467, as this is the true meaning of the variable.
  5. Line 470 does reward shaping; however, to my knowledge, there isn't an explanation of this anywhere.

@SiftingSands (Contributor, Author)

(Replying to the gym version question and questions 1 and 2 above:)

unwrapped was in Adam's original tutorial, so I didn't bother changing it. I did notice the time limit being unbounded, when there were durations over 200 even though cartpole-v0 was specified.

(Replying to the "Couple of issues" and "Suggested upgrades" lists above:)

Thanks for the suggestions on the documentation! It all seems reasonable to me, but I'll defer to @vmoens on further discussion of the doc changes.
The increase in episodes and reward shaping are very much experimental, as this PR is really just brainstorming (hence the WIP).

@vmoens (Contributor) commented Sep 13, 2022

Hi @pseudo-rnd-thoughts
Thanks for those suggestions. Indeed, it's important that our RL tutorials use gym as it should be used.
We'll make the changes you suggested.

I don't think we should unwrap the environment, that does not seem like something we want to do.

The number of episodes being run is indeed puzzling in this tutorial, and I guess most users wonder why so few are proposed. We should definitely increase that.

For the gradient clipping I believe we should use torch.nn.utils.clip_grad_value_() instead; since it's a PyTorch tutorial, let's use PyTorch tools!
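
A sketch of that swap in the optimization step, using the tutorial's loss, optimizer, and policy_net names (the clip value of 1.0 simply mirrors the current clamp_(-1, 1)):

import torch

optimizer.zero_grad()
loss.backward()
# Clip each gradient value in place, replacing the manual
# param.grad.data.clamp_(-1, 1) loop.
torch.nn.utils.clip_grad_value_(policy_net.parameters(), clip_value=1.0)
optimizer.step()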

+1 on the done to terminated change.

In a nutshell there is work to do!

@SiftingSands feel free to make those edits in your PR, I can give a hand if needed.

@SiftingSands (Contributor, Author) commented Sep 17, 2022

I decided to try out offline normalization like what's done for ImageNet-trained models (e.g. Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))). We could do a moving-average mean and std dev calculation as you suggested, but maybe this gets 99% of the benefit.

Regardless, I tracked the average of the mean and std dev of the "state" over 2000 episodes in the figure below (the x axis is actually "steps" rather than episodes). I think the sharp movements are due to the beginning of a new episode. I was hoping the std dev would level out, but I expect it to stay within the range [0.006, 0.007]. The mean is, unsurprisingly, effectively zero, because the "state" is a frame difference.

[Figure: running mean and std dev of the state over 2000 episodes (state_stat_history)]

At the end of 2000 episodes these are the numerical results for each channel. The transform that I ended up using was normalize(state, mean=[0,0,0], std=[0.00656, 0.0106, 0.0152])

Channel 0 mean: 1.7123858279475807e-06, std: 0.006567582970755784
Channel 1 mean: 1.1795560506103564e-06, std: 0.010622139538439322
Channel 2 mean: 8.115399832946285e-07, std: 0.015237523709654841

I ran the input normalization on the baseline and on my changes from the initial post, cfg_01. Looks like a slight improvement, but maybe that will evaporate if run for more episodes and repeats? Same sentiment as baseline vs cfg_01, to answer your question from
#2030 (comment)

From what I see in the plot above, it could be that the differences are not significant; am I right to assume that?

(I switched the y axis to linear scaling and increased the font size a bit; it should show up fine if you click on the figure to expand.)
[Figure: offline normalization results (offlineNorm_results)]

@vmoens Find any improvements on your end? I can still look into soft updates, since that should be a pretty simple change. I've been saving the best model from each run, so I still need to run them in the environment and see whether they actually perform.

@pseudo-rnd-thoughts (Contributor)

Hey, is there any timeline on this PR, and what are its aims? That isn't clear from the initial comment.

If we re-add the TimeLimit wrapper with CartPole-v1 (defaults to 500), then an easy aim could be a script that trains a neural network to achieve on average X reward? I'm not sure what a reasonable reward is.
This seems like a very helpful tutorial for users, and I worry that building it around every fancy tool RL researchers use will encourage tools that are not helpful in a lot of settings to become standard.

@vmoens (Contributor) commented Sep 19, 2022

Hey, is there any timeline on this PR, and what are its aims? That isn't clear from the initial comment.

If we re-add the TimeLimit wrapper with CartPole-v1 (defaults to 500), then an easy aim could be a script that trains a neural network to achieve on average X reward? I'm not sure what a reasonable reward is. This seems like a very helpful tutorial for users, and I worry that building it around every fancy tool RL researchers use will encourage tools that are not helpful in a lot of settings to become standard.

I agree that the tuto is useful as-is, but it looks bad if it does not train... I would advise against anything fancy (like reward shaping), but proper normalization of the observations or good usage of gym's API, for instance, seem to be crucial points.

To narrow the scope, our goal is basically to have a version of the tuto that trains out of the box (which was far from being the case before).

@SiftingSands (Contributor, Author) commented Sep 20, 2022

@vmoens Here are the results for soft updates of the target network using τ = 0.002 (θ′ ← τθ + (1 − τ)θ′). Updates were done at every step instead of after a predefined number of episodes. The baseline run now includes offline input normalization. I reran the baseline because I updated Gym to v0.26.1 and updated various API calls (switched to CartPole-v1), so the results are a bit different from yesterday's plot.

[Figure: soft-update results (softupdates)]
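
For reference, the per-step soft update described above amounts to something like this (a sketch using the tutorial's policy_net / target_net names):

TAU = 0.002  # soft-update coefficient

# theta' <- tau * theta + (1 - tau) * theta', applied after every optimizer step
target_net_state_dict = target_net.state_dict()
policy_net_state_dict = policy_net.state_dict()
for key in policy_net_state_dict:
    target_net_state_dict[key] = (
        TAU * policy_net_state_dict[key]
        + (1.0 - TAU) * target_net_state_dict[key]
    )
target_net.load_state_dict(target_net_state_dict)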

I didn't bother running it with my changes, since the reward shaping was quite a hack. I might try tweaking a few parameters in the next day or so, but I won't be able to contribute too much more on this in the near term.

@vmoens (Contributor) commented Sep 21, 2022

@SiftingSands I got it running using TorchRL, I'll make a PR on https://github.com/facebookresearch/rl by EOD and ping you with it

EDIT
@SiftingSands here's the tuto
pytorch/rl#474

In a nutshell, we used the following config:

  • environment
    • instead of the diff from frame to frame, we used a stack of 4 consecutive frames as input
    • we turn them to grayscale and 64x64
  • return computation
    • td(lambda) works considerably better than td(0) (which is what we use in the pytorch tutorial)
  • We used a duelling DQN network. Nothing fancy, but we have the primitive in torchrl, so why not use it :) (a minimal sketch of the idea follows below)
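
For readers unfamiliar with the term, a dueling DQN head splits the Q-value into a state-value stream and an advantage stream; a minimal sketch of the idea (not the TorchRL implementation):

import torch.nn as nn

class DuelingHead(nn.Module):
    """Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""

    def __init__(self, in_features, n_actions):
        super().__init__()
        self.value = nn.Linear(in_features, 1)               # state-value stream V(s)
        self.advantage = nn.Linear(in_features, n_actions)   # advantage stream A(s, a)

    def forward(self, features):
        v = self.value(features)
        a = self.advantage(features)
        return v + a - a.mean(dim=1, keepdim=True)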

For the rest the tuto should be self-explanatory.

Here are the results (I truncated the run early, but previously I was able to get better performance by running it for twice as long):

[Figure: TorchRL DQN training results]

@SiftingSands (Contributor, Author) commented Sep 23, 2022

@vmoens Thanks for all the hard work on that tutorial! My understanding is that the current implementation is roughly on par with your td(0) results? The overall trend of the training history of episode lengths is very reminiscent of the current results if I look at a single one of my repeated runs instead of the statistics over the 10 repeats.

As you mentioned, the addition of td(\lambda) is the most significant performance improvement. However, the current tutorial's replay buffer only holds state transitions instead of trajectories. Correct me if I'm wrong, but a td(\lambda) implementation would require this tutorial to have a replay buffer of trajectories instead of state transitions? If so, I'm afraid I can't commit to making the necessary updates at this time. However, I can open another PR for the minor gym API and doc changes (if of interest), and someone else can take the reins from there.

@vmoens (Contributor) commented Sep 23, 2022

As you mentioned, the addition of td(\lambda) is the most significant performance improvement. However, the current tutorial's replay buffer only holds state transitions instead of trajectories

yeah so here are our options:

  • Store trajectories in the replay buffer. Not sure how problematic that would be; in torchrl it's extremely simple, but I can imagine that changing the current RB to do that would require a bit of work. Other than that, we'd need to provide an implementation of td(lambda), which is also beyond the scope of this tuto.
  • Distributional DQN, noisy linear layers for exploration, etc. (stuff from Rainbow). Again, a lot of work for a wide-audience tutorial.
  • Just leave a comment in the tuto referring to this discussion or to another implementation saying that more advanced stuff can be found there.

I think that here we're bound by the constraints of what the tutorial was built on initially. For instance, we're using CartPole (where the reward is always 1, i.e. in some sense it is sparse, as only the done/terminated signal is informative), which isn't the easiest task to solve in a few iterations (not saying it's hard, but it requires more trials, especially with td(0)). I'm also surprised by the choice to do this with pixels. I get that it is more appealing, but it's much harder than learning from vector states, and again, for a tutorial that needs to be solved in a few thousand episodes, I'd be surprised if we could do any better than what we're doing here.

@pseudo-rnd-thoughts (Contributor)

Somehow I did not realise this before, but why are we doing this as a vision-based version of CartPole?
If this is meant to be an introduction to reinforcement learning with Gym and a neural-network-based DQN, this seems like a strange first tutorial.

Could we not do an initial tutorial using the raw state observations with a neural-network-based DQN, and then make an advanced tutorial for Atari or Car Racing, problems that are only vision-based?

@SiftingSands (Contributor, Author)

I think that here we're bound by the constraints of what the tutorial was built on initially. For instance, we're using CartPole (where the reward is always 1, i.e. in some sense it is sparse, as only the done/terminated signal is informative), which isn't the easiest task to solve in a few iterations (not saying it's hard, but it requires more trials, especially with td(0)). I'm also surprised by the choice to do this with pixels. I get that it is more appealing, but it's much harder than learning from vector states, and again, for a tutorial that needs to be solved in a few thousand episodes, I'd be surprised if we could do any better than what we're doing here.

Agreed, this level of performance is probably all we can hope for with vanilla DQN on pixels (trying to stay in the "spirit" of the original tutorial). I was initially expecting better results within 1000 episodes (the tutorial starts off with 50 and only suggests 300+), because this is part of the official PyTorch docs.

@pseudo-rnd-thoughts Vision based cartpole is a "legacy" choice from the original author of the tutorial. I would defer to @vmoens on the decision to pivot to a state vector input.

@vmoens I can do the 3rd option you outlined and drop in a mention of your torchrl tutorial. Just let me know if I should do that, and I'll open up a new PR. Unfortunately, that's all I can commit to at this time. (Although switching to a non-pixel input should be very straightforward and would likely achieve good performance within 1000 episodes?)

@vmoens (Contributor) commented Sep 23, 2022

Let me check with the authors of the tutorial just to make sure we're not rushing into anything, but in general I'd be supportive of switching to a state-based tutorial.
Having a tuto that does not learn is counterproductive IMO.

EDIT: upon reflection I think that we should move to a state-based tuto and make the PR. Do you want to work on that @SiftingSands or should I take care of it? You've already done a lot so no pressure, I'd be happy to take it over.

@SiftingSands (Contributor, Author) commented Sep 24, 2022

Let me check with the authors of the tutorial just to make sure we're not rushing into anything, but in general I'd be supportive of switching to a state-based tutorial. Having a tuto that does not learn is counterproductive IMO.

EDIT: upon reflection I think that we should move to a state-based tuto and make the PR. Do you want to work on that @SiftingSands or should I take care of it? You've already done a lot so no pressure, I'd be happy to take it over.

Sounds like a plan. I'll push all of my changes to the PR this weekend. I switched to the full state-vector input with a 3-layer MLP, AdamW instead of RMSprop (massive improvement), and made minor changes to gamma, epsilon decay, and tau (kept soft updates) -> got the duration plateauing at 500 steps within 150 episodes, with some annoying dips that quickly recover. I didn't even bother with any form of input normalization/scaling, since this setup really doesn't need it.
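
A sketch of the kind of network and optimizer described above (the 128-unit hidden layers and the learning rate are assumptions for illustration, not necessarily the final tutorial's values):

import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, n_observations, n_actions):
        super().__init__()
        # Simple 3-layer MLP over the 4-dimensional CartPole state vector.
        self.net = nn.Sequential(
            nn.Linear(n_observations, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, x):
        return self.net(x)

policy_net = DQN(n_observations=4, n_actions=2)  # CartPole-v1: 4 state dims, 2 actions
optimizer = torch.optim.AdamW(policy_net.parameters(), lr=1e-4)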

(I didn't make a fancy chart for this one; episode duration vs. episode #)
[Figure: newplot(1)]

EDIT: Never mind, I decreased the LR from 3e-4 to 1e-4 in order to kill the dips, but then it plateaus a few hundred episodes later. I was hoping to keep 3e-4 in reference to a somewhat old meme: https://twitter.com/karpathy/status/801621764144971776?lang=en
[Figure: newplot(2)]

…in a few 100 episodes. removed all code related to image processing. added timelimit wrapper. added soft updates.
@pseudo-rnd-thoughts (Contributor)

@SiftingSands thanks for the changes.
Given that we have changed the problem from an infinite-horizon problem to a finite-horizon (time-limited) one, I think we should change some of the maths to reflect termination and truncation.
We also need a note on the gym version used.

@SiftingSands (Contributor, Author)

Yes, my proposed code would be

if gym.__version__[:4] == '0.26':
    env = gym.make('CartPole-v1')
elif gym.__version__[:4] == '0.25':
    env = gym.make('CartPole-v1', new_step_api=True)
else:
    raise ImportError(f"Requires gym v25 or v26, actual version: {gym.__version__}")

Great! Yeah, it looks like that was the one change that got overwritten by https://github.com/pytorch/tutorials/pull/2073/files. I can put that into a new commit if that's most convenient. (I'm not sure if you can directly make changes to this PR, since I believe only "maintainers" can do that.) Just let me know what you think is best.

I think limiting the # of episodes to 600 is an acceptable change. I guess we'll wait for @malfet's feedback, since he was going to benchmark it on Colab.

@vmoens (Contributor) commented Oct 15, 2022

What happens when gym moves to 0.27? Do we expect this not to run?
Can't we use version.parse to compare versions?
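
Something like the following would be one way to do that, assuming the packaging library is available (illustrative only, not necessarily what the tutorial should ship):

from packaging import version
import gym

if version.parse(gym.__version__) >= version.parse("0.26.0"):
    env = gym.make("CartPole-v1")
else:
    env = gym.make("CartPole-v1", new_step_api=True)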

IMO this kind of version-dependent code in a tutorial is counterproductive... it might be confusing for our first-time users, no?

I agree with @malfet; we could have two different numbers of episodes depending on the presence of CUDA.

@pseudo-rnd-thoughts (Contributor) commented Oct 15, 2022

We (the current maintenance team) are not planning on making a Gym v27 release; there will be an announcement explaining why, hopefully released on Monday.

Edit: Next Monday
Edit: https://farama.org/Announcing-The-Farama-Foundation

@pseudo-rnd-thoughts (Contributor)

Yes, my proposed code would be

if gym.__version__[:4] == '0.26':
    env = gym.make('CartPole-v1')
elif gym.__version__[:4] == '0.25':
    env = gym.make('CartPole-v1', new_step_api=True)
else:
    raise ImportError(f"Requires gym v25 or v26, actual version: {gym.__version__}")

I realised it is uglier than this: on reset, we also need to check the gym version, as v25 will default to returning just the observation while v26 will return the observation and info. I don't know an elegant way of solving this for both other than adding an if statement like the one above (we can simplify it knowing that the version must be 0.26 or 0.25 now).

@SiftingSands (Contributor, Author)

Yes, my proposed code would be

if gym.__version__[:4] == '0.26':
    env = gym.make('CartPole-v1')
elif gym.__version__[:4] == '0.25':
    env = gym.make('CartPole-v1', new_step_api=True)
else:
    raise ImportError(f"Requires gym v25 or v26, actual version: {gym.__version__}")

I realised it is uglier than this: on reset, we also need to check the gym version, as v25 will default to returning just the observation while v26 will return the observation and info. I don't know an elegant way of solving this for both other than adding an if statement like the one above (we can simplify it knowing that the version must be 0.26 or 0.25 now).

@pseudo-rnd-thoughts I looked for the announcement about Gym's versioning roadmap that you mentioned last week, but I didn't see anything. No one else has commented on limiting the tutorial to v25 or v26, so let me know if that code snippet is still what you think is the best way forward. If so, I'll make those changes and add the logic for changing the number of episodes depending on whether CUDA is available. Hopefully, we can then finally close this PR.

@pseudo-rnd-thoughts (Contributor) commented Oct 25, 2022

Currently, PyTorch tutorials uses Gym v25.
Therefore, I agree it would be easier to use a single version, both for maintenance and ease of reading; we would need to update the other RL tutorials at the same time.

Edit: There is already a PR for updating the Mario RL tutorial to v26, #2069.
So it should just be a matter of updating the requirements to v26 and checking that the only CI issues are from the Mario RL tutorial. Then, if both PRs are ready to merge at the same time, one could be merged just before the other to minimise CI error time (if that makes sense).

@pseudo-rnd-thoughts (Contributor)

@malfet @SiftingSands I don't think it is possible to support both v26 and v25. The CI is currently failing on obs, _ = env.reset(), which for v25 needs to be obs, _ = env.reset(return_info=True).

As #2069 is already updating to v26, I would merge #2069 first and then merge this PR

Therefore, I wouldn't bother with supporting both and would just support v26.
It means that this PR will fail CI until #2069 is merged.

Do you both agree or think there is a better plan?

@SiftingSands (Contributor, Author)

@malfet @SiftingSands I don't think it is possible to support both v26 and v25. The CI is currently failing on obs, _ = env.reset(), which for v25 needs to be obs, _ = env.reset(return_info=True).

As #2069 is already updating to v26, I would merge #2069 first and then merge this PR

Therefore, I wouldn't bother with supporting both and would just support v26. It means that this PR will fail CI until #2069 is merged.

Do you both agree or think there is a better plan?

@pseudo-rnd-thoughts Sorry for the delayed response. I'm fine with waiting until #2069 is merged and only supporting v.26. However, I'm not a maintainer of the pytorch tutorials repo, so @malfet would probably have the final say.

I'm guessing there's a reason why this won't work, besides being ugly?

if gym.__version__[:4] == '0.26':
    obs, _ = env.reset() 
elif gym.__version__[:4] == '0.25':
    obs, _ = env.reset(return_info=True)

@pseudo-rnd-thoughts (Contributor)

There shouldn't be an issue with that code; I just dislike it for being ugly, and #2069 is already planning on updating to v26.
The problem is that on line 264 you don't include that if/else statement for resetting with info.

@SiftingSands (Contributor, Author)

There shouldn't be an issue with that code; I just dislike it for being ugly, and #2069 is already planning on updating to v26. The problem is that on line 264 you don't include that if/else statement for resetting with info.

I made the changes for v25, because nobody has replied to your last comment on #2069 yet. It looks like CI now passes (I got tired of not seeing green). I'll leave it up to @malfet or @vmoens whether they want to merge this now or wait.

@vmoens (Contributor) left a comment

LGTM! Thanks for the effort @SiftingSands and @pseudo-rnd-thoughts

@SiftingSands (Contributor, Author)

@svekars I believe this will resolve #263

@SiftingSands (Contributor, Author)

What's the hold up on merging this? @vmoens has already approved this.

@arr00 commented Nov 28, 2022

Why isn't this being merged? I followed the current tutorial with no luck but this modified version works perfectly.

@vmoens (Contributor) commented Nov 30, 2022

What's the hold up on merging this? @vmoens has already approved this.

Can you solve the conflicts? I'll make sure that we merge it after that

@SiftingSands (Contributor, Author) commented Dec 1, 2022

What's the hold up on merging this? @vmoens has already approved this.

Can you solve the conflicts? I'll make sure that we merge it after that

Conflict is resolved, but CI is failing on intermediate_source/ax_multiobjective_nas_tutorial.py?

@vmoens (Contributor) commented Dec 8, 2022

Closed by #2145

@SiftingSands deleted the DQN_revise_training branch December 8, 2022 21:31
@SiftingSands (Contributor, Author)

Closed by #2145

@vmoens Thanks for taking a look at those CI issues and getting this merged!

@carljparker mentioned this pull request Mar 15, 2023
Labels: cla signed, rl

7 participants