[WIP] Improve training of DQN tutorial #2030
Conversation
…e penalty in the first few episodes to enable more exploration
Looks like we're on the right track!
One thing I'm not super familiar / comfortable with is the batch norm. In general, many don't use it in RL. In this case I would at least turn it off for the select_action function, i.e. for data collection (call policy.eval() before and policy.train() after).
For the target network I would call requires_grad_(False) rather than detaching, and I think we can avoid calling eval() on it (the target net could stay in train mode, no?), hence batch norm would still be "active" there.
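Something like this, as a rough sketch (assuming the tutorial's `policy_net`, `target_net`, and epsilon-greedy `select_action`; names are only placeholders):

```python
import random
import torch

def select_action(state, policy_net, eps_threshold, n_actions, device):
    # Switch to eval mode so batch norm uses its running statistics while
    # collecting data, then restore train mode for the optimization steps.
    policy_net.eval()
    with torch.no_grad():
        if random.random() > eps_threshold:
            action = policy_net(state).max(1)[1].view(1, 1)
        else:
            action = torch.tensor([[random.randrange(n_actions)]],
                                  device=device, dtype=torch.long)
    policy_net.train()
    return action

# Freeze the target network's parameters instead of detaching its outputs.
# target_net.requires_grad_(False)
```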
… # of episodes to match for now
See if that last commit is what you had in mind. I decided to use torch.no_grad() instead of setting requires_grad=False on all of the params of the target network. I agree, it seems logical to have the batch norm behavior be the same for both the target and policy networks during the loss calculation. However, I wonder if the target network only updating every 10 episodes makes that a moot point. I got it running in the background anyway. FYI, I'm only doing 10 runs for each "config", so I'm sure the mean results (when I get around to making those graphs) will still be noisy. We could also try removing batch norm altogether. The only other thought I had right now was changing the replay memory sampling from uniform to something learned or a heuristic, but I'm on the fence due to the added complexity for a tutorial.
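Concretely, the target-value computation now looks roughly like this (a sketch only; the argument names mirror what the existing optimize_model() uses, so treat them as placeholders):

```python
import torch

def compute_td0_targets(target_net, non_final_mask, non_final_next_states,
                        reward_batch, gamma, batch_size, device):
    # V(s_{t+1}) from the target network, evaluated under no_grad so that no
    # autograd graph is built through the target parameters.
    next_state_values = torch.zeros(batch_size, device=device)
    with torch.no_grad():
        next_state_values[non_final_mask] = (
            target_net(non_final_next_states).max(1)[0]
        )
    # Bellman target used in the Huber loss against Q(s_t, a_t).
    return (next_state_values * gamma) + reward_batch
```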
I don't think that a prioritized RB will change much, and it would be a lot of trouble for a tutorial.
IMO we should also check whether the data is properly normalized as-is; I don't think it is. We could compute the stats over, say, 100 images for each channel (i.e. a mean of 3 values and a std of 3 values) and normalize the data at every step with those to get something close to normally distributed.
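Roughly what I have in mind (just a sketch; it assumes a `get_screen()`-style function that returns a 1 x 3 x H x W float tensor, as in the tutorial):

```python
import torch

def estimate_channel_stats(get_screen, n_samples=100):
    # Collect a batch of frames and compute one mean and one std per channel.
    frames = torch.cat([get_screen() for _ in range(n_samples)], dim=0)  # N x 3 x H x W
    mean = frames.mean(dim=(0, 2, 3))  # 3 values
    std = frames.std(dim=(0, 2, 3))    # 3 values
    return mean, std

def normalize_frame(frame, mean, std, eps=1e-8):
    # Per-channel normalization so inputs are roughly zero-mean, unit-variance.
    return (frame - mean.view(1, -1, 1, 1)) / (std.view(1, -1, 1, 1) + eps)
```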
I decided to set up some experiment tracking infrastructure (test-driving mlflow) so things are more manageable. I ran 5 configs so far, but only doing 10 runs for each gave very noisy results, as expected. For example, here's a plot of mean +/- 1 sigma and median +/- 25%. Instead of that mess, I decided to smooth the medians to make multiple runs readable, using https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.ewm.html. So, here's the comparison between all of my runs so far. Gonna test out soft updates and input normalization afterwards.
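For reference, the smoothing is just pandas' exponentially weighted mean applied to the per-episode medians, something along these lines (a sketch; `all_run_durations` and the span value are illustrative):

```python
import pandas as pd

# all_run_durations: list of per-run duration lists -> one column per run,
# one row per episode.
durations = pd.DataFrame(all_run_durations).T

# Median across runs, then exponentially weighted smoothing so the curves
# from multiple configs are readable on one plot.
median_per_episode = durations.median(axis=1)
smoothed = median_per_episode.ewm(span=50).mean()
```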
Feels like I'm giving pretty bad suggestions haha. (Thanks for the plot! Do you think you could provide a version where we can see the scale of the returns more clearly? Here the log scale + small characters make it hard to read.)
@vmoens @SiftingSands Hey, I'm one of the maintainers of gym and just saw this PR, as I noted we needed to update it with the latest changes in v26, so thanks for the PR. Could it be updated to v26 instead of v25? We should specify what gym version this tutorial uses; see requirements.txt for the gym version. I also had a couple of questions about the current tutorial that you might know the answer to.
Couple of issues
Suggested upgrades
Thanks for the suggestions on the documentation! It all seems reasonable to me, but I'll defer to @vmoens on further discussion of the doc changes.
Hi @pseudo-rnd-thoughts I don't think we should unwrap the environment; that does not seem like something we want to do. The number of episodes being run is indeed puzzling in this tutorial, and I guess most users wonder why so few are proposed. We should definitely increase that. For the gradient clipping I believe we should use
+1 on the done to terminated change. In a nutshell, there is work to do! @SiftingSands feel free to make those edits in your PR, I can give a hand if needed.
I decided to try out offline normalization like what's done for ImageNet-trained models (e.g. Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))). We could do a moving average and std dev calculation as you suggested, but maybe this gets 99% of the benefit. Regardless, I tracked the average of the mean and std dev of the "state" over 2000 episodes in the figure below (the x axis is actually "steps" instead of episodes). I think the sharp movements are due to the beginning of a new episode. I was hoping the std dev would level out, but I expect it to stay within the range of [0.006, 0.007]. The mean is, unsurprisingly, effectively zero, because the "state" is a frame difference. At the end of 2000 episodes these are the numerical results for each channel. The transform that I ended up using was
I ran the input normalization on the
(switched y axis to linear scaling and increased the font size a bit, but it should show up fine if you click on the figure to expand) @vmoens Find any improvements on your end? I can still look into soft updates, since that should be a pretty simple change. I've been saving the best model from each run, so I still gotta run them in the environment and see if they actually perform.
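For anyone following along, applying a fixed offline normalization like the one described above would look roughly like this (the measured values themselves aren't shown here, so the numbers below are placeholders based on the ~0 mean and ~0.006-0.007 std mentioned earlier):

```python
import torch
from torchvision import transforms

# Placeholder per-channel statistics -- substitute the values measured over
# the 2000 episodes (the mean of the frame difference is effectively zero).
FRAME_MEAN = (0.0, 0.0, 0.0)
FRAME_STD = (0.007, 0.007, 0.007)

normalize = transforms.Normalize(FRAME_MEAN, FRAME_STD)

def preprocess(frame_diff: torch.Tensor) -> torch.Tensor:
    # frame_diff: 3 x H x W float tensor (current screen minus last screen).
    return normalize(frame_diff)
```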
Hey, is there any timeline on this PR, and what are the aims of the PR? That isn't clear from the initial comment. If we re-add the TimeLimit wrapper with CartPole-v1 (defaults to 500), then an easy aim could be a script that trains a neural network to achieve on average X reward. I'm not sure what a reasonable reward is.
I agree that the tuto is useful as is, but it looks bad if it does not train... I would advise against anything fancy (like reward shaping), but proper normalization of the observations and good usage of gym's API, for instance, seem to be crucial points. To narrow the scope, our goal is basically to have a version of the tuto that trains out of the box (which was far from being the case before).
@vmoens Here are the results for soft updates of the target network using τ = 0.002 (θ′ ← τθ + (1 − τ)θ′). Updates were done at every step instead of after a predefined # of episodes. The baseline run now includes offline input normalization. I reran the baseline because I updated Gym to v0.26.1 and updated various API calls (switched to CartPole-v1), so the results are a bit different from the plot from yesterday. I didn't bother running it with my changes, since the reward shaping was quite a hack. I might try tweaking a few parameters in the next day or so, but I won't be able to contribute too much more on this in the near term.
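The soft update itself is just a Polyak average of the parameter tensors at every optimization step, roughly like this (a sketch; it assumes the tutorial's `policy_net` / `target_net` naming):

```python
TAU = 0.002

def soft_update(target_net, policy_net, tau=TAU):
    # theta' <- tau * theta + (1 - tau) * theta'
    target_state = target_net.state_dict()
    policy_state = policy_net.state_dict()
    for key in policy_state:
        target_state[key] = policy_state[key] * tau + target_state[key] * (1 - tau)
    target_net.load_state_dict(target_state)
```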
@SiftingSands I got it running using TorchRL; I'll make a PR on https://github.com/facebookresearch/rl by EOD and ping you with it. EDIT: In a nutshell, we used the following config:
For the rest, the tuto should be self-explanatory. Here are the results (I truncated the run early, but previously I was able to get better perf by running it for twice the time).
@vmoens Thanks for all the hard work on that tutorial! My understanding is that the current implementation is roughly on par with your td(0) results? The overall trend of the training history of episode lengths is very reminiscent of the current results if I look at a single one of my repeated runs instead of the statistics over the 10 repeats. As you mentioned, the addition of td(λ) is the most significant performance improvement. However, the current tutorial's replay buffer only holds state transitions instead of trajectories. Correct me if I'm wrong, but a td(λ) implementation would require this tutorial to have a replay buffer of trajectories instead of state transitions? If so, I'm afraid that I can't commit to making the necessary updates at this time. However, I can open another PR for the minor gym API and doc changes (if of interest), and someone else can take the reins from there.
yeah so here are our options:
I think that here we're bound by the constraints of what the tutorial was built on initially. For instance, we're using CartPole (where the reward is always 1, i.e. in some sense it is sparse, as only the done/terminated signal is informative), which isn't the easiest task to solve in a few iterations (not saying it's hard, but it requires more trials, especially with td(0)). I'm also surprised by the choice to do this with pixels. I get that this is more appealing, but it's much harder than learning from vector states, and again, for a tutorial that needs to be solved in a few thousand episodes, I'd be surprised if we could do any better than what we're doing here.
Somehow I did not realise this before, but why are we doing this as a vision-based version of CartPole? Could we not do an initial tutorial using the raw state observations with a neural-network-based DQN, and then make an advanced tutorial for Atari or Car Racing, problems that are only vision-based?
Agreed, this level of performance is probably all we can hope for with vanilla DQN on pixels (trying to stay in the "spirit" of the original tutorial). I was initially expecting better results within 1000 episodes (the tutorial starts off w/ 50 and only suggests 300+), because this was within the official PyTorch docs. @pseudo-rnd-thoughts Vision-based CartPole is a "legacy" choice from the original author of the tutorial. I would defer to @vmoens on the decision to pivot to a state vector input. @vmoens I can do the 3rd option you outlined and drop a mention to your torchrl tutorial. Just let me know if I should do that, and I'll open up a new PR. Unfortunately, that's all I can commit to at this time. (Although switching to a non-pixel input should be very straightforward and likely achieve good performance within 1000 episodes?)
Let me see with the authors of the tutorial just to make sure we're not rushing into anything, but in general I'd be supportive of switching to a state-based tutorial. EDIT: upon reflection I think that we should move to a state-based tuto and make the PR. Do you want to work on that @SiftingSands, or should I take care of it? You've already done a lot, so no pressure; I'd be happy to take it over.
Sounds like a plan. I'll push all of my changes to the PR this weekend. Switched to the full state vector input w/ a 3-layer MLP, AdamW instead of RMSProp (massive improvement), and minor changes to gamma, epsilon decay, and tau (kept soft updates) -> got the duration plateauing at 500 steps within 150 episodes, with some annoying dips that quickly recover. Didn't even bother with any form of input normalization/scaling, since this setup really doesn't need it. (Didn't make a fancy chart for this one, just episode duration vs episode #.) EDIT: Nevermind, I decreased the LR from 3e-4 to 1e-4 in order to kill the dips, but then it plateaus a few hundred episodes later. I was hoping to keep 3e-4 in reference to a somewhat old meme https://twitter.com/karpathy/status/801621764144971776?lang=en
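For reference, the network and optimizer boil down to something like this (a sketch; the hidden width of 128 is illustrative, and CartPole-v1 has 4 observations and 2 actions):

```python
import torch.nn as nn
import torch.optim as optim

class DQN(nn.Module):
    # 3-layer MLP acting directly on the raw CartPole state vector.
    def __init__(self, n_observations=4, n_actions=2, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_observations, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x):
        return self.net(x)

policy_net = DQN()
# AdamW instead of RMSProp; LR dropped from 3e-4 to 1e-4 to remove the dips.
optimizer = optim.AdamW(policy_net.parameters(), lr=1e-4)
```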
…in a few 100 episodes. removed all code related to image processing. added timelimit wrapper. added soft updates.
@SiftingSands thanks for the changes.
Great! Yeah, it looks like that was the one change that got overwritten due to https://github.com/pytorch/tutorials/pull/2073/files . I can put that into a new commit if that's the most convenient. (I'm not sure if you can directly make changes to this PR, since I believe only "maintainers" can do that.) Just let me know what you think is best. I think limiting the # of episodes to 600 is an acceptable change. I guess we'll wait for @malfet's feedback, since he was going to benchmark it on Colab.
What happens when gym moves to 0.27? Do we expect this not to run? IMO this kind of version-dependent code in a tutorial is counterproductive... might be confusing for our first-time users, no? I agree with @malfet, we could have 2 numbers of episodes depending on the presence of cuda.
We (the current maintenance team) are not planning on making a Gym v27 release; there will be an announcement explaining why, hopefully released on Monday. Edit: Next Monday
I realised it is more ugly than this, on |
@pseudo-rnd-thoughts I looked for the announcement about Gym's versioning roadmap that you mentioned last week, but I didn't see anything. No one else commented on limiting the tutorial to v25 or v26, so let me know if that code snippet is still what you think is the best way forward. If that's still the case, then I'll make those changes and add the logic for changing the number of episodes depending on CUDA availability. Hopefully, we can then finally close this PR.
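(The CUDA-dependent episode count would just be a simple switch, e.g. the sketch below; the exact numbers are whatever we settle on.)

```python
import torch

# More episodes when a GPU is available; fewer so CPU-only runs still finish
# in a reasonable time (the numbers here are illustrative).
if torch.cuda.is_available():
    num_episodes = 600
else:
    num_episodes = 50
```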
Currently, the PyTorch tutorials repo uses Gym v25. Edit: There exists a PR for updating the Mario RL repo to v26, #2069
…; hardware dependent # episodes w/ more writing
@malfet @SiftingSands I don't think that it is possible to support both v26 and v25. The CI is currently failing on
As #2069 is already updating to v26, I would merge #2069 first and then merge this PR. Therefore, I wouldn't bother with supporting both and would just support v26. Do you both agree, or do you think there is a better plan?
@pseudo-rnd-thoughts Sorry for the delayed response. I'm fine with waiting until #2069 is merged and only supporting v26. However, I'm not a maintainer of the pytorch tutorials repo, so @malfet would probably have the final say. I'm guessing there's a reason why this won't work, besides being ugly?
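(The snippet in question isn't shown above; purely for illustration, the kind of version-tolerant step being discussed might look like the hypothetical sketch below -- not the actual proposal.)

```python
def step_env(env, action):
    # Gym v26 returns (obs, reward, terminated, truncated, info);
    # v25 and earlier return (obs, reward, done, info).
    result = env.step(action)
    if len(result) == 5:
        obs, reward, terminated, truncated, info = result
    else:
        obs, reward, done, info = result
        terminated, truncated = done, False
    return obs, reward, terminated, truncated, info
```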
There shouldn't be an issue with that code; I just dislike it for being ugly, and #2069 is already planning on updating to v26.
I made the changes for v25, because nobody has replied to your last comment in #2069 yet. It looks like CI now passes (got tired of not seeing green). I'll leave it up to @malfet or @vmoens whether they want to merge this now or wait.
LGTM! Thanks for the effort @SiftingSands and @pseudo-rnd-thoughts
What's the holdup on merging this? @vmoens has already approved it.
Why isn't this being merged? I followed the current tutorial with no luck, but this modified version works perfectly.
Can you resolve the conflicts? I'll make sure that we merge it after that.
Conflict is resolved, but CI is failing on
Closed by #2145
Following up on the discussion from #2026
I still need to do multiple runs to get a semblance of the statistics of # episodes vs duration for both the original and my changes. The slight increase in model capacity still only uses ~1.5 GB of VRAM, so it should be pretty accessible and training is still relatively quick.
Here's the reward history for one run of these tweaks when I was doing a bunch of trial and error (spent an embarrassing amount of time tweaking hyperparameters and rewards).
@vmoens Feel free to change (or completely discard) anything based on your findings. I haven't tried tweaking anything else in the training pipeline.