Action masking (feature request) #77
Comments
This would be nice to have. One question is whether the mask should be part of the environment/MDP model or specified separately (or both). Currently, POMDPs.jl has |
For efficiently filtering the NN output, the action mask should probably be a boolean array, so |
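To make the boolean-array idea concrete, here is a minimal sketch of masked greedy action selection. It is written in plain Python purely for illustration, not in the package's Julia; `masked_argmax`, `q_values`, and `mask` are hypothetical names, and the only assumption is that the mask has one entry per network output.

```python
import math

def masked_argmax(q_values, mask):
    """Return the index of the largest Q-value among entries where mask is True.

    q_values and mask are hypothetical inputs: one Q-value and one boolean
    per action, in the same order as the network outputs.
    """
    if len(q_values) != len(mask):
        raise ValueError("q_values and mask must have the same length")
    best_index, best_value = None, -math.inf
    for i, (q, valid) in enumerate(zip(q_values, mask)):
        # Skip masked-out actions entirely, however large their Q-value is.
        if valid and (best_index is None or q > best_value):
            best_index, best_value = i, q
    if best_index is None:
        raise ValueError("mask excludes every action")
    return best_index
```

With Q-values `[1.0, 5.0, 3.0]` and mask `[True, False, True]`, the unmasked argmax would pick index 1, while the masked version picks index 2.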
Right. Having both efficiency and a simple interface is a difficult challenge. The question should always be "what is the best development path?" We should probably not allow efficiency concerns to prevent us from implementing the feature. That would be premature optimization, and it is not clear what fraction of the time would be taken up by masking anyway. A good maxim here is "make it work, make it right, make it fast" (in that order). It may be that
This is not as good as the bit array, but I would say also not egregious, and it can be optimized later if we find that it is a performance bottleneck. We also may be able to cache masks if the same states are encountered often. Bottom line, don't let efficiency concerns derail development 🙂 |
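A rough sketch of the "simple interface first, cache later" route, again in plain Python for illustration only. It assumes the environment can list the valid actions for a state through a user-supplied function; `make_mask_fn`, `all_actions`, and `valid_actions` are hypothetical names, and states are assumed to be hashable so they can be used as cache keys.

```python
def make_mask_fn(all_actions, valid_actions):
    """Build a function mapping a state to a boolean mask over all_actions.

    all_actions:   ordered collection of every action (hypothetical).
    valid_actions: user-supplied callable, state -> iterable of valid actions
                   (hypothetical; stands in for whatever the model exposes).
    """
    cache = {}

    def mask_for(state):
        # Compute the mask once per distinct state and reuse it afterwards.
        if state not in cache:
            valid = set(valid_actions(state))
            cache[state] = tuple(a in valid for a in all_actions)
        return cache[state]

    return mask_for
```

The mask is built by membership tests rather than handed over as a bit array, so the first lookup for a state is slower, but repeated visits to the same state only cost a dictionary lookup.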
To construct |
The replay buffer may need to be modified to store the mask array for each state. Looking at line 265 of solver.jl,
This computes the maximum of Therefore, some data structures in the current code would need to be modified to support masking. This may cause an overhead for solving models that do not use masking, though the overhead is likely small. |
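To illustrate why the buffer would need the mask, here is a hedged sketch of a masked one-step target. It assumes the standard DQN form, where the target takes a maximum over the next state's Q-values; it is plain Python for illustration and does not reproduce the code in solver.jl. `Transition`, `next_mask`, and `td_target` are hypothetical names.

```python
from typing import NamedTuple, Sequence

class Transition(NamedTuple):
    """One stored experience; next_mask is the hypothetical extra field."""
    state: tuple
    action: int
    reward: float
    next_state: tuple
    done: bool
    next_mask: Sequence[bool]  # valid actions in next_state

def td_target(transition, next_q_values, gamma):
    """Compute r + gamma * max over valid next actions of Q(s', a')."""
    if transition.done:
        return transition.reward
    valid_qs = [q for q, ok in zip(next_q_values, transition.next_mask) if ok]
    if not valid_qs:
        raise ValueError("next_mask excludes every action")
    return transition.reward + gamma * max(valid_qs)
```

The mask has to belong to the next state, not the current one, because the maximum is taken over actions available after the transition.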
I like the way you are thinking! Another option (with pros and cons) would be to maintain a separate data structure parallel to the main (s, a, sp, r) buffer that just holds masks. Pro: could be more easily disabled if the masks are not needed; Con: perhaps harder to maintain if we change the structure of the main buffer. |
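The parallel-structure option could look roughly like the following. This is an illustrative Python sketch with hypothetical names (`ReplayBuffer`, `store_masks`), not the package's buffer: masks live in a second list indexed the same way as the transitions, and the list is simply not allocated when masking is disabled.

```python
class ReplayBuffer:
    """Circular buffer with an optional parallel list of masks (hypothetical)."""

    def __init__(self, capacity, store_masks=False):
        self.capacity = capacity
        self.transitions = [None] * capacity
        # The mask storage exists only when masking is enabled.
        self.masks = [None] * capacity if store_masks else None
        self.next_index = 0
        self.size = 0

    def push(self, transition, next_mask=None):
        """Store a transition (and its mask, if enabled); return the slot used."""
        i = self.next_index
        self.transitions[i] = transition
        if self.masks is not None:
            if next_mask is None:
                raise ValueError("this buffer requires a mask for every transition")
            self.masks[i] = tuple(next_mask)
        self.next_index = (i + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)
        return i

    def get(self, i):
        """Return (transition, mask); mask is None when masking is disabled."""
        if not 0 <= i < self.size:
            raise IndexError("slot is empty or out of range")
        mask = self.masks[i] if self.masks is not None else None
        return self.transitions[i], mask
```

The maintenance cost mentioned above shows up in `push`: both lists must always be written at the same index, so any change to how the main buffer is indexed has to be mirrored for the masks.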
For example, consider a 2D maze environment. The state is the grid coordinates, and the actions are moving left, right, up or down, but the only valid actions are those that do not cross the wall in the maze. It would be nice if the user could specify which actions are valid. The invalid actions are "masked" and ignored by the NNPolicy action selection.
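As a concrete version of the maze example, here is a small illustrative sketch in plain Python (not the package's Julia). The grid size, the `walls` representation as a set of blocked cell pairs, and the choice to also treat moves off the grid as invalid are all assumptions made for this sketch.

```python
ACTIONS = ("left", "right", "up", "down")
MOVES = {"left": (-1, 0), "right": (1, 0), "up": (0, 1), "down": (0, -1)}

def maze_action_mask(state, width, height, walls):
    """Return one boolean per entry of ACTIONS; True means the move is valid.

    state: (x, y) grid coordinates as a tuple.
    walls: set of frozenset({cell_a, cell_b}) pairs that a wall separates
           (hypothetical representation).
    """
    x, y = state
    mask = []
    for action in ACTIONS:
        dx, dy = MOVES[action]
        target = (x + dx, y + dy)
        inside = 0 <= target[0] < width and 0 <= target[1] < height
        blocked = frozenset((state, target)) in walls
        mask.append(inside and not blocked)
    return mask
```

In a 2x2 grid with a wall between (0, 0) and (1, 0), the agent at (0, 0) can only move up: left and down leave the grid, and right crosses the wall.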