Adaptive normalization for unknown reward sizes #88
Conversation
The code looks good! I am aware of the problem of non-normalized rewards, which is why I had introduced the more manual …
Yeah, I see. For cases with known maximum rewards, that in combination with cpuct fixes the problem as well (a cpuct of around 15 seems to be right for the gridworld problem).
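For context, a minimal sketch of the PUCT selection rule, assuming the standard AlphaZero-style formulation rather than this repository's exact code, which makes the interplay between the reward scale and cpuct explicit:

```julia
# Assumed standard PUCT rule (illustration only, not quoted from this library):
# the action score combines the value estimate Q with an exploration bonus
# weighted by cpuct.
puct_score(Q, P, N_parent, N_a, cpuct) =
    Q + cpuct * P * sqrt(N_parent) / (1 + N_a)

# With rewards normalized into [-1, 1], a moderate cpuct balances the two
# terms. With raw gridworld rewards up to 10, a much larger cpuct (around 15,
# as noted above) is needed to keep the exploration term competitive.
```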
Yes, this is a bit of a downside, as bad solutions could be found that seem optimal while the network isn't trained much yet. However, the exploration of the MCTS should still guide towards better solutions than the ones the network points to, which would in time improve the network and the sample quality.
You're right, that should be fixed.
Is this change only about saving the user the effort of figuring out the right normalization factor for the rewards? If this is the case, I believe the added complexity and the potential learning instability do not justify inclusion in the main library. However, I am starting to wonder whether this change may benefit learning in some environments where only small rewards are initially available to a weak agent for a long time, before it becomes good enough to collect bigger ones. In these cases, it might be sound to have the normalization factor evolve over time to adapt to the learning agent. Are you aware of any RL library or publication featuring this? If not, and you can demonstrate a significant gain from using an adaptive reward normalization factor in a concrete environment, this may get you a nice blog article. :-)
No, the intended use is only for cases where the maximum reward is unknown or varies per problem instance. In that case you can't use a static normalization factor, as you either don't know it or it depends on the parameters of the problem instance. The gridworld isn't the best example for a demonstration, as its maximum reward is known (10). I guess a better example would be the gridworld problem with randomly scaled rewards.
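A hedged sketch of what such a benchmark could look like; the wrapper and its names are hypothetical and not part of the library:

```julia
# Hypothetical wrapper: each problem instance draws an unknown reward scale,
# so no single static normalization constant is correct across instances.
struct ScaledRewards{E}
    env::E          # the underlying gridworld environment
    scale::Float64  # per-instance reward scale, unknown to the agent
end

# Draw a scale factor between 0.1 and 10 for each new instance.
ScaledRewards(env) = ScaledRewards(env, 10.0 ^ (2 * rand() - 1))

# Any reward produced by `env` would be multiplied by `scale` before being
# returned to the agent.
scale_reward(w::ScaledRewards, r) = w.scale * r
```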
Good luck with your thesis. I am looking forward to hearing more about it! Also, I understand the motivation for your proposal now. I am leaving this PR open for now but I am interested in merging it after we have more experimental results. |
In games with unknown reward sizes, MCTS exploration is hindered because the ratio between the Q term and the exploration term varies. To deal with this, the rewards are rescaled into [-1, 1] based on the best reward found during the search. Likewise, the value function is scaled to learn normalized value estimates.
The scaling is based on the maximum absolute value found so far, so that all values fall between -1 and 1 while zero rewards stay at 0.
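A minimal sketch of this scaling rule (my own illustration, not the exact code of the PR):

```julia
# Track the largest absolute reward seen so far and divide by it, which maps
# all values into [-1, 1] and leaves zero rewards at 0.
mutable struct AdaptiveNormalizer
    max_abs::Float64
end

AdaptiveNormalizer() = AdaptiveNormalizer(0.0)

# Update the running maximum with a newly observed reward.
function update!(n::AdaptiveNormalizer, reward::Real)
    n.max_abs = max(n.max_abs, abs(reward))
    return n
end

# Scale a raw value; before any nonzero reward has been seen, values pass
# through unchanged.
normalize(n::AdaptiveNormalizer, x::Real) =
    n.max_abs > 0 ? x / n.max_abs : x
```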
Lastly, I'm quite new to Julia, so I'm open to suggestions on improving the code.