Adaptive normalization for unknown reward sizes #88


Open
wants to merge 5 commits into master

Conversation

casper2002casper

In games with unknown reward sizes, MCTS exploration is hindered because the ratio between the Q term and the exploration term varies. To deal with this, rewards are scaled based on the best reward found during the search so that they fall between -1 and 1. Likewise, the value function is scaled so that it learns normalized value estimates.
The scaling uses the maximum absolute value found so far, which maps all values to [-1, 1] while keeping zero rewards at 0.
Lastly, I'm quite new to Julia, so I'm open to suggestions for improving the code.
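
For illustration, here is a minimal sketch of the scheme described above (hypothetical names, not the actual code in this PR): a running maximum of the absolute values seen during search is used as the normalization factor, so everything lands in [-1, 1] and zero stays at 0.

```julia
# Minimal sketch (hypothetical names): track the largest |value| seen so far
# and divide by it, so values map to [-1, 1] and zero rewards stay at 0.
mutable struct MaxAbsNormalizer
    scale::Float64
end

MaxAbsNormalizer() = MaxAbsNormalizer(1.0)  # factor initialized at 1, see discussion below

function update!(n::MaxAbsNormalizer, r::Real)
    n.scale = max(n.scale, abs(r))
    return n
end

normalize(n::MaxAbsNormalizer, r::Real) = r / n.scale

# Example with rewards of unknown magnitude:
n = MaxAbsNormalizer()
foreach(r -> update!(n, r), (0.0, 3.0, -12.0, 5.0))
normalize(n, 3.0)  # 0.25
normalize(n, 0.0)  # 0.0 (zero rewards stay at zero)
```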

@casper2002casper
Author

casper2002casper commented Jan 6, 2022

Gridworld might not be the best example to demonstrate its use, as the maximum reward is known; still, comparing the results of adaptive normalization with the old standard parameters shows an improvement.
Original parameters:
[plot: benchmark_reward]
Adaptive reward normalization:
[plot: benchmark_reward_adaptive]
This is because the reward term overpowers the exploration term with the original parameters, while adaptive reward normalization is able to fix it.

@jonathan-laurent
Owner

The code looks good!

I am aware of the problem of non-normalized rewards, which is why I had introduced the more manual normalize_rewards hyperparameter. However, I am unsure about your adaptive normalization scheme as it may result in different samples having their rewards normalized differently, therefore providing a moving target for the network. Also, what happens when typical rewards are too small instead of too big? In this case, I guess the adaptive scheme would not work as you initialize the normalizing factor at 1.

@casper2002casper
Author

I am aware of the problem of non-normalized rewards, which is why I had introduced the more manual normalize_rewards hyperparameter.

Yeah, I see. For cases with a known maximum reward, that in combination with cpuct fixes the problem as well (a cpuct of around 15 seems to be right for the gridworld problem).

However, I am unsure about your adaptive normalization scheme as it may result in different samples having their rewards normalized differently, therefore providing a moving target for the network.

Yes, this is a bit of a downside, as bad solutions could be found that seem optimal while the network isn't trained much yet. However, the exploration of the MCTS should still guide the search towards better solutions than the ones the network points to, which would in time improve the network and the sample quality.

Also, what happens when typical rewards are too small instead of too big? In this case, I guess the adaptive scheme would not work as you initialize the normalizing factor at 1.

You're right, that should be fixed.
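
For illustration only, and not necessarily the actual fix in this PR: reusing the hypothetical MaxAbsNormalizer sketch from the PR description, one option is to start the factor near zero instead of at 1, so that rewards much smaller than 1 also get rescaled to fill [-1, 1].

```julia
# Hypothetical adjustment: initialize the factor near zero instead of at 1,
# so small-magnitude rewards are scaled up as well.
n = MaxAbsNormalizer(eps())                        # instead of MaxAbsNormalizer(1.0)
foreach(r -> update!(n, r), (0.004, -0.01, 0.007))
normalize(n, -0.01)  # -1.0 (would stay at -0.01 if the factor were stuck at 1)
```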

@jonathan-laurent
Owner

jonathan-laurent commented Jan 7, 2022

Is this change only about saving the user the effort of figuring out the right normalization factor for the rewards? If this is the case, I believe the added complexity and the potential learning instability do not justify an inclusion in the main library.

However, I am starting to wonder if this change may not benefit learning in some environments where only small rewards are initially available to a weak agent for a long time before it becomes good enough to collect bigger ones. In these cases, it might be sound to have the normalization factor evolve over time to adapt to the learning agent. Are you aware of any RL library or publication featuring this? If not, and you can demonstrate a significant gain from using an adaptive reward normalization factor in a concrete environment, this may get you a nice blog article. :-)

@casper2002casper
Author

No, the intended use is only for cases where the maximum reward is unknown or varies per problem instance. In those cases you can't use a static normalization factor, as you either don't know it or it varies with the parameters of the problem instance. The gridworld isn't the best example for demonstration, as the maximum reward is known (10). I guess a better example would be the gridworld problem with randomly scaled rewards.
The intended use case is mostly combinatorial optimization problems. You would want to train an agent that is able to solve randomly generated problem instances; however, you don't know beforehand what the range of rewards is, so you'd want to scale them adaptively.
I'm actually doing my thesis about this topic, so I'll update this PR once I get some results from my work.
The scaling is based on this paper, but I use the maximum absolute value to scale the rewards to [-1, 1] instead, while keeping zero rewards at 0.0.
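
To illustrate the difference, here is a small sketch of the two schemes (assuming the paper's scheme is a min-max rescaling; the names and values are made up): min-max maps values to [0, 1] and shifts zero, whereas max-abs maps them to [-1, 1] and keeps zero fixed.

```julia
vals = (-4.0, 0.0, 2.0)

# Min-max rescaling (assumed to be the paper's scheme): maps to [0, 1], moves zero.
lo, hi = extrema(vals)
minmax_scaled = [(v - lo) / (hi - lo) for v in vals]  # [0.0, 0.667, 1.0]

# Max-abs rescaling (this PR's approach): maps to [-1, 1], keeps zero at 0.
m = maximum(abs, vals)
maxabs_scaled = [v / m for v in vals]                 # [-1.0, 0.0, 0.5]
```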

@jonathan-laurent
Owner

Good luck with your thesis. I am looking forward to hearing more about it!

Also, I understand the motivation for your proposal now. I am leaving this PR open for now but I am interested in merging it after we have more experimental results.
