Adaptive normalization for unknown reward sizes #88


Open
wants to merge 5 commits into master

Conversation

casper2002casper

In games with unknown reward sizes, MCTS exploration is hindered because the ratio between the Q term and the exploration term varies. To deal with this, rewards are scaled based on the best reward found during the search so that they fall between -1 and 1. Likewise, the value function is scaled so that it learns normalized value estimates.
The scaling uses the maximum absolute value found so far, which maps all values to [-1, 1] while keeping zero rewards at 0.
Lastly, I'm quite new to Julia, so I'm open to suggestions for improving the code.
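
For illustration, here is a minimal sketch of the scheme described above (hypothetical names, not the actual code in this PR): a running maximum of the absolute values seen during search is used as the normalization factor, so everything lands in [-1, 1] and zero stays at 0.

```julia
# Minimal sketch (hypothetical names): track the largest |value| seen so far
# and divide by it, so values map to [-1, 1] and zero rewards stay at 0.
mutable struct MaxAbsNormalizer
    scale::Float64
end

MaxAbsNormalizer() = MaxAbsNormalizer(1.0)  # factor initialized at 1, see discussion below

function update!(n::MaxAbsNormalizer, r::Real)
    n.scale = max(n.scale, abs(r))
    return n
end

normalize(n::MaxAbsNormalizer, r::Real) = r / n.scale

# Example with rewards of unknown magnitude:
n = MaxAbsNormalizer()
foreach(r -> update!(n, r), (0.0, 3.0, -12.0, 5.0))
normalize(n, 3.0)  # 0.25
normalize(n, 0.0)  # 0.0 (zero rewards stay at zero)
```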

@casper2002casper
Author

casper2002casper commented Jan 6, 2022

Gridworld might not be the best example to demonstrate its use, as the maximum reward is known; still, comparing the results of adaptive normalization with the old standard parameters shows an improvement.
Original parameters:
[plot: benchmark_reward]
Adaptive reward normalization:
[plot: benchmark_reward_adaptive]
This is because the reward term overpowers the exploration term with the original parameters, while adaptive reward normalization is able to fix it.

@jonathan-laurent
Owner

The code looks good!

I am aware of the problem of non-normalized rewards, which is why I had introduced the more manual normalize_rewards hyperparameter. However, I am unsure about your adaptive normalization scheme as it may result in different samples having their rewards normalized differently, therefore providing a moving target for the network. Also, what happens when typical rewards are too small instead of too big? In this case, I guess the adaptive scheme would not work as you initialize the normalizing factor at 1.

@casper2002casper
Author

I am aware of the problem of non-normalized rewards, which is why I had introduced the more manual normalize_rewards hyperparameter.

Yeah, I see. For cases with a known maximum reward, that in combination with cpuct fixes the problem as well (a cpuct of around 15 seems to be right for the gridworld problem).

However, I am unsure about your adaptive normalization scheme as it may result in different samples having their rewards normalized differently, therefore providing a moving target for the network.

Yes, this is a bit of a downside, as bad solutions could be found that seem optimal while the network isn't trained much yet. However, the exploration of the MCTS should still guide the search towards better solutions than the ones the network points to, which would in time improve the network and the sample quality.

Also, what happens when typical rewards are too small instead of too big? In this case, I guess the adaptive scheme would not work as you initialize the normalizing factor at 1.

You're right, that should be fixed.
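
For illustration only, and not necessarily the actual fix in this PR: reusing the hypothetical MaxAbsNormalizer sketch from the PR description, one option is to start the factor near zero instead of at 1, so that rewards much smaller than 1 also get rescaled to fill [-1, 1].

```julia
# Hypothetical adjustment: initialize the factor near zero instead of at 1,
# so small-magnitude rewards are scaled up as well.
n = MaxAbsNormalizer(eps())                        # instead of MaxAbsNormalizer(1.0)
foreach(r -> update!(n, r), (0.004, -0.01, 0.007))
normalize(n, -0.01)  # -1.0 (would stay at -0.01 if the factor were stuck at 1)
```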

@jonathan-laurent
Owner

jonathan-laurent commented Jan 7, 2022

Is this change only about saving the user the effort of figuring out the right normalization factor for the rewards? If this is the case, I believe the added complexity and the potential learning instability do not justify an inclusion in the main library.

However, I am starting to wonder if this change may not benefit learning in some environments where only small rewards are initially available to a weak agent for a long time before it becomes good enough to collect bigger ones. In these cases, it might be sound to have the normalization factor evolve over time to adapt to the learning agent. Are you aware of any RL library or publication featuring this? If not, and you can demonstrate a significant gain from using an adaptive reward normalization factor in a concrete environment, this may get you a nice blog article. :-)

@casper2002casper
Author

No, the intended use is only for cases where the maximum reward is unknown or varies per problem instance. In those cases you can't use a static normalization factor, as you either don't know it or it varies with the parameters of the problem instance. The gridworld isn't the best example for demonstration, as the maximum reward is known (10). I guess a better example would be the gridworld problem with randomly scaled rewards.
The intended use case is mostly combinatorial optimization problems. You would want to train an agent that is able to solve randomly generated problem instances; however, you don't know beforehand what the range of rewards is, so you'd want to scale them adaptively.
I'm actually doing my thesis about this topic, so I'll update this PR once I get some results from my work.
The scaling is based on this paper, but I use the maximum absolute value to scale the rewards to [-1, 1] instead, while keeping zero rewards at 0.0.
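
To illustrate the difference, here is a small sketch of the two schemes (assuming the paper's scheme is a min-max rescaling; the names and values are made up): min-max maps values to [0, 1] and shifts zero, whereas max-abs maps them to [-1, 1] and keeps zero fixed.

```julia
vals = (-4.0, 0.0, 2.0)

# Min-max rescaling (assumed to be the paper's scheme): maps to [0, 1], moves zero.
lo, hi = extrema(vals)
minmax_scaled = [(v - lo) / (hi - lo) for v in vals]  # [0.0, 0.667, 1.0]

# Max-abs rescaling (this PR's approach): maps to [-1, 1], keeps zero at 0.
m = maximum(abs, vals)
maxabs_scaled = [v / m for v in vals]                 # [-1.0, 0.0, 0.5]
```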

@jonathan-laurent
Owner

Good luck with your thesis. I am looking forward to hearing more about it!

Also, I understand the motivation for your proposal now. I am leaving this PR open for now but I am interested in merging it after we have more experimental results.
