
Incorrect AIRL reward #4

Open
rohitrango opened this issue Dec 12, 2019 · 1 comment

Comments

@rohitrango

The paper suggests that the reward is given by f(s, a, s') - \log \pi(a | s) (which is the same as \log D - \log(1 - D)), but the reward in the repo is g(s, a).
What explains this discrepancy?
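
For reference, the equivalence in parentheses follows from the discriminator parameterization in the AIRL paper (a short derivation, not part of the original comment):

```latex
% AIRL discriminator (Fu et al., 2018):
%   D(s, a, s') = \frac{\exp f(s, a, s')}{\exp f(s, a, s') + \pi(a \mid s)}
% so that
\log D - \log(1 - D)
  = \log \frac{\exp f}{\exp f + \pi} - \log \frac{\pi}{\exp f + \pi}
  = f(s, a, s') - \log \pi(a \mid s).
```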

Owner

uidilr commented Dec 16, 2019

Hi rohitrango, thank you for the question!
The answers are as follows.

  1. The discriminator does not need to add -\log \pi to the reward, because the PPO algorithm already adds -\log \pi to the reward as an entropy term (see the sketch after this list).
  2. You can of course use f(s, a, s') as the reward, but I use g(s) as r(s) because the AIRL paper shows that r*(s) = g*(s) + const holds.
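
A minimal sketch of the two reward choices, with hypothetical names and values for illustration only (not the repo's actual API):

```python
# All quantities below are made-up placeholders, for illustration only.
gamma = 0.99
g_s, h_s, h_s_next = 0.8, 0.2, 0.3   # g(s), h(s), h(s') from the reward nets
log_pi = -0.7                        # log pi(a | s) under the current policy
ent_coef = 1.0                       # PPO entropy coefficient

# Option A: the full AIRL reward f(s, a, s') - log pi(a | s),
# with f(s, a, s') = g(s) + gamma * h(s') - h(s).
f = g_s + gamma * h_s_next - h_s
reward_full = f - log_pi

# Option B (this repo): use g(s) alone as the reward. The -log pi term
# is contributed by PPO's entropy bonus in the policy objective instead,
# and g(s) matches the true reward up to a constant: r*(s) = g*(s) + const.
reward_repo = g_s
entropy_bonus = ent_coef * (-log_pi)  # lives in the PPO objective, not the reward
```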

I hope this answers your questions.
