
Incorrect AIRL reward #4

Open
rohitrango opened this issue Dec 12, 2019 · 1 comment

Comments

@rohitrango

The paper suggests that the reward is given by f(s, a, s') - \log \pi(a | s) (which is the same as \log D - \log(1 - D)), but the reward in the repo is g(s, a).
What explains this discrepancy?
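
For reference, the equivalence in parentheses follows from the discriminator parameterization in the AIRL paper (a short derivation, not part of the original comment):

```latex
% AIRL discriminator (Fu et al., 2018):
%   D(s, a, s') = \frac{\exp f(s, a, s')}{\exp f(s, a, s') + \pi(a \mid s)}
% so that
\log D - \log(1 - D)
  = \log \frac{\exp f}{\exp f + \pi} - \log \frac{\pi}{\exp f + \pi}
  = f(s, a, s') - \log \pi(a \mid s).
```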

Owner

uidilr commented Dec 16, 2019

Hi rohitrango, thank you for the question!
The answers are as follows.

  1. The discriminator does not need to add -\log \pi to the reward, because the PPO algorithm already adds -\log \pi to the reward as an entropy term (see the sketch after this list).
  2. You can of course use f(s, a, s') as the reward, but I use g(s) as r(s) because the AIRL paper shows that r*(s) = g*(s) + const holds.
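
A minimal sketch of the two reward choices, with hypothetical names and values for illustration only (not the repo's actual API):

```python
# All quantities below are made-up placeholders, for illustration only.
gamma = 0.99
g_s, h_s, h_s_next = 0.8, 0.2, 0.3   # g(s), h(s), h(s') from the reward nets
log_pi = -0.7                        # log pi(a | s) under the current policy
ent_coef = 1.0                       # PPO entropy coefficient

# Option A: the full AIRL reward f(s, a, s') - log pi(a | s),
# with f(s, a, s') = g(s) + gamma * h(s') - h(s).
f = g_s + gamma * h_s_next - h_s
reward_full = f - log_pi

# Option B (this repo): use g(s) alone as the reward. The -log pi term
# is contributed by PPO's entropy bonus in the policy objective instead,
# and g(s) matches the true reward up to a constant: r*(s) = g*(s) + const.
reward_repo = g_s
entropy_bonus = ent_coef * (-log_pi)  # lives in the PPO objective, not the reward
```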

I hope this answers your questions.
