
LOLA breaks when changing number of actions and/or states #6

Open
jazzbob opened this issue Oct 11, 2018 · 5 comments

Comments

@jazzbob

jazzbob commented Oct 11, 2018

I am trying to modify IPD to a setup with four actions. This yields a 4x4 payoff matrix and a 17-dimensional input, which breaks both the LOLA and LOLA-DiCE implementations.

  • train_exact.py assumes that NUM_ACTIONS = 4 and NUM_STATES = 5 in the environment.
  • Doesn't the number of states also depend on the number of actions, i.e. NUM_STATES = NUM_ACTIONS ** 2 + 1 (see the sketch below)?
  • Since the payoff matrix for agent 2 is simply the transpose of agent 1's: does the game have to be symmetric, or are different payoffs per agent possible in the current implementation?
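For concreteness, here is a minimal sketch of the relationship I have in mind for my four-action variant (the constant names are illustrative and the payoff values are placeholders, not the actual contents of train_exact.py):

import numpy as np

NUM_ACTIONS = 4                        # actions per agent in my modified IPD
NUM_STATES = NUM_ACTIONS ** 2 + 1      # all joint last-action pairs plus the initial state

# Placeholder 4x4 payoff matrix for agent 1.
payoff_1 = np.arange(NUM_ACTIONS ** 2, dtype=np.float32).reshape(NUM_ACTIONS, NUM_ACTIONS)
# Symmetric game as in the current implementation: agent 2 gets the transpose.
payoff_2 = payoff_1.T

assert NUM_STATES == 17                # matches the 17-dimensional input mentioned above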
@jakobnicolaus
Collaborator

jakobnicolaus commented Oct 11, 2018

Thanks for the comment! If you send a pull request with a fix, we can include the generalisation to an arbitrary number of actions.

Also, the payoff matrix of the second agent is currently the transpose for convenience reasons. Feel free to change it to account for non-symmetric payoffs.
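If it helps, one possible way to allow non-symmetric payoffs would be to accept two independent matrices and only fall back to the transpose when the second one is not given. This is just a sketch with illustrative names, not code from the repo:

import numpy as np

def make_payoffs(payoff_1, payoff_2=None):
    """Return per-agent payoff matrices; default to the symmetric (transposed) case."""
    payoff_1 = np.asarray(payoff_1, dtype=np.float32)
    # Current behaviour: agent 2's payoffs are the transpose of agent 1's.
    payoff_2 = payoff_1.T if payoff_2 is None else np.asarray(payoff_2, dtype=np.float32)
    assert payoff_1.shape == payoff_2.shape, "both agents need payoffs of the same shape"
    return payoff_1, payoff_2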

@jazzbob
Author

jazzbob commented Oct 12, 2018

Thanks for your prompt answer. I'll send a PR when I get there, though I may lack the time and/or ability. :)

@jakobnicolaus
Collaborator

Great - let us know if you get stuck.

@jazzbob
Author

jazzbob commented Oct 12, 2018

For DiCE, there seems to be an issue in policy.py, lines 105 to 115. The logits created by Sonnet's BatchApply (line 106) use the default n_dims=2; changing that raises an exception inside the Sonnet code, which I have not dug into yet.

For now, I worked around this by changing the logits in policy.py. While this seems to work syntactically (no exceptions at runtime, and at least simple baselines are learned), I wonder whether the policy semantics are preserved. Specifically, I am unsure about line 109. What is happening here?

logits = tf.concat([logits, tf.zeros_like(logits)], -1)

Do you see an option to get the logits in a shape that represents the number of actions in the environment without setting n_dims for Sonnet's BatchApply?

On a side note, I removed the masking (lines 113 to 115), as in my case all actions are available to the agents at all times. Would this harm DiCE's semantics?

@jakobnicolaus
Collaborator

Sorry for the delay on our end.

It would be good to keep the masking in the code, since our environments in principle support this.
You only need one parameter to parameterise a binary random variable, but the sampling expects two logits, one per option. That's what the line you are referring to is for.
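For concreteness, a small NumPy check of that equivalence (illustrative, not the repo's code): a softmax over the pair [logit, 0] assigns the same probability as a sigmoid of the single logit, so the zero-padding just re-expresses the one-parameter Bernoulli in the two-logit form the sampler expects.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

logit = np.array([[0.7]])                                   # one logit per binary action
two_logits = np.concatenate([logit, np.zeros_like(logit)], axis=-1)

p_from_softmax = softmax(two_logits)[..., 0]                # probability of the first option
p_from_sigmoid = 1.0 / (1.0 + np.exp(-logit[..., 0]))       # sigmoid of the single logit
assert np.allclose(p_from_softmax, p_from_sigmoid)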

I think the easiest way to go about all of this is to make the weights of size (number of actions) x (number of states). Does that make sense?
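A rough sketch of what I mean, in plain NumPy rather than the actual policy.py/Sonnet code: with a weight matrix of shape (number of states, number of actions), each state already yields one logit per action, so no zero-padding of the logits is needed.

import numpy as np

num_actions = 4
num_states = num_actions ** 2 + 1

# One row of logits per state; a softmax turns each row into an action distribution.
theta = np.random.randn(num_states, num_actions).astype(np.float32)

def policy(state_one_hot):
    logits = state_one_hot @ theta                     # shape: (num_actions,)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

state = np.eye(num_states, dtype=np.float32)[0]        # one-hot encoding of state 0
assert np.isclose(policy(state).sum(), 1.0)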
