
[FEATURE] Implement self-play for two-player zero-sum games #103

Open · wants to merge 4 commits into main
Conversation

@RPegoud (Contributor) commented on Jul 15, 2024

Issue: #99

Description:
Add self-play versions of DQN and PPO for two-player zero-sum games in PGX environments.

Checklist:

  • Determine how to keep the value estimation consistent (e.g. flip the board, use a negative discount)
  • Add PGX environment configs
  • Implement self-play for DQN
  • Implement self-play for PPO
  • (optional) Implement self-play for AlphaZero, if possible

@RPegoud (Contributor, Author) commented on Jul 15, 2024

Hey, I finally have some time to work on this. It seems that PGX flips the board at each step, so the observations should be consistent between co-players:

def _step(state: State, action: Array):
    ...
    state = _apply_move(state, a)  # `a` comes from the action decoding elided above
    state = _flip(state)  # flip the board so the next player always sees it from their own perspective
    ...
    state = _update_history(state)
    state = state.replace(legal_action_mask=_legal_action_mask(state))  # type: ignore
    state = _check_termination(state)
    return state

For the values, using a negative discount should do the trick; this seems to be the approach used in DeepMind's mctx library.
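
As a rough illustration (names like dqn_target and q_next are placeholders, not part of Stoix or mctx), negating the discount flips the sign of the bootstrapped value, so the target stays expressed from the perspective of the player to move:

import jax.numpy as jnp

def dqn_target(q_next, reward, done, gamma=0.99):
    """One-step DQN target with a negated discount for turn-based zero-sum play.

    q_next : Q-values of the next state, which the opponent sees and acts in.
    reward : reward obtained by the player who just acted.
    done   : termination flag for the episode.
    """
    # The opponent moves next, so their value is the negative of ours;
    # multiplying the discount by -1 keeps the target in our perspective.
    discount = -gamma * (1.0 - done)
    return reward + discount * jnp.max(q_next)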

The main question I have at the moment is how to evaluate the agent's performance when training with self-play. One option would be to train for N steps, collect M parameter checkpoints, and measure the performance of all checkpoints against each other in a round-robin setting, or something similar. What do you think?
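
Something like the following sketch, where play_match is a hypothetical evaluation rollout in the PGX environment (not an existing function) that returns 1 if the first player wins and 0 otherwise:

import itertools
import numpy as np

def round_robin(checkpoints, play_match, games_per_pair=32):
    """Fill a win-rate matrix by playing every ordered pair of checkpoints."""
    n = len(checkpoints)
    win_rate = np.zeros((n, n))
    for i, j in itertools.permutations(range(n), 2):
        wins = sum(
            play_match(first=checkpoints[i], second=checkpoints[j])
            for _ in range(games_per_pair)
        )
        win_rate[i, j] = wins / games_per_pair
    return win_rate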

@EdanToledo (Owner) commented:

So this is a good question. I like the idea of doing a round robin like that, but maybe the easiest thing to do for now is simply to evaluate against the last checkpoint from one iteration of training earlier. We can do the round-robin style afterwards, I think.
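
For example, a minimal sketch of that scheme (play_match and the parameter names are hypothetical, as above):

def evaluate_vs_last_checkpoint(current_params, previous_params, play_match, num_games=64):
    """Return the current agent's win rate against the previous checkpoint."""
    wins = sum(
        play_match(first=current_params, second=previous_params)
        for _ in range(num_games)
    )
    return wins / num_games

After each training iteration, the current parameters would be compared against the checkpoint saved at the end of the previous iteration, which then becomes the new baseline.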
