
XOXO²

xoxo

Introduction

The project was created during the final project phase of the Data Science Bootcamp at the Spiced Academy in Berlin in November 2023.
The project goal was to use Reinforcement Learning to teach an agent how to play Ultimate Tic-Tac-Toe (U_T-T-T).
In this group project, we first created a Monte Carlo Tree Search (MCTS) algorithm from scratch. Then we used an Artificial Neural Network (ANN) to implement Proximal Policy Optimization (PPO) and improve the performance of our agent.
Additionally, we implemented an interactive interface with Flask and HTML as a final product, where the user can play Ultimate Tic-Tac-Toe against our engine.

Project members:
Lukas Kaufmann, Paul Kotyrba, Nils Schillmann, Sreejith Thamban.
Thanks to Jannes Brunner.

Rules of Ultimate Tic-Tac-Toe

rules

Ultimate Tic-Tac-Toe (U_T-T-T) is played on nine tic-tac-toe boards arranged in a 3 × 3 grid.
Playing on a field inside a board (a local game) determines the board in which the opponent must play their next move.
The goal is to win three boards (local games) in a row.
If you are directed to a board that is full or has already been won (and is therefore closed), you may play your next move on any open board.

draw
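To make this forwarding rule concrete, here is a minimal Python sketch (not code from this repository; the function name and board indexing are illustrative) of how the allowed boards for the next move could be derived:

# Minimal sketch of the U_T-T-T forwarding rule (illustrative, not code from this repository).
# Boards and fields are indexed 0-8; closed_boards contains boards that are full or already won.
from typing import Optional

def allowed_boards(last_field: Optional[int], closed_boards: set) -> list:
    """Return the boards in which the next move may be played."""
    if last_field is None or last_field in closed_boards:
        # First move of the game, or the targeted board is closed:
        # the player may choose any open board.
        return [b for b in range(9) if b not in closed_boards]
    # Otherwise the next move is forced onto the board whose index
    # matches the field that was just played.
    return [last_field]

# Example: the last move was placed on field 4, but board 4 is already won.
print(allowed_boards(4, closed_boards={4, 7}))   # -> all open boards except 4 and 7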

Interface Example

gif

About Reinforcement Learning

Reinforcement Learning is not only used in gaming environments; it also has use cases in robotics, self-driving cars, and the development of generative AI. We were interested in exploring this topic, since it was not part of our bootcamp curriculum and has gained more and more importance over the years.

Model types

mcts+ppo

As mentioned before, we first created a Monte Carlo Tree Search (MCTS) algorithm from scratch and additionally implemented a memory file of results from former iterations that the model can load and learn from.
As a second model, we created a neural network structure to perform Proximal Policy Optimization (PPO).
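As an illustration of what persisting such a memory file could look like, here is a minimal pickle-based sketch; the key/value layout is an assumption, and the actual content of mcts_memory_small.pkl in this repository may differ:

# Illustrative sketch of persisting MCTS statistics between runs with pickle.
# The layout shown here is an assumption; the actual content of
# data/models/mcts/mcts_memory_small.pkl may differ.
import pickle
from pathlib import Path

MEMORY_PATH = Path("data/models/mcts/mcts_memory_small.pkl")

def save_memory(memory, path=MEMORY_PATH):
    # memory could map a hashable game state to (visit_count, total_reward)
    with path.open("wb") as f:
        pickle.dump(memory, f)

def load_memory(path=MEMORY_PATH):
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    return {}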

MCTS

mcts

The MCTS works as follows: one iteration consists of selecting a move, expanding and exploring the game tree (every node in the tree is a possible game state or move that can be played), and simulating the outcome from there by playing random moves. The outcome of this playout is then backpropagated to the top along the path that was played, and the visit counts and reward values of these nodes are updated. To make these results accessible in future runs, we implemented a memory function; the memory file grows large in size quickly, though.
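A simplified sketch of one such iteration is shown below (this is not the repository's mcts.py; the node structure and the helpers legal_moves, apply_move and rollout are assumed for illustration):

# Simplified sketch of one MCTS iteration: selection -> expansion -> simulation -> backpropagation.
# Not the repository's mcts.py; the node structure and the helper functions
# legal_moves(state), apply_move(state, move) and rollout(state) are assumed.
import math
import random

class Node:
    def __init__(self, state, parent=None):
        self.state = state        # game state represented by this node
        self.parent = parent
        self.children = []
        self.visits = 0           # visit count
        self.reward = 0.0         # accumulated reward

    def uct(self, c=1.41):
        # Upper Confidence Bound for Trees, used during selection.
        if self.visits == 0:
            return float("inf")
        return self.reward / self.visits + c * math.sqrt(math.log(self.parent.visits) / self.visits)

def mcts_iteration(root, legal_moves, apply_move, rollout):
    # 1. Selection: walk down the tree, always picking the child with the highest UCT value.
    node = root
    while node.children:
        node = max(node.children, key=Node.uct)
    # 2. Expansion: add one child node per legal move and pick one of them.
    for move in legal_moves(node.state):
        node.children.append(Node(apply_move(node.state, move), parent=node))
    if node.children:
        node = random.choice(node.children)
    # 3. Simulation: play random moves to the end of the game (e.g. +1 win, 0 draw, -1 loss).
    outcome = rollout(node.state)
    # 4. Backpropagation: update visit counts and rewards along the played path
    #    (per-player sign handling is omitted in this sketch).
    while node is not None:
        node.visits += 1
        node.reward += outcome
        node = node.parent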

PPO

ppo

In the PPO model, our agent performs an action (a move) in the environment (the game) and receives a new game state as well as rewards for its actions. The actor network predicts probabilities for each move, and the critic network estimates the value of the given board state. Compared to the MCTS memory file, the files for the two networks are very small, which makes them much more practical to load into the frontend / playable gaming interface.
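A minimal PyTorch sketch of such an actor-critic pair might look as follows (layer sizes and class names are illustrative and not taken from src/agents/network/network.py):

# Minimal PyTorch sketch of an actor-critic pair for U_T-T-T (81 fields / 81 possible moves).
# Layer sizes and class names are illustrative, not taken from src/agents/network/network.py.
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim=81, n_actions=81, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        # Probability for each of the 81 moves (illegal moves still need to be masked).
        return torch.softmax(self.net(state), dim=-1)

class Critic(nn.Module):
    def __init__(self, state_dim=81, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        # Estimated value of the given board state.
        return self.net(state)

# Trained weights such as data/models/ppo/ppo_v1_actor.pth could then be loaded with:
# actor = Actor(); actor.load_state_dict(torch.load("data/models/ppo/ppo_v1_actor.pth"))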

Model evaluation

round1

In order to test the performance of our agents, we let them play U_T-T-T against our baseline model, which performs random moves, as well as against each other.
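Schematically, such a head-to-head evaluation could look like the following sketch (the Game and agent interfaces shown here are assumed, not taken from the repository's playout.py):

# Schematic head-to-head evaluation loop; the Game and agent interfaces
# (is_over, current_player, choose_move, play, winner) are assumed,
# not taken from the repository's playout.py.
from collections import Counter

def evaluate(agent_x, agent_o, game_factory, n_games=100):
    results = Counter()
    for _ in range(n_games):
        game = game_factory()
        players = {1: agent_x, -1: agent_o}      # X plays as 1, O as -1
        while not game.is_over():
            move = players[game.current_player].choose_move(game)
            game.play(move)
        results[game.winner()] += 1              # 1, -1, or 0 for a draw
    return results

# e.g. evaluate(ppo_mcts_agent, mcts_with_memory_agent, Game, n_games=500)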

results1

Finally, we put our two strongest agents - MCTS with memory and PPO+MCTS - in the ring to play against each other.

round2

The final results show that our PPO+MCTS agent has the strongest performance, winning about 60% of its games against MCTS with memory.

results2

It is therefore our current strongest agent and the one connected to the interface.

Files and Folders

The following tree-like folder structure diagram provides an overview of our repository:

.
├── data/
│   ├── eda/
│   │   └── ... ->  files for 'model_analysis.ipynb'
│   ├── local/
│   │   └── ... ->  local files, will not be uploaded to repository
│   └── models/
│       ├── mcts/
│       │   └── mcts_memory_small.pkl
│       └── ppo/
│           ├── ppo_v1_actor.pth
│           └── ppo_v1_critic.pth
├── frontend/
│   └── ds-uttt/
│       └── ... ->  interface setup
├── images/
│   └── ... ->  files for 'README.md'
├── logs
├── src/
│   ├── agents/
│   │   ├── network/
│   │   │   └── network.py
│   │   ├── agent.py
│   │   ├── human.py
│   │   ├── mcts.py
│   │   ├── ppo.py
│   │   └── random.py
│   ├── environments/
│   │   ├── game
│   │   └── uttt_env
│   ├── old_files/
│   │   └── ... ->  outdated files for comprehension purposes
│   ├── tests/
│   │   ├── test_game.py
│   │   └── test_mcts.py
│   ├── main_flask.py
│   ├── playout.py
│   ├── test_ppo.py
│   ├── train_mcts.py
│   └── train_ppo.py
├── environment.yml
├── model_analysis.ipynb
└── README.md

Setup Environment

To set up the environment with Anaconda / Mamba:

conda env create --file environment.yml

To update the environment:

conda env update --file environment.yml

To activate the environment:

conda activate ds-uttt

Setup Interface

TO RUN GAME (backend):

  1. Use the command: "flask --debug --app main_flask run" (Note: make sure you are in the correct folder)

  2. Open a new terminal window and go to ../frontend/ds-uttt

TO RUN GAME (frontend):

  3. Use the command: "npm run dev"

  4. Click or copy the URL: "http://localhost:5173/"

  5. The game should now be running in your browser. (Note: if you experience graphical issues, try another browser)