A machine learning project that implements a digital Rubik's Cube solver using reinforcement learning, specifically the Proximal Policy Optimization (PPO) algorithm. The solver creates a virtual 3×3 Rubik's Cube environment, scrambles it, and uses deep reinforcement learning to learn solving policies.
- Digital Rubik's Cube Environment: Complete 3D cube simulation with one-hot encoded color representation
- Reinforcement Learning: PPO algorithm implementation with custom neural network architecture
- Progressive Training: Curriculum learning approach with increasing scramble complexity
- Visual Feedback: Colored terminal output for cube visualization
- Model Persistence: Save and load trained models for continued training or testing
- Performance Metrics: Success rate tracking and solving statistics
Install the required packages:
```bash
pip install gymnasium
pip install stable-baselines3
pip install numpy
pip install torch
```

Execute the main script to start training or testing:
```bash
python main.py
```

```
RubiksCubeSolver/
├── main.py        # Main training and testing script
├── rubiks.py      # Rubik's cube implementation and move functions
├── models/        # Directory containing trained model files
│   └── model-*.zip    # Saved PPO models for different scramble levels
├── README.md      # Project documentation
└── LICENSE        # MIT License
```
The Rubik's Cube is represented using a dictionary structure where each face is a 3×3 NumPy array with one-hot encoded colors (a short illustrative sketch follows the list):
- White: `[1, 0, 0, 0, 0, 0]`
- Red: `[0, 1, 0, 0, 0, 0]`
- Yellow: `[0, 0, 1, 0, 0, 0]`
- Orange: `[0, 0, 0, 1, 0, 0]`
- Blue: `[0, 0, 0, 0, 1, 0]`
- Green: `[0, 0, 0, 0, 0, 1]`
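For illustration only, a solved cube under this scheme could be built as below; the face keys and the color-to-face assignment are assumptions, not necessarily the project's exact layout:

```python
import numpy as np

# Illustrative sketch: face names and color placement are assumptions.
COLOR_INDEX = {"white": 0, "red": 1, "yellow": 2, "orange": 3, "blue": 4, "green": 5}

def solved_face(color):
    """Return a 3x3 face whose 9 stickers all carry the given one-hot color."""
    face = np.zeros((3, 3, 6), dtype=np.int8)
    face[:, :, COLOR_INDEX[color]] = 1
    return face

cube = {
    "up": solved_face("white"),
    "down": solved_face("yellow"),
    "front": solved_face("red"),
    "back": solved_face("orange"),
    "right": solved_face("blue"),
    "left": solved_face("green"),
}
```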
The implementation supports all standard Rubik's Cube moves:
- Face Rotations: F, R, B, L, U, D (clockwise)
- Prime Moves: F', R', B', L', U', D' (counter-clockwise)
The RubiksCubeEnv class implements a Gymnasium environment with the following; a simplified sketch appears after this list:
- Action Space: 12 discrete actions (6 face rotations + 6 prime moves)
- Observation Space: 324-dimensional binary vector (54 squares × 6 colors)
- Reward System: Negative reward per step (-1) to encourage efficiency
- Episode Termination: Success (cube solved) or timeout (step limit reached)
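A simplified sketch of how such an environment can be assembled with Gymnasium is shown below. It reuses the helper names listed later in this README (`initialize_cube`, `scramble_cube`, `is_solved`, `onehotstate`), but the `MOVES` lookup table, the import locations, and the exact signatures are assumptions:

```python
import gymnasium as gym
from gymnasium import spaces
from rubiks import initialize_cube, scramble_cube, is_solved, onehotstate, MOVES  # MOVES is assumed

class RubiksCubeEnv(gym.Env):
    """Simplified sketch of the environment described above."""

    def __init__(self, scrambles=0, time_limit=10):
        super().__init__()
        self.scrambles = scrambles
        self.time_limit = time_limit
        self.action_space = spaces.Discrete(12)           # 6 face turns + 6 prime moves
        self.observation_space = spaces.MultiBinary(324)  # 54 stickers x 6 one-hot colors

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.cube = initialize_cube()
        scramble_cube(self.cube, self.scrambles)  # signature assumed
        self.steps = 0
        return onehotstate(self.cube), {}

    def step(self, action):
        MOVES[action](self.cube)  # apply the chosen move in place
        self.steps += 1
        terminated = is_solved(self.cube)
        truncated = self.steps >= self.time_limit
        reward = -1  # -1 per step rewards shorter solutions
        return onehotstate(self.cube), reward, terminated, truncated, {}
```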
The PPO agent uses a custom neural network with the following configuration (a policy_kwargs sketch appears after this list):
- Policy Network: 5 hidden layers of 256 neurons each
- Value Network: 5 hidden layers of 256 neurons each
- Activation Function: ReLU
- Algorithm: Proximal Policy Optimization (PPO)
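In recent stable-baselines3 versions an architecture like this is typically passed through `policy_kwargs`; the snippet below is a sketch under that assumption, not the project's exact configuration (`RubiksCubeEnv` refers to the environment described above):

```python
import torch
from stable_baselines3 import PPO

# Five hidden layers of 256 units for both the policy (pi) and value (vf) networks
policy_kwargs = dict(
    activation_fn=torch.nn.ReLU,
    net_arch=dict(pi=[256] * 5, vf=[256] * 5),
)

env = RubiksCubeEnv()
model = PPO("MlpPolicy", env, policy_kwargs=policy_kwargs, verbose=1)
```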
To train a model with progressive difficulty:
```python
# Set training = True in main.py
training = True

if training:
    # Curriculum: train on progressively longer scrambles
    for scrambles in range(1, 21):
        env.scrambles = scrambles
        env.time_limit = scrambles ** 2
        model.learn(total_timesteps=50000 * scrambles)
        # `date` is assumed to be defined earlier in main.py
        model.save(f"models/model-{date}--50k-{scrambles}s")
```

To test a model's performance:
```python
# Set testing = True in main.py
testing = True

if testing:
    # Load a trained model
    reloaded_model = PPO.load("models/model-050824--4s")

    # Test on 4-move scrambles
    env.scrambles = 4
    env.time_limit = 16
    # ... testing loop
```

You can also manually interact with the cube:
```python
from rubiks import cube, front, right, up, print_cube

# Perform moves
front(cube)
right(cube)
up(cube)

# Display the cube
print_cube(cube)
```

Core functions (a usage sketch follows the list):
- `initialize_cube()`: Creates a solved cube state
- `scramble_cube()`: Randomly scrambles the cube with N moves
- `is_solved()`: Checks if the cube is in the solved state
- `print_cube()`: Displays the cube with colored output
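A minimal usage sketch of these helpers, assuming they live in rubiks.py and take the cube as an argument (exact signatures may differ):

```python
from rubiks import initialize_cube, scramble_cube, is_solved, print_cube

cube = initialize_cube()   # fresh, solved cube
scramble_cube(cube, 5)     # apply 5 random moves (argument form assumed)
print_cube(cube)           # colored terminal rendering
print(is_solved(cube))     # False until the scramble is undone
```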
All move functions are available in rubiks.py (a short example follows the list):
- `front()`, `front_prime()`
- `right()`, `right_prime()`
- `back()`, `back_prime()`
- `left()`, `left_prime()`
- `up()`, `up_prime()`
- `down()`, `down_prime()`
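Each prime move undoes its clockwise counterpart. A quick sanity check, assuming `is_solved` is importable from rubiks.py and takes the cube object:

```python
from rubiks import cube, right, right_prime, is_solved

right(cube)             # clockwise quarter turn of the right face
right_prime(cube)       # counter-clockwise turn reverses it
print(is_solved(cube))  # True again if the cube started solved
```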
Utility functions (sketched below):
- `onehotstate()`: Converts the cube to a flattened observation vector
- `clear_terminal()`: Cross-platform terminal clearing
- `rotate_face_clockwise()`: NumPy-based face rotation
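For reference, the two array helpers could look roughly like this; a sketch under the (3, 3, 6) face layout described earlier, not the project's exact code:

```python
import numpy as np

def rotate_face_clockwise(face):
    # np.rot90 with k=-1 rotates the 3x3 sticker grid clockwise;
    # the one-hot color axis (last dimension) is untouched.
    return np.rot90(face, k=-1)

def onehotstate(cube):
    # Flatten all 6 faces (54 stickers x 6 colors) into a 324-length vector.
    return np.concatenate([cube[face].reshape(-1) for face in cube])
```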
Key RubiksCubeEnv parameters:
- `scrambles`: Number of scramble moves (default: 0)
- `time_limit`: Maximum steps per episode (default: 10)
Training parameters:
- `total_timesteps`: Training duration per difficulty level
- `policy_kwargs`: Neural network architecture settings
- `verbose`: Training output verbosity
The project includes pre-trained models for different scramble complexities:
- `model-*--1s.zip`: 1-move scrambles
- `model-*--2s.zip`: 2-move scrambles
- ... up to 8+ move scrambles
Success rates vary by scramble complexity, with simpler scrambles achieving higher solve rates.
Contributions are welcome! Areas for improvement:
- Reward Engineering: Implement Manhattan distance or other heuristics
- Advanced Algorithms: Experiment with A3C, SAC, or other RL algorithms
- Curriculum Learning: Improve training progression strategies
- Performance Optimization: Enhance solving efficiency and success rates
This project is licensed under the MIT License - see the LICENSE file for details.
- Gymnasium (the maintained fork of OpenAI Gym) for the RL environment framework
- Stable-Baselines3 for the PPO implementation
- NumPy for efficient array operations
Note: This is an educational project demonstrating the application of reinforcement learning to combinatorial puzzles. The current implementation focuses on learning and experimentation rather than optimal solving performance.