This implementation provides a reinforcement learning agent using a Deep Q-Network (DQN) to navigate in a 2D environment. The agent learns to find the shortest path to a goal by iteratively exploring the environment and improving its policy.
The main agent class handles interaction with the environment. Key functionalities (an action-selection sketch follows this list):
- Initialization of Q-network, replay buffer, and learning parameters
- Action selection using an epsilon-greedy policy
- Conversion of discrete actions to continuous movements
- Storage of transitions in the replay buffer
- Processing rewards and updating the Q-network
- Epsilon decay over time using a cosine schedule
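A minimal sketch of the epsilon-greedy selection and cosine decay described above, assuming a PyTorch Q-network with one output per discrete action. The exact decay formula, the epsilon floor, and the function names are illustrative assumptions, not taken from the implementation:

```python
import math
import random

import torch


def epsilon_by_step(step, total_steps, eps_start=1.0, eps_end=0.05):
    """Cosine-annealed epsilon; the floor value eps_end is a placeholder."""
    progress = min(step / total_steps, 1.0)
    return eps_end + 0.5 * (eps_start - eps_end) * (1.0 + math.cos(math.pi * progress))


def get_next_discrete_action(q_network, state, epsilon, num_actions=4):
    """Epsilon-greedy choice over the four discrete actions."""
    if random.random() < epsilon:
        return random.randrange(num_actions)
    with torch.no_grad():
        q_values = q_network(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
    return int(q_values.argmax(dim=1).item())
```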
Stores transitions (state, action, reward, next_state) for experience replay:
- Uses a double-ended queue with a maximum size of 100,000 transitions
- Allows sampling of transitions for training
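A minimal sketch of such a buffer, assuming transitions are stored as plain tuples:

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity experience replay store (up to 100,000 transitions)."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random mini-batch for training.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```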
A neural network implementation using PyTorch:
- Two hidden layers with 100 units each and ReLU activation
- Input dimension of 2 (for 2D state space)
- Output dimension of 4 (for the four possible discrete actions)
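A sketch of a network matching this description; the class and attribute names are illustrative:

```python
import torch.nn as nn


class QNetwork(nn.Module):
    """2-D state in, one Q-value per discrete action out."""

    def __init__(self, input_dim=2, hidden_dim=100, output_dim=4):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim),
        )

    def forward(self, state):
        return self.layers(state)
```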
Handles the training of the Q-network; a training-step sketch follows this list:
- Maintains both a primary Q-network and a target Q-network
- Updates the target network periodically to stabilize learning
- Implements loss calculation using the Bellman equation
- Uses Adam optimizer for gradient updates
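A sketch of the training step under these points, assuming a mean-squared-error loss on the Bellman target (the text only says the loss comes from the Bellman equation) and no terminal-state masking, since episodes end on a fixed step count:

```python
import torch


class DQNTrainer:
    """Trains a primary Q-network against a periodically synced target network."""

    def __init__(self, q_network, target_network, gamma=0.99, lr=0.005):
        self.q_network = q_network
        self.target_network = target_network
        self.gamma = gamma
        self.optimiser = torch.optim.Adam(q_network.parameters(), lr=lr)

    def train_step(self, states, actions, rewards, next_states):
        # Q(s, a) for the actions actually taken.
        q_values = self.q_network(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        # Bellman target: r + gamma * max_a' Q_target(s', a').
        with torch.no_grad():
            max_next_q = self.target_network(next_states).max(dim=1).values
            targets = rewards + self.gamma * max_next_q
        loss = torch.nn.functional.mse_loss(q_values, targets)
        self.optimiser.zero_grad()
        loss.backward()
        self.optimiser.step()
        return loss.item()

    def update_target_network(self):
        # Copy primary weights into the target network.
        self.target_network.load_state_dict(self.q_network.state_dict())
```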
The agent can take four discrete actions that are converted to continuous movements (see the mapping sketch below):
- 0: Move left (-0.02, 0)
- 1: Move right (0.02, 0)
- 2: Move up (0, 0.02)
- 3: Move down (0, -0.02)
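A simple lookup table is enough for this conversion; the names below are illustrative:

```python
import numpy as np

# Mapping from discrete action index to a continuous (dx, dy) movement.
DISCRETE_TO_CONTINUOUS = {
    0: np.array([-0.02, 0.0], dtype=np.float32),  # left
    1: np.array([0.02, 0.0], dtype=np.float32),   # right
    2: np.array([0.0, 0.02], dtype=np.float32),   # up
    3: np.array([0.0, -0.02], dtype=np.float32),  # down
}


def discrete_to_continuous(action_index):
    return DISCRETE_TO_CONTINUOUS[action_index]
```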
- The agent explores the environment using an epsilon-greedy policy
- Experiences are stored in the replay buffer
- After collecting sufficient data, mini-batches are sampled for training
- The Q-network is updated to minimize the temporal difference error
- The target network is periodically updated to match the Q-network
- Epsilon decreases over time according to a cosine decay schedule (the loop sketch below ties these steps together)
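A hypothetical driver loop tying these steps together, reusing the sketches above; the environment's reset/step interface and the tensor conversion are assumptions:

```python
import numpy as np
import torch


def run_training(env, q_network, replay_buffer, trainer, num_episodes,
                 episode_length=230, warmup=120, batch_size=100, target_every=55):
    """Hypothetical training driver; env.reset() and env.step() are assumed methods."""
    total_steps = num_episodes * episode_length
    step_count = 0
    for _ in range(num_episodes):
        state = env.reset()                                     # assumed env method
        for _ in range(episode_length):
            epsilon = epsilon_by_step(step_count, total_steps)  # cosine decay
            action = get_next_discrete_action(q_network, state, epsilon)
            # Assumed env call: applies a continuous action, returns next state and distance.
            next_state, distance_to_goal = env.step(discrete_to_continuous(action))
            reward = 1.0 - distance_to_goal                     # base reward from the text
            replay_buffer.add(state, action, reward, next_state)

            # Train only after enough transitions have been collected.
            if len(replay_buffer) >= warmup:
                batch = replay_buffer.sample(batch_size)
                states, actions, rewards, next_states = zip(*batch)
                trainer.train_step(
                    torch.as_tensor(np.asarray(states), dtype=torch.float32),
                    torch.as_tensor(actions, dtype=torch.int64),
                    torch.as_tensor(rewards, dtype=torch.float32),
                    torch.as_tensor(np.asarray(next_states), dtype=torch.float32),
                )
            # Periodically sync the target network with the primary network.
            if step_count % target_every == 0:
                trainer.update_target_network()

            state = next_state
            step_count += 1
```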
Rewards are based on the distance to the goal:
- Base reward: 1 - distance_to_goal
- Higher rewards for being closer to the goal
- Scaled rewards for different distance thresholds
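An illustrative reward function consistent with this description. The base term comes from the text, but the thresholds and scaling factors below are placeholders, since the actual values are not specified:

```python
def compute_reward(distance_to_goal):
    """Base reward from the text; the threshold bands and factors are hypothetical."""
    reward = 1.0 - distance_to_goal
    # Hypothetical extra scaling for being within certain distance bands.
    if distance_to_goal < 0.05:
        reward *= 2.0
    elif distance_to_goal < 0.2:
        reward *= 1.5
    return reward
```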
- Episode length: 230 steps
- Minimum buffer size before training starts: 120 transitions
- Mini-batch size: 100 transitions
- Target network update frequency: Every 55 steps
- Discount factor (gamma): 0.99
- Learning rate: 0.005
- Initial epsilon: 1.0
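The same values collected into a configuration dictionary (the key names are illustrative):

```python
HYPERPARAMETERS = {
    "episode_length": 230,       # steps per episode
    "buffer_warmup": 120,        # transitions collected before training starts
    "batch_size": 100,           # mini-batch size
    "target_update_every": 55,   # steps between target-network updates
    "gamma": 0.99,               # discount factor
    "learning_rate": 0.005,      # Adam learning rate
    "initial_epsilon": 1.0,      # starting exploration rate
}
```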
This agent is designed to be used in a compatible environment that provides:
- A 2D state representation
- Distance to goal measurement
- The ability to apply continuous actions
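A minimal sketch of what such an environment interface might look like; the method names and signatures are assumptions:

```python
from typing import Protocol, Tuple

import numpy as np


class CompatibleEnvironment(Protocol):
    """Hypothetical contract; the method names and return types are illustrative."""

    def reset(self) -> np.ndarray:
        """Return the initial 2-D state."""
        ...

    def step(self, continuous_action: np.ndarray) -> Tuple[np.ndarray, float]:
        """Apply a continuous (dx, dy) action and return (next_state, distance_to_goal)."""
        ...
```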
The agent can be integrated into a simulation or training loop by calling its methods in the following order:
- Call get_next_action(state) to get the next action to take
- Apply the action in the environment to get next_state and distance_to_goal
- Call set_next_state_and_distance(next_state, distance_to_goal) to update the agent
- Call has_finished_episode() to check if the episode has ended
For evaluation, the get_greedy_action(state)
method can be used to get the best action without exploration.
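A usage sketch built from the method names above; the Agent constructor and the environment object are assumptions:

```python
agent = Agent()                                    # class name assumed
state = environment.reset()                        # environment object and API assumed

for _ in range(10_000):                            # arbitrary number of training steps
    action = agent.get_next_action(state)          # action to apply in the environment
    next_state, distance_to_goal = environment.step(action)   # assumed env call
    agent.set_next_state_and_distance(next_state, distance_to_goal)
    state = next_state
    if agent.has_finished_episode():               # episode boundary reached
        state = environment.reset()                # start a new episode (assumed)

# Evaluation: query the best action without exploration.
greedy_action = agent.get_greedy_action(state)
```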