Explaining an Agent’s Future Beliefs through Temporally Decomposing Future Reward Estimators

Future reward estimation is a core component of reinforcement learning agents, i.e., Q-value and state-value functions, predicting an agent's sum of future rewards. However, their scalar output prevents understanding when or what individual future rewards an agent expects to receive. We address this by modifying an agent's future reward estimator to predict its next N expected rewards, referred to as Temporal Reward Decomposition (TRD). This unlocks novel explanations of agent behaviour, including when and what quantity of reward the agent expects to receive and its confidence in that, measuring an input feature's temporal importance on decision-making, and understanding the impact of different actions through their differences in future rewards. Furthermore, we show that DQN agents trained on Atari environments can be efficiently retrained to incorporate TRD with minimal impact on performance.

For the full paper, see https://arxiv.org/abs/2408.08230
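
The core idea can be sketched as a Q-network whose output layer predicts N future rewards per action rather than a single scalar; summing over the reward dimension recovers the usual Q-value. The snippet below is a minimal illustrative sketch, not the repository's training code; the class and argument names (TRDQNetwork, obs_dim, num_rewards) are assumptions for illustration.

```python
# Minimal sketch (not the repository's implementation) of a Temporal Reward
# Decomposition head: the network predicts the next N expected rewards per
# action, and the ordinary Q-value is recovered by summing over that dimension.
import torch
import torch.nn as nn


class TRDQNetwork(nn.Module):
    def __init__(self, obs_dim: int, num_actions: int, num_rewards: int):
        super().__init__()
        self.num_actions = num_actions
        self.num_rewards = num_rewards  # N future-reward predictions
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_actions * num_rewards),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # (batch, num_actions, num_rewards): the expected reward at each of
        # the next N reward steps, for every action.
        return self.net(obs).view(-1, self.num_actions, self.num_rewards)

    def q_values(self, obs: torch.Tensor) -> torch.Tensor:
        # Summing the decomposed predictions recovers the standard Q-value.
        return self.forward(obs).sum(dim=-1)


if __name__ == "__main__":
    net = TRDQNetwork(obs_dim=8, num_actions=4, num_rewards=10)
    obs = torch.randn(2, 8)
    print(net(obs).shape, net.q_values(obs).shape)  # (2, 4, 10) and (2, 4)
```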

Example explanations

Temporal Reward Decomposition explicitly predicts the agent's next N expected rewards, providing novel explanatory opportunities. We highlight the three types of explanation shown in the paper.

What are the future expected rewards?

Example future rewards for Atari Space Invaders

Example future rewards for Atari Breakout

These figures become more informative when the agent's expected rewards are plotted over an entire episode, as in the videos below (a sketch for producing such a plot follows them).

SpaceInvaders-expected-reward.mp4
Breakout-expected-reward.mp4
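
A single frame of such a plot can be produced by taking the decomposed output for the greedy action and drawing one bar per future reward step. This is a hedged sketch that assumes a TRD-style network like the TRDQNetwork sketched above; the function name and plotting choices are illustrative, not the repository's code.

```python
# Sketch: visualise the next-N expected rewards for the greedy action of a
# single observation, assuming a TRD-style network as sketched above.
import matplotlib.pyplot as plt
import torch


@torch.no_grad()
def plot_expected_rewards(net, obs: torch.Tensor) -> None:
    # Decomposed rewards for one observation: (num_actions, N).
    decomposed = net(obs.unsqueeze(0))[0]
    greedy_action = decomposed.sum(dim=-1).argmax().item()
    rewards = decomposed[greedy_action].cpu().numpy()

    # One bar per future reward step for the greedy action.
    plt.bar(range(1, len(rewards) + 1), rewards)
    plt.xlabel("Future reward step")
    plt.ylabel("Expected reward")
    plt.title(f"Expected future rewards (action {greedy_action})")
    plt.show()
```

Calling this at every environment step and stitching the frames together gives an episode-long animation like the videos above.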

What observation features influence future rewards?

Utilising saliency map algorithms, we can plot observation feature importance for individual future rewards rather than for the cumulative future reward, as is normally done.

Observation feature importance for Atari Breakout environment
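
One simple way to compute such a map is to differentiate a single decomposed reward prediction with respect to the observation. The sketch below uses plain input-gradient saliency and is not necessarily the saliency algorithm used in the paper; it again assumes a TRD-style network as sketched earlier.

```python
# Sketch: gradient saliency of one decomposed reward component with respect
# to the observation, rather than of the cumulative Q-value.
import torch


def reward_component_saliency(net, obs: torch.Tensor, action: int, step: int) -> torch.Tensor:
    # Differentiate the chosen action's reward prediction at `step` w.r.t. the input.
    obs = obs.clone().detach().requires_grad_(True)
    decomposed = net(obs.unsqueeze(0))[0]   # (num_actions, N)
    decomposed[action, step].backward()
    return obs.grad.abs()                   # per-feature importance for that future reward
```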

What is the difference between two actions in terms of future rewards?

Using the future expected rewards, we can explain the differences between two different actions.

Example contrastive explanation for the Atari Seaquest environment
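
In vector form, the contrast between two actions is simply the element-wise difference of their decomposed reward predictions, showing at which future reward steps one action is expected to pay off more than the other. A hedged sketch, again assuming the illustrative TRD-style network from above:

```python
# Sketch: contrastive explanation as the element-wise difference between the
# decomposed future rewards of two actions for the same observation.
import torch


@torch.no_grad()
def contrastive_rewards(net, obs: torch.Tensor, action_a: int, action_b: int) -> torch.Tensor:
    decomposed = net(obs.unsqueeze(0))[0]   # (num_actions, N)
    # Positive entries: reward steps where action_a is expected to earn more
    # than action_b; negative entries favour action_b.
    return decomposed[action_a] - decomposed[action_b]
```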

Project structure

The training scripts are provided in temporal_reward_decomposition with the default hyperparameters; the QDagger-based algorithms require the pre-trained DQN models to be downloaded. See dqn-models/readme.md for more detail.

Trained TRD models are provided in trd-models.

Citation

todo
