Multiple terminal states #3
Comments
I think you don't need to change the code. It should work with a list of states thanks to numpy's indexing. The general idea is that you set all terminal states equal to one.
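For concreteness, here is a minimal sketch of what such a backward pass typically looks like (a sketch in the spirit of `local_action_probabilities()`, not the repository's exact code; `p_transition` is assumed to be indexed as `[state, next_state, action]` and `reward` to be state-only):

```python
import numpy as np

def local_action_probabilities(p_transition, terminal, reward):
    """Backward-pass sketch; `terminal` can be a single index or a list/array of indices."""
    n_states, _, n_actions = p_transition.shape
    er = np.exp(reward)
    p = [p_transition[:, :, a] for a in range(n_actions)]

    # set all terminal states equal to one (numpy fancy indexing accepts a list here)
    zs = np.zeros(n_states)
    zs[terminal] = 1.0

    # backward pass for a fixed number of iterations
    for _ in range(2 * n_states):
        za = np.array([er * p[a].dot(zs) for a in range(n_actions)]).T
        zs = za.sum(axis=1)

    # local action probabilities pi(a|s)
    return za / zs[:, None]
```

Calling this with, say, `terminal=[3, 7]` (hypothetical indices) initializes both entries of `zs` in a single assignment, so a list of terminal states needs no further changes.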
I am using MaxEnt for the Iterated Prisoner's Dilemma. It is an unending game and I have limited it to n iterations, so any state can be a terminal state. I will try adding all the states to the list, thanks!
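A sketch of that plan, assuming states are plain integer indices (how the iterated game is encoded into states is not shown here):

```python
import numpy as np

n_states = 16                    # hypothetical size of the state space
terminal = np.arange(n_states)   # treat every state as a possible terminal state

# local_action_probabilities(p_transition, terminal, reward) would then set
# zs[terminal] = 1.0 for all states at once via fancy indexing.
```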
Ah, interesting. I haven't worked with this (and IRL in general) for a while, so please be aware that I might not be the best person to answer theoretical questions about it. However, I think that initializing all states to one and then iterating for some fixed number of steps should theoretically work and be a sound approximation to the unending game, although there could be a technical issue with that. Instead, however, you probably want to consider using maximum causal entropy IRL. Also, for the causal function you'll have to specify the terminal reward function yourself (because of this check), but you can just set it to zero everywhere and it should work fine.
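For illustration, here is a generic sketch of a causal-entropy backward pass (soft value iteration) where the terminal reward is simply zero everywhere; the function name and signature are illustrative assumptions, not necessarily the repository's API:

```python
import numpy as np
from scipy.special import logsumexp

def causal_backward_pass(p_transition, reward, terminal_reward, discount=0.9, eps=1e-5):
    """Soft value iteration sketch; terminal_reward (e.g. all zeros) only seeds the values."""
    n_states, _, n_actions = p_transition.shape
    v = terminal_reward.copy()
    while True:
        # Q(s, a) = r(s) + discount * E_{s' ~ P(.|s,a)}[V(s')]
        q = reward[:, None] + discount * np.einsum('ijk,j->ik', p_transition, v)
        v_new = logsumexp(q, axis=1)          # soft maximum over actions
        if np.max(np.abs(v_new - v)) < eps:
            break
        v = v_new
    return np.exp(q - v_new[:, None])         # stochastic policy pi(a|s)
```

With `terminal_reward = np.zeros(n_states)` this matches the "set it to zero everywhere" suggestion, and the discount factor keeps the iteration convergent even without proper terminal states.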
Thanks. As a first step (to verify the new world environment I added) I set one of the states as the terminal state. That worked fairly well, and the recovered reward function was close to the original. After that I tried the infinite-horizon case; for that I commented out the terminal-state handling, but the results were no longer meaningful. At this point I am not sure whether it is a question of tweaking some parameters or whether I have got something fundamentally wrong.
Sorry for the late response. I think this suggests that the expected SVF doesn't converge to something meaningful, which is probably more of a theoretical or algorithmic issue than just a matter of tweaking parameters. Regarding convergence, Ziebart et al. discuss this in their paper.
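To make the convergence point concrete, here is a sketch of the forward pass that accumulates the expected state visitation frequencies (illustrative names and assumed shapes, not the repository's exact code):

```python
import numpy as np

def expected_svf(p_transition, p_initial, p_action, terminal, n_iter=200):
    """Forward-pass sketch: expected state visitation frequencies under policy p_action.

    Assumed shapes: p_transition (n_states, n_states, n_actions),
    p_action (n_states, n_actions), p_initial (n_states,).
    """
    d = p_initial.copy()       # D^(0): distribution over initial states
    svf = d.copy()
    for _ in range(n_iter):
        d[terminal] = 0.0      # terminal states absorb: their mass stops propagating
        # D^(t+1)(s') = sum_{s,a} D^(t)(s) * pi(a|s) * P(s'|s,a)
        d = np.einsum('i,ik,ijk->j', d, p_action, p_transition)
        svf += d
    return svf
```

If no state is terminal (and no discounting is used), the per-step mass `d` never dies out, so `svf` keeps growing with the number of iterations instead of settling, which is one way to see why the infinite-horizon case does not yield a meaningful SVF here.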
There is a paper (https://arxiv.org/abs/2012.00889) proposing an algorithm that doesn't explicitly rely on terminal states (and has some other improvements), so maybe that would work better here.
Thanks a lot for such well-documented code. It is making it really easy for me to adapt it to my use case.
My MDP has multiple terminal states and I was wondering how to change the local_action_probabilities() code for that scenario. The Ziebart paper mentions you have to do it for all terminal states, but I am not sure how to combine them. Thanks for your help!