-
Notifications
You must be signed in to change notification settings - Fork 100
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dict policy #491
Dict policy #491
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@NeroBlackstone thanks for your contribution! Can you have a look at my comments and make those changes before we merge?
lib/POMDPTools/src/Policies/dict.jl
Outdated
end | ||
end | ||
if max_action === nothing | ||
max_action = available_actions[1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
max_action = available_actions[1] | |
max_action = first(available_actions) |
first
is preferable to [1]
because it will work for containers that are not indexable (e.g.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More importantly, I think it would be better to use a default policy here, i.e. store default
in the struct and call action(p.default, s)
here. I think it would be even better to have a user-defined default value (implicitly you have defined it with max_action_value=0
here. Then, if none of the values in the Dict
are higher, it will use the default policy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you. So I will add a new field default_policy
in struct ValueDictPolicy
.
And I think it would be better if we have two constructors for ValueDictPolicy
.
ValueDictPolicy(mdp::MDP)
: Random return one of action from actions(p.mdp,s)
, if max value action is not exist.
ValueDictPolicy(mdp::MDP, default_policy::Policy)
: Return action by default_policy
, if max value action is not exist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we let user define default value?
I set max_action_value=0
because I want to keep the same behavior with ValuePolicy
, the default value of ValuePolicy
is 0, before we encounter a specific state-action pair in the environment. Souce Code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More importantly, I think it would be better to use a default policy here, i.e. store
default
in the struct and callaction(p.default, s)
here.
Do you mean to expose how to select action to the user?
Or just a policy seal in the struct and only execute in action()
function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I set max_action_value=0 because I want to keep the same behavior with ValuePolicy, the default value of ValuePolicy is 0, before we encounter a specific state-action pair in the environment. Souce Code
Yes, but in ValuePolicy
the user can change that if they want to. I think you should have
struct ValueDictPolicy{M<:MDP, T<:AbstractDict{Tuple,Float64}, P<:Policy} <: Policy
mdp::M
value_dict::T
default_value::Float64
default_policy::P
end
with the constructor setting default_value=-Inf
unless the user overrides it. And then
max_action_value = p.default_value
initially, and
if isnothing(max_action)
return action(p.default_policy)
near the end of the function. Does that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think default_value
is very helpful, users could control the "initial value" of the "value table" (it's dict here)
But default_policy
, it will be available on the situation "there is no max value action, so let default_policy help me decide what action to return" only. In most instances, it's the first time we encounter a state, so all the values are the same.
But shall we let users decide what action returned, when all the action values are the same?
I'm not sure, maybe just returning random action is better?
Could you please give me an example about "in this situation we must let users define policy select action"?
Any solver algorithm need control policy here?
Thank you very much, if you have any ideas please let me know, I will finish it.
Co-authored-by: Zachary Sunberg <[email protected]>
I have added |
3107806
to
5ff641a
Compare
5ff641a
to
b53dcc8
Compare
Deleted some test code. |
d29d2f7
to
b53dcc8
Compare
I just added the default policy if you want to see what I meant. Thanks again for your contribution @NeroBlackstone ! |
Thank you very much, I learn a lot. |
* add dict policy * update dict.jl * add docs for DIctPolicy * Apply suggestions from code review Co-authored-by: Zachary Sunberg <[email protected]> * add default_value * revert runtest.jl * add return type * added a default policy, changed actionvalues to valuemap * updated docs --------- Co-authored-by: Zachary Sunberg <[email protected]>
Related issue