What would be a good interface for specifying the exploration policy?
It is implemented differently here and in DeepQLearning.jl.
- What is implemented here: only a limited set of exploration policies is allowed, e.g. `EpsGreedyPolicy`, and the code uses the internals of that policy to access the Q-values. I think this is pretty bad: `EpsGreedyPolicy` should be agnostic to the type of policy used for the greedy part (right now I think it assumes a tabular policy), and if we improve `EpsGreedyPolicy`, the code here will break.
- In `DeepQLearning.jl`, the user must pass in a function `f`, and `f(policy, env, obs, global_step, rng)` is called to return the action. I took inspiration from MCTS.jl for this. However, it is not very convenient to define a decaying epsilon schedule with this approach.
- A suggestion is to use a function `action(::ExplorationPolicy, current_policy, env, obs, rng)`. Dispatching on the type of `ExplorationPolicy` and having users implement their own type seems more Julian than passing a function. The downside is that this `action` method is not very consistent with the rest of the POMDPs.jl interface, since it takes the current policy and the environment as input.
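
To make the third option concrete, here is a rough sketch of what dispatching on an exploration policy type could look like with a decaying epsilon. Everything in it is hypothetical and self-contained: `ExplorationPolicy`, `DecayingEpsGreedy`, `current_eps`, `ToyEnv`, `ToyGreedyPolicy`, `greedy_action` and `available_actions` are made-up names for illustration, not existing POMDPs.jl or DeepQLearning.jl API. It also assumes the exploration policy keeps its own step counter, which is only one way to handle the schedule.

```julia
using Random

# Hypothetical abstract type that users would subtype (assumption, not existing API).
abstract type ExplorationPolicy end

# Epsilon-greedy exploration with a linearly decaying epsilon.
# The step counter is stored in the policy itself, so the solver
# does not have to pass a global step around.
mutable struct DecayingEpsGreedy <: ExplorationPolicy
    eps_start::Float64
    eps_end::Float64
    decay_steps::Int
    step::Int
end
DecayingEpsGreedy(eps_start, eps_end, decay_steps) =
    DecayingEpsGreedy(eps_start, eps_end, decay_steps, 0)

# Linear interpolation from eps_start to eps_end over decay_steps calls.
function current_eps(p::DecayingEpsGreedy)
    frac = min(p.step / p.decay_steps, 1.0)
    return p.eps_start + frac * (p.eps_end - p.eps_start)
end

# Toy stand-ins for the environment and the greedy policy (assumptions).
struct ToyEnv
    actions::Vector{Symbol}
end
struct ToyGreedyPolicy
    best::Symbol
end
available_actions(env::ToyEnv) = env.actions
greedy_action(p::ToyGreedyPolicy, obs) = p.best

# The proposed entry point: dispatch on the exploration policy type.
# The greedy part is only reached through greedy_action, so the
# exploration policy never touches Q-values directly.
function action(p::DecayingEpsGreedy, current_policy, env, obs, rng::AbstractRNG)
    epsilon = current_eps(p)
    p.step += 1
    if rand(rng) < epsilon
        return rand(rng, available_actions(env))    # explore
    else
        return greedy_action(current_policy, obs)   # exploit
    end
end
```

A solver would then just call `action(exploration_policy, current_policy, env, obs, rng)` at every step, and a user who wants a different schedule would define a new subtype and its own `action` method instead of passing a closure.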
Any thoughts?