Dataset encapsulation with Plugin Manager #291
Conversation
Thanks for the PR. I think this will greatly help with maintaining the scripts in the bin folder :)
See the inline comments for feedback.
keras_retinanet/bin/train.py
Outdated
```python
    :return: None
    """

    pm = PluginManagerSingleton.get()
```
I prefer to avoid any use of global state (including singletons). Would it be an option to simply return a list of all loaded plugins from this function? That has the added benefit that it would be really easy to add a fallback that doesn't use yapsy: we only have to return a fixed list of dataset plugins.
Or maybe even turn it into a dictionary mapping `dataset_type` to `plugin_object`. Then the consumer of the result has to know nothing about yapsy at all.
If for some reason we really need a plugin manager object (though it doesn't look like we do), then I'd prefer to create a new plugin manager and return it from this function.
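For illustration, a minimal sketch of that suggestion, assuming yapsy's standard `PluginManager` API (the function name is just a placeholder):

```python
from yapsy.PluginManager import PluginManager

def load_plugins(paths):
    # Create a local manager instead of relying on a process-wide singleton.
    manager = PluginManager()
    manager.setPluginPlaces(paths)   # directories to scan for *.yapsy-plugin files
    manager.collectPlugins()         # locate, load and instantiate the plugins
    return manager.getAllPlugins()   # plain list of PluginInfo objects; no global state
```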
The elegance of the singleton is that it should be accessible from everywhere. I don't necessarily think that creating this at program start and passing it everywhere is very elegant. If it's needed in a new location (quite feasible), then it becomes a pain to weave it through. The reason I chose the `Singleton` variant is just so that we can't accidentally make a secondary plugin system and have two lots of state.
Are you saying to create a dictionary mapping that lives within the train script? This removes one advantage of the plugin system. Currently the training implementation needs to know nothing about a plugin's specific implementation; it allows plugins to describe themselves and what they respond to. If we make a static mapping, surely that defeats the purpose of this mechanism?
Perhaps the naming of this function is a misnomer. It more accurately reflects plugin discovery and initialisation: it searches based on defined parameters and info files, and creates objects from those via `collectPlugins()`.
> The elegance of the singleton is that it should be accessible from everywhere. I don't necessarily think that creating this at program start and passing it everywhere is very elegant. If it's needed in a new location (quite feasible), then it becomes a pain to weave it through. The reason I chose the `Singleton` variant is just so that we can't accidentally make a secondary plugin system and have two lots of state.
The flip side is that you get a near-invisible dependency on the singleton in potentially many places. That makes it harder to switch to a different plugin system if we ever want to. It's also not clear to me whether the singleton will be sufficient when we want to add a second plugin type (for backbone selection, for example).
> Are you saying to create a dictionary mapping that lives within the train script? This removes one advantage of the plugin system. Currently the training implementation needs to know nothing about a plugin's specific implementation; it allows plugins to describe themselves and what they respond to. If we make a static mapping, surely that defeats the purpose of this mechanism?
What I mean is that we can build a dictionary of dataset types to plugin objects from the list of plugins found by the plugin manager. That way we can completely hide (and decouple) the system used to find the plugins from the train script.
The added advantage is that a fallback to a hardcoded dictionary is really easy to implement so that yapsy can remain an optional dependency.
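A rough sketch of that fallback idea, with the bundled plugin classes (`CocoPlugin`, `PascalPlugin`, `CsvPlugin`) as hypothetical placeholder names:

```python
try:
    from yapsy.PluginManager import PluginManager
except ImportError:
    PluginManager = None  # yapsy stays an optional dependency

def load_plugins(paths):
    if PluginManager is None:
        # Fallback: fixed mapping of the bundled dataset plugins
        # (CocoPlugin, PascalPlugin and CsvPlugin are placeholder names).
        return {'coco': CocoPlugin(), 'pascal': PascalPlugin(), 'csv': CsvPlugin()}
    manager = PluginManager()
    manager.setPluginPlaces(paths)
    manager.collectPlugins()
    # Build the dataset_type -> plugin_object mapping from the discovered plugins.
    return {p.plugin_object.dataset_type: p.plugin_object for p in manager.getAllPlugins()}
```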
> The flip side is that you get a near-invisible dependency on the singleton in potentially many places. That makes it harder to switch to a different plugin system if we ever want to. It's also not clear to me whether the singleton will be sufficient when we want to add a second plugin type (for backbone selection, for example).
Implementing another plugin system would just be a case of either utilising its singleton, or wrapping it in our own singleton to make sure only one exists. At the time of choosing the singleton we weren't requiring it for backbone type, though I had considered it. In that case we would just create wrappers for the `Dataset` and `Backbone` plugin systems if we wanted to keep them separate.
> What I mean is that we can build a dictionary of dataset types to plugin objects from the list of plugins found by the plugin manager. That way we can completely hide (and decouple) the system used to find the plugins from the train script.
Sounds like more evidence for creating a util wrapper to me? We could create our own util `DatasetPluginManager` which handles the creation of the plugin system, initialisation and loading, and has some utility functions like `load_plugins` which return this dict.
From the sounds of it, that util wrapper would just act as a middleman between usage within train and the implementation of the plugin solution.
> From the sounds of it, that util wrapper would just act as a middleman between usage within train and the implementation of the plugin solution.
Yes, with the advantage that it decouples the scripts from the plugin manager and provides an easy minimal interface that can be shared with the other scripts.
Sounds good. Proposed API for this? Anything in particular you'd need from it?
I'm thinking:
- Initialisation of the plugin manager against the 'plugins' folder.
- Load plugins - searches for plugins, collects them, and instantiates them.
- getPlugins - returns the dict of name -> plugin object mappings.
I was thinking simply a single function: `load_plugins(paths)`. The one argument would be a list of search paths; the returned value a dictionary mapping `dataset_type` to `plugin_object`.
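On the consumer side, that interface might look something like this in train.py (a sketch only; `args` and `transform_generator` are assumed to come from the surrounding script):

```python
plugins = load_plugins(['plugins'])   # e.g. {'coco': <plugin>, 'pascal': <plugin>, ...}

plugin = plugins.get(args.dataset_type)
if plugin is None:
    raise ValueError("unknown dataset type: {}".format(args.dataset_type))
generators = plugin.get_generator(args, transform_generator=transform_generator)
```

The train script never touches yapsy directly; swapping the discovery mechanism later would only change `load_plugins`.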
```ini
; limitations under the License.

[Core]
Name = Coco Dataset
```
Would it be an idea to use the plugin name here to mean the name of the subparser? So in this case that would be `pascal`. Instead of an attribute called `dataset_type` on the plugin object, we could then just use the plugin name to look them up.
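For reference, yapsy plugin description files are INI-style, so under this suggestion one might look roughly like the following (the `Module` value is illustrative):

```ini
[Core]
Name = pascal
Module = pascal_voc_plugin

[Documentation]
Description = PASCAL VOC dataset support
```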
The spirit of the Info file denotes that as the name. I think it should remain human readable, as its intended use is for relaying back to users. The justification behind `dataset_type` is that we have a common ident for a given dataset. E.g. VOC is PASCAL VOC, yet we only use `pascal` as the ident. It allows datasets to be correctly named (with even more info) and also to self-identify their shorthand.
On the other hand, finding out the command line argument for a plugin now requires looking at the source of the plugin. To me it makes sense if that information comes from the plugin description file.
Well, ideally plugins would have their own documentation. However, because the subparser decorator adds entries to the subparser, if a user types an incorrect flag it should correct them to say it's not of type <csv|coco|voc>. I believe this was the case during my testing. Remember, these plugins should be able to exist separately from RetinaNet, perhaps residing as a pip package, etc. I think it's very important to keep high-level human readability in mind, like with every other package/plugin system.
I don't want human readability to come at the cost of humans being able to easily see which plugin belongs to which subparser. Considering that we already have a human readable description field, I think the name is a perfect candidate for being the link between the plugin file and the subparser name.
However, the name and description reside in the plugin domain, not the RetinaNet domain. For instance, I may have multiple plugins which define 'coco' for me, maybe taken from different years etc. With the plugin system I can programmatically enable/disable plugins and choose which are active, which may be of use in test cases or scripts users may want to write. I don't believe description to be a suitable replacement for name; both are required together.
I don't find that very compelling. If the user wants to enable/disable plugins, they could do that by modifying the used plugin path, or, like systemd does, by creating/removing symlinks (even Windows has those nowadays). The user should not have to modify code to enable or disable plugins. To me that sort of defeats the purpose of having a plugin system.
The main thing a dataset plugin does is hook in on the creation of a dataset generator. At the place where that hooking occurs, the plugin is identified by a string, so that string is already an identifier for the plugin. I don't want to have two different identifiers.
Slight sidestep: does yapsy allow access to the name of the plugin description file?
keras_retinanet/bin/train.py
Outdated
```python
        generators = plugin.plugin_object.get_generator(args, transform_generator=transform_generator)
        break
else:
    print("Plugin ({}) does not contain a definition for get_generator and cannot be utilised".format(plugin.name))
```
Same here, a warning would make sense. /edit: GitHub doesn't order the comments the way I expected. I expected this comment to come first: #291 (comment)
I would also prefer to move the checking of plugins into the `load_plugins` function (possibly by delegating to a `check_dataset_plugin` function) and just assume they are correct here.
Delegation of the checking sounds like a good idea. Similar justification to how check args works currently.
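A possible shape for that delegated check; the required attribute names here track the discussion and are assumptions, not final:

```python
import warnings

REQUIRED_ATTRIBUTES = ('dataset_type', 'register_parser_args', 'get_generator')

def check_dataset_plugin(plugin):
    # Reject plugins that don't provide the full dataset interface.
    for attribute in REQUIRED_ATTRIBUTES:
        if not hasattr(plugin.plugin_object, attribute):
            warnings.warn("Plugin ({}) does not define '{}' and cannot be utilised.".format(plugin.name, attribute))
            return False
    return True
```

`load_plugins` could then simply filter: `[p for p in manager.getAllPlugins() if check_dataset_plugin(p)]`.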
keras_retinanet/utils/plugin.py
Outdated
```python
        super(DatasetPlugin, self).__init__()

    def parser_args(self, parser):
```
This function name implies that it returns the definitions of parser arguments, but what it has to do is register its arguments (or in this case subparser) with the given parser. I would prefer a name along the lines of `register_arguments` or `register_parser_args` or some such.
Also, since this is the base for all other plugins, I think we should really document the expected behaviour of these functions well.
`register_parser_args` sounds very fitting.
Yes, documenting the base API sounds like a good call.
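As a sketch of what that documentation might look like on the base class (the behavioural details are assumptions, not final):

```python
from yapsy.IPlugin import IPlugin

class DatasetPlugin(IPlugin):
    """Base class for dataset plugins."""

    def register_parser_args(self, subparsers):
        """Register this dataset's command line interface.

        Implementations should add a subparser (typically named after
        their dataset_type) and attach any dataset-specific arguments.
        """
        raise NotImplementedError('register_parser_args must be implemented by the plugin')
```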
keras_retinanet/utils/plugin.py
Outdated
```python
        pass

    def check_args(self, parsed_args):
        pass
```
Is `pass` the best default implementation here? The function should return a boolean, so either `return False`/`return True` or `raise NotImplementedError(...)` seems like a better choice.
I believe the way our check args has worked previously assumes that if nothing was raised, then it's fine: we run checks and raise where appropriate. Therefore, perhaps `pass` is suitable?
Ah, the old `check_args()` returns the arguments. Bit odd, but not a big problem. With the changes here I see that `pass` is indeed a good default implementation for `check_args()`.
keras_retinanet/utils/plugin.py
Outdated
```python
    def check_args(self, parsed_args):
        pass

    def get_generator(self, args):
```
Here raising a `NotImplementedError` definitely seems like the best default implementation.
Regarding the name of the function, `get_...` to me implies a cheap retrieval operation. A dataset might do arbitrary things when being constructed, including reading files from disk or even the internet. In general it's not expected to be a cheap operation, so I would prefer a name like `make_generators` or `create_generators`. In its current form the name should also be plural, since the function can return multiple generators.
I'm a bit on the fence whether we should return a dictionary with multiple generators, or have separate functions for the train and evaluation generator (which might return None). Or maybe we should just return one generator and invoke it twice if we also want an evaluation generator. This last option sounds best but would probably require a change in the way we parse arguments.
Yes! `NotImplementedError`. Agreed.
Replacing `get_` with a more 'cost-associative' word makes sense. Not sure that `create_` fits either: it is a get, as it returns a generator, yet create might not imply a return.
Not quite sure I follow what you're suggesting for the return functions (wrapping the generator in a lambda???), nor the invoking-twice eval generator. Can I get some clarification on that?
> Replacing `get_` with a more 'cost-associative' word makes sense. Not sure that `create_` fits either: it is a get, as it returns a generator, yet create might not imply a return.
To me `create` and `make` do imply a return. But if you have an alternative that fits, that also works for me.
> Not quite sure I follow what you're suggesting for the return functions (wrapping the generator in a lambda???), nor the invoking-twice eval generator. Can I get some clarification on that?
What I mean is that now we have one function that has to return two generators depending on the arguments: a generator for training and one for evaluation. That means each dataset plugin now has to check if the user wants an evaluation generator.
We could also have two functions instead of one function returning two generators. Then a global option can disable evaluation without specific support from the dataset plugins.
Even neater would be one function that can be called multiple times to create multiple generators with different parameters. That would be a simpler interface for the dataset plugin and allow more flexibility in using it. However, that would require being able to divide arguments into separate calls to `make_generator`, and currently our argument parsing isn't capable of doing that.
We could explicitly have `create_training_generator` and `create_evaluation_generator`; is this what you mean? We can then catch `NotImplemented` exceptions within the test script if a dataset doesn't provide the means to train or evaluate?
Typically, the difference between evaluation and training generators is just one or two baked-in args. The majority of the args passed to that function are used by both anyhow, which then makes it sensible to return both generators.
> We could explicitly have `create_training_generator` and `create_evaluation_generator`; is this what you mean?
That is one option I was considering, yes.
> We can then catch `NotImplemented` exceptions within the test script if a dataset doesn't provide the means to train or evaluate?
`NotImplemented` is not allowed for the training generator, so it wouldn't be caught for that. As for `create_evaluation_generator`, that could simply have a default implementation that returns `None`.
I do prefer splitting because it removes the responsibility of checking whether `--no-evaluation` was passed from each dataset plugin. Suppose we want to rename that flag in the future; that would be pretty difficult if each dataset plugin has to check it.
Ideally I don't even want plugins to have access to global options, but right now that might complicate things too far. But at least splitting the functions is a step in the right direction.
I agree it should be the script's responsibility whether to actually use evaluation. It is still entirely possible to have a dataset which isn't meant to be trained on and just provides an evaluation mechanism; think transfer learning without finetuning, for example. I'm thinking beyond train.py a little bit here. A dataset just provides what it can provide in those contexts.
The base plugin will have `NotImplementedError` for those functions. Datasets should override those and either return a generator or `None`. Any script calling `create_evaluation_generator` or `create_training_generator` should check for `not None` prior to utilising the result. And even then we can wrap the use of a generator in a `try` block to catch anything unexpected.
> I agree it should be the script's responsibility whether to actually use evaluation. It is still entirely possible to have a dataset which isn't meant to be trained on and just provides an evaluation mechanism.
True. Treating the train generator specially doesn't really make sense from the plugin's point of view.
> The base plugin will have `NotImplementedError` for those functions. Datasets should override those and either return a generator or `None`. Any script calling `create_evaluation_generator` or `create_training_generator` should check for `not None` prior to utilising the result. And even then we can wrap the use of a generator in a `try` block to catch anything unexpected.
We could also make them all return `None` in the base interface, I suppose. Or document that if a dataset plugin doesn't support one type of dataset generator, it should leave it raising a `NotImplementedError`. Otherwise each script that wants to use a specific generator needs to check for both `None` and `NotImplementedError`.
keras_retinanet/bin/train.py
Outdated
```python
try:
    getattr(plugin.plugin_object, 'dataset_type')
except AttributeError:
    print("Plugin ({}) does not contain a definition for dataset_type and cannot be utilised.".format(plugin.name))
```
I think it would make sense to use the warnings module [1] for this. If we ever expose some way to modify the warning handling, that can be used to let the user turn these into errors on request.
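For example (`plugin` as in the snippet above), replacing the print with:

```python
import warnings

warnings.warn("Plugin ({}) does not contain a definition for dataset_type and cannot be utilised.".format(plugin.name))
```

Running the script with `python -W error train.py ...` would then escalate the warning into an exception without any code changes.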
I looked at using `logging.warning` for things like this. I was considering some built-in logger/warning system from Python, but saw no evidence of its use currently, so held off. Again, it can always be future PRs which introduce logging on a wider scale than just this feature. Definitely agree on the use of logging/warnings over debug prints, though.
It's done internally in a few places, but not in the scripts right now. Still, it's never too early to start doing things the right way.
But it can be a separate PR if you prefer (and it doesn't have to be you doing it ^^).
Volunteering? ;)
Wasn't really my intention, no :)
keras_retinanet/bin/train.py
Outdated
```python
# Load plugins first, as their procedures are needed for parsing args.
print("Loading plugins...")
load_plugins(['plugins'])
```
I'd like to expose the plugin path to users. Ideally, the plugin path comes from the following places:
- Command line flags.
- An environment variable.
- Some built-in default suitable for packaging. (That generally means a packager has to be able to set it during a configure step. Not sure how Python programs normally deal with this.)
Ideally command line flags give enough control to precisely determine the plugin path as desired. Default behaviour should probably be to append to the built-in path(s) and any paths from the environment, but allow disabling built-in paths and paths from the environment.
This isn't necessarily required before merging, but it's definitely something I want to have up to scratch. It would require changing argument parsing into at least a two-stage process.
Before merging though, I don't think it currently picks up the built-in plugins? It only looks in the `plugins` folder in the current working directory now, it seems.
I believe the plugin folder should remain static, as part of RetinaNet. E.g. most games that utilise this pattern have a singular defined plugins folder: a central location for users to augment. I think if we start having multiple places to look, things become decentralised far too quickly. An organised location for encapsulating datasets seems more sensible than a wild-west approach, personally. Thoughts?
I think one big value of a plugin system is to allow third party contributions without modifying the source of the project. So allowing the loading of plugins from user-defined locations seems logical to me.
However, the current plugin path seems to be relative to the working directory. If we want to have a plugins folder in the keras-retinanet package, we should probably make it relative to the location of the python code?
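Something along these lines would anchor the built-in folder to the package rather than the working directory (the layout is an assumption):

```python
import os

# Resolve the bundled plugins folder relative to this source file,
# not the current working directory.
BUILTIN_PLUGIN_PATH = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'plugins')
```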
I'm not 100% sure with regard to the concreteness of paths in relation to the module. But typically, solutions which utilise plugin systems have a designated folder for dropping 'modules'/plugins into. Adding more command line arguments will just complicate use of RetinaNet. A dedicated, singular 'bin' to drop things into reduces the amount of checking required; if we have 50 paths to check for plugins, that's a lot of folders, and it makes error checking harder.
The idea is that these plugins are designed to run with RetinaNet, so I don't see a need for them to be elsewhere really. If people need other directories as development areas, either copy them in or symlink. Multiple folder locations certainly isn't the convention within industry from what I've seen; however, I'm happy to be proven wrong - I just don't see the added benefit.
If the goal is to have one specific directory with additional plugins, we don't even need something like yapsy. We could just have that directory be a python module and iterate over the submodules.
However, I always keep in mind the end goal for any software project: having it packaged and installed system-wide by a package manager (such as pacman). That means its files and directories will be root-owned, and users will not be able to drop in any plugins. Additionally, users may want to have different sets of plugins if they're unlucky enough that plugins they're interested in have claimed conflicting names.
That's why I think the user should have full control over the plugin path used by any application. And to keep things easy for the user, there should be useful defaults that work most of the time.
But again, this is not crucial for the first step. If we have a fixed path, though, it must at least always pick up the default bundled datasets.
> If the goal is to have one specific directory with additional plugins, we don't even need something like yapsy. We could just have that directory be a python module and iterate over the submodules.
As stated before, that was the first attempt for a plugin system; however, Yapsy handles a lot of error-handling and compatibility stuff. Especially as it's a pure-Python implementation, we would simply be reinventing the wheel, and it would take longer to put all the error checking in place.
> However, I always keep in mind the end goal for any software project: having it packaged and installed system-wide by a package manager (such as pacman). That means its files and directories will be root-owned, and users will not be able to drop in any plugins.
I see your point. Indeed, if it is installed via a package manager and the plugins folder is owned by root, it may become more difficult for users to actually utilise the system. It then becomes important to augment plugins via some user-space locations.
I'd say flags, but we're getting to the stage where the incantations are huge; it's almost at a point where loading a JSON/XML config file is more user friendly than args. I might lean towards environment variable use. It makes sense that users would define some region on their system for user-level scripts, and packages can define environment variables for the user when installing to the system. E.g. `RETINANET_PLUGIN_DIR` would be `~/.retinanet_plugins/`, for example.
The perfect solution for determining the plugin path, to me, is a combination of all of them: command line flags, an environment variable and some built-in default path(s), in that order of priority.
But an initial solution which only uses some built-in default(s) and an environment variable is also good. Even the environment bit is optional as far as I'm concerned for the first version.
If you do add an environment variable, I'd go for `RETINANET_PLUGIN_PATH` though, and allow colon-separated paths (like `$PATH`).
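A sketch of that priority order (flags, then environment, then built-ins); `BUILTIN_PLUGIN_PATH` is the hypothetical package-relative default sketched earlier:

```python
import os

def resolve_plugin_path(cli_paths=None):
    # Priority: command line flags > RETINANET_PLUGIN_PATH > built-in default.
    paths = list(cli_paths or [])
    env = os.environ.get('RETINANET_PLUGIN_PATH', '')
    if env:
        paths.extend(env.split(os.pathsep))  # colon-separated on POSIX, like $PATH
    paths.append(BUILTIN_PLUGIN_PATH)
    return paths
```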
Force-pushed from 66940b0 to 27a2613.
Did a handful of the changes, primarily renames and error handling. Also redid the merge commit so that you won't have any issues. The reason not everything is done is that I believe there's more discussion to be had in future PRs.
- Documentation of the API for plugins.
In my mind this would constitute 3 logical features.
I would like this to get merged, as it does become rather hard to rebase all these changes and still try to maintain a pretty history. The work left to do I think needs more design discussion, and can happen as additional PRs once this has been worked out.
EDIT: The last two commits relate to adding Yapsy as a dependency, both via setup and for Travis. I'm unfamiliar with how you're doing CI, so let me know if those need removing.
I must admit I didn't thoroughly read every part of the discussion, but here's my 2 cents:
We can't merge as-is until at least a few problems are resolved. I branched off from this PR on the dataset-plugins branch, rebased again, and I'm working through some remaining issues. I'll open a new PR when I think it's ready for review.
Force-pushed from 6bfedce to fd9a8a9.
Merged into dataset-plugins branch.
PR for issue #284. Plugin system for handling encapsulation of datasets. This includes generators, checking of argparse arguments, and definition of the subparser for a given dataset.
Future work will include specific evaluation methods etc. If it can be self-contained, it can be a candidate.
As discussed with @de-vri-es: yes, this introduces a dependency; however, Yapsy is very minimal and is a pure-Python implementation. Early attempts for this PR looked at rolling our own plugin-based system, but the import system of Python is fickle at the best of times. Yapsy maintains 2.7 and 3.x compatibility and does a lot of the edge-case detection already. The design benefits of this approach should outweigh one additional dependency via pip.
Should a bespoke solution be written in future, it is entirely possible to replace the few lines here and there where we invoke commands on plugins, as architecturally they should be synonymous.
I understand this is quite the design shift. Any questions, feel free to ask. Heavy critique and review is welcomed.