Dataset encapsulation with Plugin Manager #291
Conversation
Thanks for the PR. I think this will greatly help with maintaining the scripts in the bin folder :)
See the inline comments for feedback.
keras_retinanet/bin/train.py
Outdated
```python
    :return: None
    """

    pm = PluginManagerSingleton.get()
```
I prefer to avoid any use of global state (including singletons). Would it be an option to simply return a list of all loaded plugins from this function? That has the added benefit that it would be really easy to add a fallback that doesn't use yapsy: we only have to return a fixed list of dataset plugins.
Or maybe even turn it into a dictionary mapping `dataset_type` to `plugin_object`. Then the consumer of the result has to know nothing about yapsy at all.
If for some reason we really need a plugin manager object (though it doesn't look like we do), then I'd prefer to create a new plugin manager and return it from this function.
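For illustration, a minimal sketch of that suggestion, assuming yapsy's standard `PluginManager` API (the function name is just a placeholder):

```python
from yapsy.PluginManager import PluginManager

def load_plugins(paths):
    # Create a local manager instead of relying on a process-wide singleton.
    manager = PluginManager()
    manager.setPluginPlaces(paths)   # directories to scan for *.yapsy-plugin files
    manager.collectPlugins()         # locate, load and instantiate the plugins
    return manager.getAllPlugins()   # plain list of PluginInfo objects; no global state
```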
The elegance of the singleton is that it should be accessible from everywhere. I don't necessarily think that creating this at program start and passing it everywhere is very elegant. If it's needed in a new location (quite feasible), then it becomes a pain to weave it through. The reason I chose the `Singleton` variant is just so that we can't accidentally make a secondary plugin system and have two lots of state.
Are you saying to create a dictionary mapping that lives within the train script? This removes one advantage of the plugin system. Currently the training implementation needs to know nothing about a plugin's specific implementation; it allows plugins to describe themselves and what they respond to. If we make a static mapping, surely that defeats the purpose of this mechanism?
Perhaps the naming of this function is a misnomer. It more accurately reflects plugin discovery and initialisation: it searches based on defined parameters and info files, and creates objects from those via `collectPlugins()`.
> The elegance of the singleton is that it should be accessible from everywhere. I don't necessarily think that creating this at program start and passing it everywhere is very elegant. If it's needed in a new location (quite feasible), then it becomes a pain to weave it through. The reason I chose the `Singleton` variant is just so that we can't accidentally make a secondary plugin system and have two lots of state.
The flip side is that you get a near-invisible dependency on the singleton in potentially many places. That makes it harder to switch to a different plugin system if we ever want to. It's also not clear to me whether the singleton will be sufficient when we want to add a second plugin type (for backbone selection, for example).
> Are you saying to create a dictionary mapping that lives within the train script? This removes one advantage of the plugin system. Currently the training implementation needs to know nothing about a plugin's specific implementation; it allows plugins to describe themselves and what they respond to. If we make a static mapping, surely that defeats the purpose of this mechanism?
What I mean is that we can build a dictionary of dataset types to plugin objects from the list of plugins found by the plugin manager. That way we can completely hide (and decouple) the system used to find the plugins from the train script.
The added advantage is that a fallback to a hardcoded dictionary is really easy to implement so that yapsy can remain an optional dependency.
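A rough sketch of that fallback idea, with the bundled plugin classes (`CocoPlugin`, `PascalPlugin`, `CsvPlugin`) as hypothetical placeholder names:

```python
try:
    from yapsy.PluginManager import PluginManager
except ImportError:
    PluginManager = None  # yapsy stays an optional dependency

def load_plugins(paths):
    if PluginManager is None:
        # Fallback: fixed mapping of the bundled dataset plugins
        # (CocoPlugin, PascalPlugin and CsvPlugin are placeholder names).
        return {'coco': CocoPlugin(), 'pascal': PascalPlugin(), 'csv': CsvPlugin()}
    manager = PluginManager()
    manager.setPluginPlaces(paths)
    manager.collectPlugins()
    # Build the dataset_type -> plugin_object mapping from the discovered plugins.
    return {p.plugin_object.dataset_type: p.plugin_object for p in manager.getAllPlugins()}
```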
> The flip side is that you get a near-invisible dependency on the singleton in potentially many places. That makes it harder to switch to a different plugin system if we ever want to. It's also not clear to me whether the singleton will be sufficient when we want to add a second plugin type (for backbone selection, for example).
Implementing another plugin system would just be a case of either utilising its singleton, or wrapping it in our own singleton to make sure only one exists. At the time of choosing the singleton we weren't requiring it for backbone type, though I had considered it. In that case we would just create wrappers for the `Dataset` and `Backbone` plugin systems if we wanted to keep them separate.
> What I mean is that we can build a dictionary of dataset types to plugin objects from the list of plugins found by the plugin manager. That way we can completely hide (and decouple) the system used to find the plugins from the train script.
Sounds like more evidence for creating a util wrapper to me? We could create our own util `DatasetPluginManager` which handles the creation of the plugin system, initialisation and loading, and has some utility functions like `load_plugins` which return this dict.
From the sounds of it, that util wrapper would just act as a middleman between usage within train and the implementation of the plugin solution.
> From the sounds of it, that util wrapper would just act as a middleman between usage within train and the implementation of the plugin solution.
Yes, with the advantage that it decouples the scripts from the plugin manager and provides an easy minimal interface that can be shared with the other scripts.
Sounds good. Proposed API for this? Anything in particular you'd need from it?
I'm thinking:
- Initialisation of the plugin manager against the 'plugins' folder.
- Load plugins - searches for plugins, collects them, and instantiates them.
- getPlugins - returns the dict of name -> plugin object mappings.
I was thinking simply a single function: `load_plugins(paths)`. The one argument would be a list of search paths; the returned value a dictionary mapping `dataset_type` to `plugin_object`.
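On the consumer side, that interface might look something like this in train.py (a sketch only; `args` and `transform_generator` are assumed to come from the surrounding script):

```python
plugins = load_plugins(['plugins'])   # e.g. {'coco': <plugin>, 'pascal': <plugin>, ...}

plugin = plugins.get(args.dataset_type)
if plugin is None:
    raise ValueError("unknown dataset type: {}".format(args.dataset_type))
generators = plugin.get_generator(args, transform_generator=transform_generator)
```

The train script never touches yapsy directly; swapping the discovery mechanism later would only change `load_plugins`.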
```ini
; limitations under the License.

[Core]
Name = Coco Dataset
```
Would it be an idea to use the plugin name here to mean the name of the subparser? So in this case that would be `pascal`. Instead of an attribute called `dataset_type` on the plugin object, we could then just use the plugin name to look them up.
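For reference, yapsy plugin description files are INI-style, so under this suggestion one might look roughly like the following (the `Module` value is illustrative):

```ini
[Core]
Name = pascal
Module = pascal_voc_plugin

[Documentation]
Description = PASCAL VOC dataset support
```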
The spirit of the Info file denotes that as the name. I think it should remain human readable, as its intended use is for relaying back to users. The justification behind `dataset_type` is that we have a common ident for a given dataset. E.g. VOC is PASCAL VOC, yet we only use `pascal` as the ident. It allows datasets to be correctly named (with even more info) and also to self-identify their shorthand.
On the other hand, finding out the command line argument for a plugin now requires looking at the source of the plugin. To me it makes sense if that information comes from the plugin description file.
Well, ideally plugins would have their own documentation. However, because the subparser decorator adds entries to the subparser, if a user types an incorrect flag it should correct them to say it's not of type <csv|coco|voc>. I believe this was the case during my testing. Remember, these plugins should be able to exist separately from RetinaNet, perhaps residing as a pip package, etc. I think it's very important to keep high-level human readability in mind, like with every other package/plugin system.
I don't want human readability to come at the cost of humans being able to easily see which plugin belongs to which subparser. Considering that we already have a human readable description field, I think the name is a perfect candidate for being the link between the plugin file and the subparser name.
However, the name and description reside in the plugin domain, not the RetinaNet domain. For instance, I may have multiple plugins which define 'coco' for me, maybe taken from different years etc. With the plugin system I can programmatically enable/disable plugins and choose which are active, which may be of use in test cases or scripts users may want to write. I don't believe description to be a suitable replacement for name; both are required together.
I don't find that very compelling. If the user wants to enable/disable plugins, they could do that by modifying the used plugin path, or, like systemd does, by creating/removing symlinks (even Windows has those nowadays). The user should not have to modify code to enable or disable plugins. To me that sort of defeats the purpose of having a plugin system.
The main thing a dataset plugin does is hook in on the creation of a dataset generator. At the place where that hooking occurs, the plugin is identified by a string, so that string is already an identifier for the plugin. I don't want to have two different identifiers.
Slight sidestep: does yapsy allow access to the name of the plugin description file?
keras_retinanet/bin/train.py
Outdated
```python
        generators = plugin.plugin_object.get_generator(args, transform_generator=transform_generator)
        break
else:
    print("Plugin ({}) does not contain a definition for get_generator and cannot be utilised".format(plugin.name))
```
Same here, a warning would make sense. /edit: GitHub doesn't order the comments the way I expected. I expected this comment to come first: #291 (comment)
I would also prefer to move the checking of plugins into the `load_plugins` function (possibly by delegating to a `check_dataset_plugin` function) and just assume they are correct here.
Delegation of the checking sounds like a good idea. Similar justification to how check args works currently.
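A possible shape for that delegated check; the required attribute names here track the discussion and are assumptions, not final:

```python
import warnings

REQUIRED_ATTRIBUTES = ('dataset_type', 'register_parser_args', 'get_generator')

def check_dataset_plugin(plugin):
    # Reject plugins that don't provide the full dataset interface.
    for attribute in REQUIRED_ATTRIBUTES:
        if not hasattr(plugin.plugin_object, attribute):
            warnings.warn("Plugin ({}) does not define '{}' and cannot be utilised.".format(plugin.name, attribute))
            return False
    return True
```

`load_plugins` could then simply filter: `[p for p in manager.getAllPlugins() if check_dataset_plugin(p)]`.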
keras_retinanet/utils/plugin.py
Outdated
```python
        super(DatasetPlugin, self).__init__()

    def parser_args(self, parser):
```
This function name implies that it returns the definitions of parser arguments, but what it has to do is register its arguments (or in this case subparser) with the given parser. I would prefer a name along the lines of `register_arguments` or `register_parser_args` or some such.
Also, since this is the base for all other plugins, I think we should really document the expected behaviour of these functions well.
`register_parser_args` sounds very fitting.
Yes, documenting the base API sounds like a good call.
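As a sketch of what that documentation might look like on the base class (the behavioural details are assumptions, not final):

```python
from yapsy.IPlugin import IPlugin

class DatasetPlugin(IPlugin):
    """Base class for dataset plugins."""

    def register_parser_args(self, subparsers):
        """Register this dataset's command line interface.

        Implementations should add a subparser (typically named after
        their dataset_type) and attach any dataset-specific arguments.
        """
        raise NotImplementedError('register_parser_args must be implemented by the plugin')
```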
keras_retinanet/utils/plugin.py
Outdated
```python
        pass

    def check_args(self, parsed_args):
        pass
```
Is `pass` the best default implementation here? The function should return a boolean, so either `return False`/`return True` or `raise NotImplementedError(...)` seems like a better choice.
I believe the way our check args has worked previously assumes that if nothing was raised, then it's fine: we run checks and raise where appropriate. Therefore, perhaps `pass` is suitable?
Ah, the old `check_args()` returns the arguments. Bit odd, but not a big problem. With the changes here I see that `pass` is indeed a good default implementation for `check_args()`.
keras_retinanet/utils/plugin.py
Outdated
```python
    def check_args(self, parsed_args):
        pass

    def get_generator(self, args):
```
Here raising a `NotImplementedError` definitely seems like the best default implementation.
Regarding the name of the function, `get_...` to me implies a cheap retrieval operation. A dataset might do arbitrary things when being constructed, including reading files from disk or even the internet. In general it's not expected to be a cheap operation, so I would prefer a name like `make_generators` or `create_generators`. In its current form the name should also be plural, since the function can return multiple generators.
I'm a bit on the fence whether we should return a dictionary with multiple generators, or have separate functions for the train and evaluation generator (which might return None). Or maybe we should just return one generator and invoke it twice if we also want an evaluation generator. This last option sounds best but would probably require a change in the way we parse arguments.
Yes! `NotImplementedError`. Agreed.
Replacing `get_` with a more 'cost-associative' word makes sense. Not sure that `create_` fits either: it is a get, as it returns a generator, yet create might not imply a return.
Not quite sure I follow what you're suggesting for the return functions (wrapping the generator in a lambda???), nor the invoking-twice eval generator. Can I get some clarification on that?
> Replacing `get_` with a more 'cost-associative' word makes sense. Not sure that `create_` fits either: it is a get, as it returns a generator, yet create might not imply a return.
To me `create` and `make` do imply a return. But if you have an alternative that fits, that also works for me.
> Not quite sure I follow what you're suggesting for the return functions (wrapping the generator in a lambda???), nor the invoking-twice eval generator. Can I get some clarification on that?
What I mean is that now we have one function that has to return two generators depending on the arguments: a generator for training and one for evaluation. That means each dataset plugin now has to check if the user wants an evaluation generator.
We could also have two functions instead of one function returning two generators. Then a global option can disable evaluation without specific support from the dataset plugins.
Even neater would be one function that can be called multiple times to create multiple generators with different parameters. That would be a simpler interface for the dataset plugin and allow more flexibility in using it. However, that would require being able to divide arguments into separate calls to `make_generator`, and currently our argument parsing isn't capable of doing that.
We could explicitly have `create_training_generator` and `create_evaluation_generator`; is this what you mean? We can then catch `NotImplemented` exceptions within the test script if a dataset doesn't provide the means to train or evaluate?
Typically, the difference between evaluation and training generators is just one or two baked-in args. The majority of the args passed to that function are used by both anyhow, which then makes it sensible to return both generators.
> We could explicitly have `create_training_generator` and `create_evaluation_generator`; is this what you mean?
That is one option I was considering, yes.
> We can then catch `NotImplemented` exceptions within the test script if a dataset doesn't provide the means to train or evaluate?
`NotImplemented` is not allowed for the training generator, so it wouldn't be caught for that. As for `create_evaluation_generator`, that could simply have a default implementation that returns `None`.
I do prefer splitting because it removes the responsibility of checking whether `--no-evaluation` was passed from each dataset plugin. Suppose we want to rename that flag in the future; that would be pretty difficult if each dataset plugin has to check it.
Ideally I don't even want plugins to have access to global options, but right now that might complicate things too far. But at least splitting the functions is a step in the right direction.
I agree it should be the script's responsibility whether to actually use evaluation. It is still entirely possible to have a dataset which isn't meant to be trained on and just provides an evaluation mechanism; think transfer learning without finetuning, for example. I'm thinking beyond train.py a little bit here. A dataset just provides what it can provide in those contexts.
The base plugin will have `NotImplementedError` for those functions. Datasets should override those and either return a generator or `None`. Any script calling `create_evaluation_generator` or `create_training_generator` should check for `not None` prior to utilising the result. And even then we can wrap the use of a generator in a `try` block to catch anything unexpected.
> I agree it should be the script's responsibility whether to actually use evaluation. It is still entirely possible to have a dataset which isn't meant to be trained on and just provides an evaluation mechanism.
True. Treating the train generator specially doesn't really make sense from the plugin's point of view.
> The base plugin will have `NotImplementedError` for those functions. Datasets should override those and either return a generator or `None`. Any script calling `create_evaluation_generator` or `create_training_generator` should check for `not None` prior to utilising the result. And even then we can wrap the use of a generator in a `try` block to catch anything unexpected.
We could also make them all return `None` in the base interface, I suppose. Or document that if a dataset plugin doesn't support one type of dataset generator, it should leave it raising a `NotImplementedError`. Otherwise each script that wants to use a specific generator needs to check for both `None` and `NotImplementedError`.
keras_retinanet/bin/train.py
Outdated
```python
try:
    getattr(plugin.plugin_object, 'dataset_type')
except AttributeError:
    print("Plugin ({}) does not contain a definition for dataset_type and cannot be utilised.".format(plugin.name))
```
I think it would make sense to use the warnings module [1] for this. If we ever expose some way to modify the warning handling, that can be used to let the user turn these into errors on request.
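For example (`plugin` as in the snippet above), replacing the print with:

```python
import warnings

warnings.warn("Plugin ({}) does not contain a definition for dataset_type and cannot be utilised.".format(plugin.name))
```

Running the script with `python -W error train.py ...` would then escalate the warning into an exception without any code changes.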
I looked at using `logging.warning` for things like this. I was considering some built-in logger/warning system from Python, but saw no evidence of its use currently, so held off. Again, it can always be future PRs which introduce logging on a wider scale than just this feature. Definitely agree on the use of logging/warnings over debug prints, though.
It's done internally in a few places, but not in the scripts right now. Still, it's never too early to start doing things the right way.
But it can be a separate PR if you prefer (and it doesn't have to be you doing it ^^).
Volunteering? ;)
Wasn't really my intention, no :)
keras_retinanet/bin/train.py
Outdated
```python
# Load plugins first, as their procedures are needed for parsing args.
print("Loading plugins...")
load_plugins(['plugins'])
```
I'd like to expose the plugin path to users. Ideally, the plugin path comes from the following places:
- Command line flags.
- An environment variable.
- Some built-in default suitable for packaging. (That generally means a packager has to be able to set it during a configure step. Not sure how Python programs normally deal with this.)
Ideally command line flags give enough control to precisely determine the plugin path as desired. Default behaviour should probably be to append to the built-in path(s) and any paths from the environment, but allow disabling built-in paths and paths from the environment.
This isn't necessarily required before merging, but it's definitely something I want to have up to scratch. It would require changing argument parsing into at least a two-stage process.
Before merging though, I don't think it currently picks up the built-in plugins? It only looks in the `plugins` folder in the current working directory now, it seems.
I believe the plugin folder should remain static, as part of RetinaNet. E.g. most games that utilise this pattern have a singular defined plugins folder: a central location for users to augment. I think if we start having multiple places to look, things become decentralised far too quickly. An organised location for encapsulating datasets seems more sensible than a wild-west approach, personally. Thoughts?
I think one big value of a plugin system is to allow third party contributions without modifying the source of the project. So allowing the loading of plugins from user-defined locations seems logical to me.
However, the current plugin path seems to be relative to the working directory. If we want to have a plugins folder in the keras-retinanet package, we should probably make it relative to the location of the python code?
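Something along these lines would anchor the built-in folder to the package rather than the working directory (the layout is an assumption):

```python
import os

# Resolve the bundled plugins folder relative to this source file,
# not the current working directory.
BUILTIN_PLUGIN_PATH = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'plugins')
```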
I'm not 100% sure with regard to the concreteness of paths in relation to the module. But typically, solutions which utilise plugin systems have a designated folder for dropping 'modules'/plugins into. Adding more command line arguments will just complicate use of RetinaNet. A dedicated, singular 'bin' to drop things into reduces the amount of checking required; if we have 50 paths to check for plugins, that's a lot of folders, and it makes error checking harder.
The idea is that these plugins are designed to run with RetinaNet, so I don't see a need for them to be elsewhere really. If people need other directories as development areas, either copy them in or symlink. Multiple folder locations certainly isn't the convention within industry from what I've seen; however, I'm happy to be proven wrong - I just don't see the added benefit.
If the goal is to have one specific directory with additional plugins, we don't even need something like yapsy. We could just have that directory be a python module and iterate over the submodules.
However, I always keep in mind the end goal for any software project: having it packaged and installed system-wide by a package manager (such as pacman). That means its files and directories will be root-owned, and users will not be able to drop in any plugins. Additionally, users may want to have different sets of plugins if they're unlucky enough that plugins they're interested in have claimed conflicting names.
That's why I think the user should have full control over the plugin path used by any application. And to keep things easy for the user, there should be useful defaults that work most of the time.
But again, this is not crucial for the first step. If we have a fixed path, though, it must at least always pick up the default bundled datasets.
> If the goal is to have one specific directory with additional plugins, we don't even need something like yapsy. We could just have that directory be a python module and iterate over the submodules.
As stated before, that was the first attempt for a plugin system; however, Yapsy handles a lot of error-handling and compatibility stuff. Especially as it's a pure-Python implementation, we would simply be reinventing the wheel, and it would take longer to put all the error checking in place.
> However, I always keep in mind the end goal for any software project: having it packaged and installed system-wide by a package manager (such as pacman). That means its files and directories will be root-owned, and users will not be able to drop in any plugins.
I see your point. Indeed, if it is installed via a package manager and the plugins folder is owned by root, it may become more difficult for users to actually utilise the system. It then becomes important to augment plugins via some user-space locations.
I'd say flags, but we're getting to the stage where the incantations are huge; it's almost at a point where loading a JSON/XML config file is more user friendly than args. I might lean towards environment variable use. It makes sense that users would define some region on their system for user-level scripts, and packages can define environment variables for the user when installing to the system. E.g. `RETINANET_PLUGIN_DIR` would be `~/.retinanet_plugins/`, for example.
The perfect solution for determining the plugin path, to me, is a combination of all of them: command line flags, an environment variable and some built-in default path(s), in that order of priority.
But an initial solution which only uses some built-in default(s) and an environment variable is also good. Even the environment bit is optional as far as I'm concerned for the first version.
If you do add an environment variable, I'd go for `RETINANET_PLUGIN_PATH` though, and allow colon-separated paths (like `$PATH`).
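A sketch of that priority order (flags, then environment, then built-ins); `BUILTIN_PLUGIN_PATH` is the hypothetical package-relative default sketched earlier:

```python
import os

def resolve_plugin_path(cli_paths=None):
    # Priority: command line flags > RETINANET_PLUGIN_PATH > built-in default.
    paths = list(cli_paths or [])
    env = os.environ.get('RETINANET_PLUGIN_PATH', '')
    if env:
        paths.extend(env.split(os.pathsep))  # colon-separated on POSIX, like $PATH
    paths.append(BUILTIN_PLUGIN_PATH)
    return paths
```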
Force-pushed from 66940b0 to 27a2613.
Did a handful of the changes, primarily renames and error handling. Also redid the merge commit so that you won't have any issues. The reason not everything is done is that I believe there's more discussion to be had in future PRs.
- Documentation of the API for plugins.
In my mind this would constitute 3 logical features.
I would like this to get merged, as it does become rather hard to rebase all these changes and still try to maintain a pretty history. The work left to do I think needs more design discussion, and can happen as additional PRs once this has been worked out.
EDIT: The last two commits relate to adding Yapsy as a dependency, both via setup and for Travis. I'm unfamiliar with how you're doing CI, so let me know if those need removing.
I must admit I didn't thoroughly read every part of the discussion, but here's my 2 cents:
We can't merge as-is until at least a few problems are resolved. I branched off from this PR on the dataset-plugins branch, rebased again, and I'm working through some remaining issues. I'll open a new PR when I think it's ready for review.
Force-pushed from 6bfedce to fd9a8a9.
Merged into dataset-plugins branch.
PR for issue #284. Plugin system for handling encapsulation of datasets. This includes generators, checking of argparse arguments, and definition of the subparser for a given dataset.
Future work will include specific evaluation methods etc. If it can be self-contained, it can be a candidate.
As discussed with @de-vri-es: yes, this introduces a dependency; however, Yapsy is very minimal and is a pure-Python implementation. Early attempts for this PR looked at rolling our own plugin-based system, but the import system of Python is fickle at the best of times. Yapsy maintains 2.7 and 3.x compatibility and does a lot of the edge-case detection already. The design benefits of this approach should outweigh one additional dependency via pip.
Should a bespoke solution be written in future, it is entirely possible to replace the few lines here and there where we invoke commands on plugins, as architecturally they should be synonymous.
I understand this is quite the design shift. Any questions, feel free to ask. Heavy critique and review is welcomed.