-
Notifications
You must be signed in to change notification settings - Fork 5
feature: Introduce precedence order for resolving configuration when using project composition #167
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
tomreitz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @johncmerfeld, thanks for this PR, and sorry for taking a while to respond.
I took some time to remember what's going on with state_configs vs. user_configs...
state_configsis just theconfigsblock of the mainearthmover.ymlfile; it is read in and set inEarthmover.__init__(), and used in a few places in theEarthmoverclass.user_configsis constructed inEarthmover.compile()and (eventually) represents the whole earthmover structure that will be run - what's output asearthmover_compiled.ymlbyearthmover compile.
(The reasons these two are separate are complicated but have to do with the fact that earthmover.yml can contain config.parameter_defaults which are needed prior to compile. Also their names are unclear and could likely be improved.)
As you've noted, the problem is that some important things in state_configs, like the output_dir and tmp_dir may get changed by the compilation process - by packages that get merged in - but state_configs is never updated with this information from user_configs's compiled config block.
I talked to @jayckaiser about this and he has lots of ideas to improve these parts of earthmover's code, including possibly
- rethinking whether we need separate
state_configsanduser_configsor if they could be merged (and better-named!) - minimizing the number of places where state (whatever it's called) get's modified
- moving anything in
earthmover.pythat usesoutput_dir(mainly creating the dir if it doesn't exist) tonodes/destination.py(the only place it's used) - moving all the
metadataelements to top-level variables, to make updating one element of metadata less cumbersome
However, to get this issue fixed, we'd like to prioritize getting this PR merged and not hold it up with larger structural changes. To that end...
The two core lines you've added
self.state_configs.update(full_user_configs.get("config", {}))
self.metadata['output_dir'] = self.state_configs['output_dir']mirror corresponding lines in __init__() here and here. There's also a third important line that should be updated as well - I believe this is the source of Jay's troubles overriding the tmp_dir in SC earthbeam runs, so we should make sure this PR also fixes it.
We're thinking it would make sense to factor this out to a separate little function (to avoid duplicating this code) like
def update_state_configs(self, configs):
# Prepare the output directory for destinations.
self.state_configs['output_dir'] = os.path.expanduser(configs['output_dir'])
# Set the temporary directory in cases of disk-spillage.
dask.config.set({'temporary_directory': configs['tmp_dir']})
# Update metadata output_dir
self.metadata["output_dir"] = configs["output_dir"]which can then be called at the end of __init__() as
self.update_state_configs(self.state_configs)and from the area you're modifying in compile() as
self.user_configs = self.merge_packages() or self.user_configs
self.user_configs = self.inject_cli_overrides(self.user_configs)
self.update_state_configs(self.user_configs.get("config"))
if to_disk:
self.user_configs.to_disk(self.compiled_yaml_file)(Note that, by moving those last two to_disk lines down, the above also fixes a bug where the compiled YAML file was outputted to disk before inject_cli_overrides(), resulting in earthmover_compiled.yml potentially not actually reflecting what will be run.)
Hope this all makes sense, let me know if you have questions, would like to look through it together, or if you prefer I make these changes to this branch.
Thanks @tomreitz - this makes sense and I'm glad to hear it; I was on teh fence about factoring some of this out so glad there's an even more compelling reason to do so. I should be able to get to this later this afternoon. But if you want to just go in and make the change before then, feel free to just ping me and do so. Otherwise I'll assume I'll do it and push something today! |
|
I've pushed the changes discussed here, per your message @johncmerfeld. Please take a look and let us know if you have any questions/concerns. I'll re-tag @jayckaiser for another review. Thanks! |
Rework how we think about the config block during project composition. Now uses the following precedence order for config options, from lowest to highest:
config_defaultsconfig/parameter_defaultsin parent package(s) (throw an error if sibling configs conflict)config/parameter_defaultsin top-level "child" package-pparams passed to the CLI--setvalues passed to the CLIUpdated some example projects to test this - 11 and 11a have default and toggle-able behavior . I believe everything the docs say on this subject is correct, given the changes.
Older version of this PR
Consider the following situation:
Where the Airflow earthmover.yaml is a wrapper over the primary bundle configuration. The underlying bundle has a default
output_dirthat can be overridden by the paramOUTPUT_DIR, but the wrapper bundle does not take this param. When the above command is run, the files do not go toem-outputbut to the underlying bundle's defaultoutput_dirCurrently, such CLI env var values (e.g.
em-output) are not injected during compilation. This is because the config variables are populated duringEarthmover.__init__(), and not checked again duringcompile(), after all packages have been merged.Questions for reviewers
user_configsobject? I don't understand the various yaml_parser methods super well@jayckaiser it would be ideal if you're able to test this out to see if it solves the problem you've been having (or I can do the test if you add me to the relevant server)