Skip to content

Conversation

@johncmerfeld
Copy link
Contributor

@johncmerfeld johncmerfeld commented Aug 28, 2025

Rework how we think about the config block during project composition. Now uses the following precedence order for config options, from lowest to highest:

  1. Hardcoded global config_defaults
  2. config / parameter_defaults in parent package(s) (throw an error if sibling configs conflict)
  3. config / parameter_defaults in top-level "child" package
  4. -p params passed to the CLI
  5. --set values passed to the CLI

Updated some example projects to test this - 11 and 11a have default and toggle-able behavior . I believe everything the docs say on this subject is correct, given the changes.


Older version of this PR

Consider the following situation:

earthmover run -c airflow/dags/earthmover/assessments/ACCESS/earthmover.yaml -p '{
"INPUT_FILE": earthmover_edfi_bundles/assessments/ACCESS/data/sample_anonymized_file_ACCESS.csv",
"OUTPUT_DIR": "em-output",
"STUDENT_ID_NAME": "StateStudentID",
"SCHOOL_YEAR" : "2024",
"ALTERNATE": "N" }' 

Where the Airflow earthmover.yaml is a wrapper over the primary bundle configuration. The underlying bundle has a default output_dir that can be overridden by the param OUTPUT_DIR, but the wrapper bundle does not take this param. When the above command is run, the files do not go to em-output but to the underlying bundle's default output_dir

Currently, such CLI env var values (e.g. em-output) are not injected during compilation. This is because the config variables are populated during Earthmover.__init__(), and not checked again during compile(), after all packages have been merged.

Questions for reviewers

  1. Is this solution a "correct" usage of the underlying user_configs object? I don't understand the various yaml_parser methods super well
  2. Should we document this behavior? The long comment in the code felt more like user-facing documentation. On the other hand, maybe users just assume this behavior is already happening with project composition

@jayckaiser it would be ideal if you're able to test this out to see if it solves the problem you've been having (or I can do the test if you add me to the relevant server)

@johncmerfeld johncmerfeld self-assigned this Aug 28, 2025
@johncmerfeld johncmerfeld added the bug Something isn't working label Aug 28, 2025
Copy link
Collaborator

@tomreitz tomreitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @johncmerfeld, thanks for this PR, and sorry for taking a while to respond.

I took some time to remember what's going on with state_configs vs. user_configs...

  • state_configs is just the configs block of the main earthmover.yml file; it is read in and set in Earthmover.__init__(), and used in a few places in the Earthmover class.
  • user_configs is constructed in Earthmover.compile() and (eventually) represents the whole earthmover structure that will be run - what's output as earthmover_compiled.yml by earthmover compile.

(The reasons these two are separate are complicated but have to do with the fact that earthmover.yml can contain config.parameter_defaults which are needed prior to compile. Also their names are unclear and could likely be improved.)

As you've noted, the problem is that some important things in state_configs, like the output_dir and tmp_dir may get changed by the compilation process - by packages that get merged in - but state_configs is never updated with this information from user_configs's compiled config block.

I talked to @jayckaiser about this and he has lots of ideas to improve these parts of earthmover's code, including possibly

  • rethinking whether we need separate state_configs and user_configs or if they could be merged (and better-named!)
  • minimizing the number of places where state (whatever it's called) get's modified
  • moving anything in earthmover.py that uses output_dir (mainly creating the dir if it doesn't exist) to nodes/destination.py (the only place it's used)
  • moving all the metadata elements to top-level variables, to make updating one element of metadata less cumbersome

However, to get this issue fixed, we'd like to prioritize getting this PR merged and not hold it up with larger structural changes. To that end...

The two core lines you've added

self.state_configs.update(full_user_configs.get("config", {}))

self.metadata['output_dir'] = self.state_configs['output_dir']

mirror corresponding lines in __init__() here and here. There's also a third important line that should be updated as well - I believe this is the source of Jay's troubles overriding the tmp_dir in SC earthbeam runs, so we should make sure this PR also fixes it.

We're thinking it would make sense to factor this out to a separate little function (to avoid duplicating this code) like

def update_state_configs(self, configs):
    # Prepare the output directory for destinations.
    self.state_configs['output_dir'] = os.path.expanduser(configs['output_dir'])

    # Set the temporary directory in cases of disk-spillage.
    dask.config.set({'temporary_directory': configs['tmp_dir']})

    # Update metadata output_dir
    self.metadata["output_dir"] = configs["output_dir"]

which can then be called at the end of __init__() as

self.update_state_configs(self.state_configs)

and from the area you're modifying in compile() as

self.user_configs = self.merge_packages() or self.user_configs
self.user_configs = self.inject_cli_overrides(self.user_configs)
self.update_state_configs(self.user_configs.get("config"))
if to_disk:
    self.user_configs.to_disk(self.compiled_yaml_file)

(Note that, by moving those last two to_disk lines down, the above also fixes a bug where the compiled YAML file was outputted to disk before inject_cli_overrides(), resulting in earthmover_compiled.yml potentially not actually reflecting what will be run.)

Hope this all makes sense, let me know if you have questions, would like to look through it together, or if you prefer I make these changes to this branch.

@johncmerfeld
Copy link
Contributor Author

Hey @johncmerfeld, thanks for this PR, and sorry for taking a while to respond.

I took some time to remember what's going on with state_configs vs. user_configs...

  • state_configs is just the configs block of the main earthmover.yml file; it is read in and set in Earthmover.__init__(), and used in a few places in the Earthmover class.
  • user_configs is constructed in Earthmover.compile() and (eventually) represents the whole earthmover structure that will be run - what's output as earthmover_compiled.yml by earthmover compile.

(The reasons these two are separate are complicated but have to do with the fact that earthmover.yml can contain config.parameter_defaults which are needed prior to compile. Also their names are unclear and could likely be improved.)

As you've noted, the problem is that some important things in state_configs, like the output_dir and tmp_dir may get changed by the compilation process - by packages that get merged in - but state_configs is never updated with this information from user_configs's compiled config block.

I talked to @jayckaiser about this and he has lots of ideas to improve these parts of earthmover's code, including possibly

  • rethinking whether we need separate state_configs and user_configs or if they could be merged (and better-named!)
  • minimizing the number of places where state (whatever it's called) get's modified
  • moving anything in earthmover.py that uses output_dir (mainly creating the dir if it doesn't exist) to nodes/destination.py (the only place it's used)
  • moving all the metadata elements to top-level variables, to make updating one element of metadata less cumbersome

However, to get this issue fixed, we'd like to prioritize getting this PR merged and not hold it up with larger structural changes. To that end...

The two core lines you've added

self.state_configs.update(full_user_configs.get("config", {}))

self.metadata['output_dir'] = self.state_configs['output_dir']

mirror corresponding lines in __init__() here and here. There's also a third important line that should be updated as well - I believe this is the source of Jay's troubles overriding the tmp_dir in SC earthbeam runs, so we should make sure this PR also fixes it.

We're thinking it would make sense to factor this out to a separate little function (to avoid duplicating this code) like

def update_state_configs(self, configs):
    # Prepare the output directory for destinations.
    self.state_configs['output_dir'] = os.path.expanduser(configs['output_dir'])

    # Set the temporary directory in cases of disk-spillage.
    dask.config.set({'temporary_directory': configs['tmp_dir']})

    # Update metadata output_dir
    self.metadata["output_dir"] = configs["output_dir"]

which can then be called at the end of __init__() as

self.update_state_configs(self.state_configs)

and from the area you're modifying in compile() as

self.user_configs = self.merge_packages() or self.user_configs
self.user_configs = self.inject_cli_overrides(self.user_configs)
self.update_state_configs(self.user_configs.get("config"))
if to_disk:
    self.user_configs.to_disk(self.compiled_yaml_file)

(Note that, by moving those last two to_disk lines down, the above also fixes a bug where the compiled YAML file was outputted to disk before inject_cli_overrides(), resulting in earthmover_compiled.yml potentially not actually reflecting what will be run.)

Hope this all makes sense, let me know if you have questions, would like to look through it together, or if you prefer I make these changes to this branch.

Thanks @tomreitz - this makes sense and I'm glad to hear it; I was on teh fence about factoring some of this out so glad there's an even more compelling reason to do so. I should be able to get to this later this afternoon. But if you want to just go in and make the change before then, feel free to just ping me and do so. Otherwise I'll assume I'll do it and push something today!

@tomreitz
Copy link
Collaborator

I've pushed the changes discussed here, per your message @johncmerfeld. Please take a look and let us know if you have any questions/concerns.

I'll re-tag @jayckaiser for another review. Thanks!

@tomreitz tomreitz self-requested a review September 12, 2025 16:43
@johncmerfeld johncmerfeld marked this pull request as draft October 31, 2025 16:43
@johncmerfeld johncmerfeld changed the title bugfix: Projects with imported packages not injecting CLI env vars into underlying configuration feature: Introduce precedence order for resolving configuration when using project composition Oct 31, 2025
@johncmerfeld johncmerfeld marked this pull request as ready for review October 31, 2025 19:03
@johncmerfeld johncmerfeld requested review from jayckaiser and tomreitz and removed request for jayckaiser and tomreitz October 31, 2025 19:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants