Skip to content

EarthbeamDAG.get_filename strips extensions from directories #116

@alchenist

Description

@alchenist

get_filename is currently used in three places in EarthbeamDAG and does two things, strip off parent directories and strip off the extension of the path it is given:

  • In run_earthmover, it's used to construct the Earthmover output directory and state file. In this case, we probably do intend to remove the extension, since we pass in the path to a file, not a directory.
  • In upload_to_s3, it's used to construct the S3 filepath to write to. This is used for both upload_raw_to_s3 and upload_em_to_s3 tasks; for the former, it is passed the path to the raw file, for which we probably intend to remove the extension, but in the latter, it's passed earthmover_results['data_dir']. For Sharefile files where data_dir may contain periods in the name, this means we further modify the last part of the S3 path.
  • And in run_lightbeam, it's used to create the path for the results file, used for both the run_lightbeam_validate and run_lightbeam tasks. Both of these are passed earthmover_results['data_dir']; again, for Sharefile files where this directory may have periods in the name, this may be unexpected behavior.

This is causing some downstream impact in assessment load status reporting being shown to partners. In cases where multiple files in the same Sharefile run contain periods and have the same 'prefix' (e.g. sat_files_2025.01.01_part_1.csv and sat_files_2025.01.01_part_2.csv), this results in Lightbeam results being written to the same file (e.g. '/efs/emlb/results/.../sat_files_2025.01/lightbeam_send_results.json') and then at least one set of results being omitted from log_lb_to_snowflake. If we rely on S3 results, this means we are also writing Earthmover results (but not raw files) to potentially duplicate S3 paths.

A few fixes come to mind:

  • Don't call EarthbeamDAG.get_filename when we need to strip parent directories off of a directory; just use os.path.basename directly. In the case of upload_to_s3, we'd need some extra logic to check what we're being passed. We would change run_lightbeam to not use get_filename as well.
  • Modify EarthbeamDAG.get_filename to only strip off the extension if it's passed a file, not a directory. We are only dealing with local paths (so far), so os.path.isdir should work fine for the check. This means minimal modification to current code outside of get_filename.
  • Don't strip off file extensions at any point and remove that behavior from EarthbeamDAG.get_filename; just wrap os.path.basename. The less processing we do to the original filename, the lower the chance we do anything unexpected to it. However, this might lead to breaks in the historical EMLB logging data.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions