-
Notifications
You must be signed in to change notification settings - Fork 0
Description
get_filename is currently used in three places in EarthbeamDAG and does two things, strip off parent directories and strip off the extension of the path it is given:
- In
run_earthmover, it's used to construct the Earthmover output directory and state file. In this case, we probably do intend to remove the extension, since we pass in the path to a file, not a directory. - In
upload_to_s3, it's used to construct the S3 filepath to write to. This is used for bothupload_raw_to_s3andupload_em_to_s3tasks; for the former, it is passed the path to the raw file, for which we probably intend to remove the extension, but in the latter, it's passedearthmover_results['data_dir']. For Sharefile files where data_dir may contain periods in the name, this means we further modify the last part of the S3 path. - And in
run_lightbeam, it's used to create the path for the results file, used for both therun_lightbeam_validateandrun_lightbeamtasks. Both of these are passedearthmover_results['data_dir']; again, for Sharefile files where this directory may have periods in the name, this may be unexpected behavior.
This is causing some downstream impact in assessment load status reporting being shown to partners. In cases where multiple files in the same Sharefile run contain periods and have the same 'prefix' (e.g. sat_files_2025.01.01_part_1.csv and sat_files_2025.01.01_part_2.csv), this results in Lightbeam results being written to the same file (e.g. '/efs/emlb/results/.../sat_files_2025.01/lightbeam_send_results.json') and then at least one set of results being omitted from log_lb_to_snowflake. If we rely on S3 results, this means we are also writing Earthmover results (but not raw files) to potentially duplicate S3 paths.
A few fixes come to mind:
- Don't call
EarthbeamDAG.get_filenamewhen we need to strip parent directories off of a directory; just useos.path.basenamedirectly. In the case ofupload_to_s3, we'd need some extra logic to check what we're being passed. We would changerun_lightbeamto not useget_filenameas well. - Modify
EarthbeamDAG.get_filenameto only strip off the extension if it's passed a file, not a directory. We are only dealing with local paths (so far), soos.path.isdirshould work fine for the check. This means minimal modification to current code outside ofget_filename. - Don't strip off file extensions at any point and remove that behavior from
EarthbeamDAG.get_filename; just wrapos.path.basename. The less processing we do to the original filename, the lower the chance we do anything unexpected to it. However, this might lead to breaks in the historical EMLB logging data.