Dataset file cleanup #2625

Open
cdbethune opened this issue May 19, 2021 · 2 comments

@cdbethune
Collaborator

We accumulate a lot of dataset-related files in $D3MOUTPUTDIR and $D3MTEMPDIR that should be cleaned up. Ideally, the augmented directory would hold the only copy of the data needed, and any other copies or archives created during import and similar steps would be removed.

Some examples:

  • CSV and ZIP files that are copied over as part of an import from the client
  • Copies of datasets generated during import
  • Contents of the batch subdirectory (associated with import)

The tempdir contains all the data generated by model runs. Before removing old files there, we would need to assess how the system behaves when those files are missing (for instance, how it affects caching).

@phorne-uncharted
Contributor

A PR was merged that added a background process that deletes files. Currently, ZIP and CSV files, along with anything else used during import, get deleted. The batches get deleted as well.

Leaving this issue open so that a few more things can be deleted as needed (soft-deleted datasets? pipeline outputs?).
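
For illustration only, here is a minimal Python sketch of the kind of cleanup pass described above. It is not the code from the merged PR. The function names, the extension list, the age threshold, and the choice to protect a directory named `augmented` are all assumptions made for the example.

```python
import os
import time
from pathlib import Path

# Assumption: these extensions mark import leftovers (client uploads, archives).
IMPORT_EXTENSIONS = {".csv", ".zip"}


def find_stale_import_files(root, keep_dirs=("augmented",),
                            max_age_seconds=24 * 3600, now=None):
    """Return import leftovers under root older than max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune protected directories in place so os.walk never descends into them.
        dirnames[:] = [d for d in dirnames if d not in keep_dirs]
        for name in filenames:
            path = Path(dirpath) / name
            if path.suffix.lower() not in IMPORT_EXTENSIONS:
                continue
            if now - path.stat().st_mtime > max_age_seconds:
                stale.append(path)
    return sorted(stale)


def cleanup(root, dry_run=True, **kwargs):
    """Delete stale import files; with dry_run=True, only report them."""
    stale = find_stale_import_files(root, **kwargs)
    if not dry_run:
        for path in stale:
            path.unlink()
    return stale
```

A background job could call `cleanup(os.environ["D3MOUTPUTDIR"], dry_run=False)` on a timer. Defaulting to `dry_run=True` makes it easy to review what would be removed before anything is deleted.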

@phorne-uncharted
Contributor

A reorg of the folder structure should probably be done to better insulate different parts of the system.

Specifically, deleting datasets is currently problematic because cloning creates downstream dependencies between datasets. Storing all resources in a separate folder would remove that dependency. Other aspects of the folder structure may also need revisiting to make better use of disk storage.
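
To make the clone dependency concrete, here is a small hypothetical Python sketch. It assumes each dataset has a folder and a list of resource file paths, and that a clone may point at files inside its source dataset's folder. The data model, the paths, and the function name are invented for the example and do not reflect the actual storage layout.

```python
from pathlib import PurePosixPath


def dependents_of(dataset_id, dataset_folders, resource_paths):
    """Return IDs of other datasets whose resources live inside dataset_id's folder.

    dataset_folders: hypothetical mapping of dataset ID -> folder path.
    resource_paths: hypothetical mapping of dataset ID -> list of resource file paths.
    """
    folder = PurePosixPath(dataset_folders[dataset_id])
    blocked_by = []
    for other_id, paths in resource_paths.items():
        if other_id == dataset_id:
            continue
        # A resource depends on the folder if the folder is one of its ancestors.
        if any(folder in PurePosixPath(p).parents for p in paths):
            blocked_by.append(other_id)
    return sorted(blocked_by)


folders = {"a": "/out/a", "a_clone": "/out/a_clone"}

# Current situation (assumed): the clone reuses files inside the source folder.
nested = {
    "a": ["/out/a/tables/data.csv"],
    "a_clone": ["/out/a/tables/data.csv"],
}

# Proposed layout (assumed): resources live in a separate shared folder.
shared = {
    "a": ["/out/resources/r1/data.csv"],
    "a_clone": ["/out/resources/r1/data.csv"],
}
```

Under these assumptions, `dependents_of("a", folders, nested)` reports that `a_clone` would break if `/out/a` were deleted, while the same call with `shared` reports no dependents. Shared resources would still need their own lifetime tracking, for example by reference counting.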


2 participants