Update before_dataset_saved to return data so mutations can be applied #4450

Open
datajoely opened this issue Jan 30, 2025 · 1 comment
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

datajoely (Contributor) commented Jan 30, 2025

Description

I have a massive dataset that is too big to iterate on quickly when the full table is used. I wanted to write a hook that would kick in based on the run environment, intercept the data just before the catalog saved it, and apply a limit(n) operation. To my surprise, this turned out not to be possible.

Another possible use case would be something like PII stripping before save.

I solved this problem with a custom dataset but it feels clunky.

Context

The hook spec (like most hooks, actually) returns None:

hookimpl

@hook_spec
def before_dataset_saved(self, dataset_name: str, data: Any, node: Node) -> None:
    """Hook to be invoked before a dataset is saved to the catalog.

    Args:
        dataset_name: name of the dataset to be saved to the catalog.
        data: the actual data to be saved to the catalog.
        node: The ``Node`` that ran.
    """
    pass
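To make the request concrete, here is a minimal sketch of the kind of hook the description has in mind, assuming the proposed behaviour where a non-None return value replaces the data that gets saved. The class name `LimitRowsHook`, its `env` and `max_rows` parameters, and the "prod" environment name are all hypothetical. In a real Kedro project the method would also carry the `@hook_impl` decorator, which is omitted here so the snippet runs without Kedro installed.

```python
from typing import Any


class LimitRowsHook:
    """Hypothetical hook: truncate data before it is saved outside production."""

    def __init__(self, env: str, max_rows: int = 1000) -> None:
        self.env = env            # assumed to be passed in by the project
        self.max_rows = max_rows  # illustrative default

    # In a Kedro project this method would be decorated with @hook_impl.
    def before_dataset_saved(self, dataset_name: str, data: Any, node: Any) -> Any:
        if self.env == "prod":
            # Returning None would mean "save the original data unchanged".
            return None
        if hasattr(data, "limit"):
            # Table-like objects that expose limit(n), e.g. lazy query engines.
            return data.limit(self.max_rows)
        # Fall back to slicing for plain sequences.
        return data[: self.max_rows]
```

With the current spec the return value would simply be ignored, which is the limitation this issue describes.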

Possible Implementation

The runner's task.py calls this hook and could be tweaked to save whatever the hook returns, if anything.

  for name, data in items:
+     if returned_data := hook_manager.hook.before_dataset_saved(
+         dataset_name=name, data=data, node=node
+     ):
+         catalog.save(name, returned_data)
+     else:
          catalog.save(name, data)
      hook_manager.hook.after_dataset_saved(
          dataset_name=name, data=data, node=node
      )
  return node
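One detail the diff above glosses over: Kedro dispatches hooks through pluggy, and a pluggy hook call returns a list holding the non-None results of every registered implementation, not a single value (unless the spec is declared with `firstresult=True`). A runner change would therefore need to unpack that list before saving. The helper below is only a sketch of one possible policy; the name `pick_data_to_save` and the choice to reject multiple mutating hooks are assumptions, not Kedro behaviour.

```python
from typing import Any


def pick_data_to_save(original: Any, hook_results: list) -> Any:
    """Hypothetical helper: choose what the catalog should save.

    `hook_results` stands for the list a pluggy hook call returns,
    i.e. the non-None return values of all registered implementations.
    """
    if not hook_results:
        # No hook returned anything: keep today's behaviour.
        return original
    if len(hook_results) > 1:
        # Assumed policy: refuse to guess which mutation should win.
        raise ValueError(
            f"{len(hook_results)} hooks returned data from before_dataset_saved; "
            "expected at most one"
        )
    return hook_results[0]
```

The runner loop could then call `catalog.save(name, pick_data_to_save(data, results))`. Declaring the spec with `firstresult=True` would be an alternative that lets pluggy return a single value directly.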
datajoely added the "Issue: Feature Request" label Jan 30, 2025
DimedS (Member) commented Jan 31, 2025

Thanks for the proposal, @datajoely! I think that's a reasonable idea - let's implement it!
