Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enforce Dataframe Backend Checks #514

Open
wants to merge 14 commits into
base: main
Choose a base branch
from
Open

Conversation

ryantwolf
Copy link
Collaborator

@ryantwolf ryantwolf commented Feb 3, 2025

Description

Refactors all modules to inherit from nemo_curator.modules.base.Module. With this change, the base class now checks to see if the dataframe backend for the Module matches what is passed in, and will throw an error prior to computation if there is a mismatch. We also prove a new module ToBackend() that allows easy chaining of CPU and GPU modules.

Usage

from nemo_curator import Module, Sequential, ToBackend
from nemo_curator.datasets import DocumentDataset

class CPUModule(Module):
    def __init__(self):
        super().__init__(input_backend="pandas")

    def call(self, dataset: DocumentDataset):
        dataset.df["cpu_lengths"] = dataset.df["text"].str.len()
        return dataset


class GPUModule(Module):
    def __init__(self):
        super().__init__(input_backend="cudf")

    def call(self, dataset: DocumentDataset):
        dataset.df["gpu_lengths"] = dataset.df["text"].str.len()
        return dataset

def main():
  client = get_client(cluster_type="gpu")
  # Load a dataset with a pandas CPU backend
  dataset = DocumentDataset.from_pandas(...)
  pipeline = Sequential([
    CPUModule(),
    ToBackend("cudf"),
    GPUModule(),
  ])

  curated_dataset = pipeline(dataset)
  curated_dataset.to_json(...)

if __name__ == "__main__":
  main()

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Ryan Wolf <[email protected]>
Signed-off-by: Ryan Wolf <[email protected]>
Signed-off-by: Ryan Wolf <[email protected]>
Signed-off-by: Ryan Wolf <[email protected]>
Signed-off-by: Ryan Wolf <[email protected]>
Signed-off-by: Ryan Wolf <[email protected]>
@ryantwolf ryantwolf requested review from VibhuJawa, sarahyurick, ayushdg and praateekmahajan and removed request for sarahyurick February 3, 2025 21:57
tests/test_backends.py Outdated Show resolved Hide resolved
docs/user-guide/cpuvsgpu.rst Outdated Show resolved Hide resolved
nemo_curator/modules/__init__.py Show resolved Hide resolved
if self.input_backend == "any":
return

backend = type(ddf._meta).__module__.split(".")[0]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a test case for this? And also type hints for ddf which would be dd.DataFrame

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What part of this do you want to test? I'm not quite seeing which part of this you want me to isolate.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As someone reading this, it's not obviously clear what will be the output of .__module__.split(".")[0] of the type of object's _meta attribute. So I would've proposed that you move this to a function and have tests for that and use that function here.
However I see you've changed this to is_cudf_type(..) which is much more readable, and now we can unittest that directly that.

Copy link
Collaborator

@praateekmahajan praateekmahajan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great. Left a review, with mostly nits around type hints and function renaming and some class abstraction opinions.

@ryantwolf
Copy link
Collaborator Author

@sarahyurick @praateekmahajan thanks for your reviews. I addressed the feedback that I immediately agreed with/understood. I left some follow up comments and questions on the other points.

docs/user-guide/cpuvsgpu.rst Show resolved Hide resolved
nemo_curator/modules/base.py Outdated Show resolved Hide resolved
nemo_curator/modules/base.py Show resolved Hide resolved
nemo_curator/modules/base.py Outdated Show resolved Hide resolved
nemo_curator/modules/to_backend.py Show resolved Hide resolved
tests/test_backends.py Outdated Show resolved Hide resolved
tests/test_backends.py Outdated Show resolved Hide resolved
tests/test_backends.py Outdated Show resolved Hide resolved
tests/test_backends.py Show resolved Hide resolved
Signed-off-by: Ryan Wolf <[email protected]>
Copy link
Collaborator

@sarahyurick sarahyurick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Collaborator

@praateekmahajan praateekmahajan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks for adding this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants