
Commit 8a113f7

ryantwolf authored and philm001 committed
Enforce Dataframe Backend Checks (NVIDIA-NeMo#514)
* Add module and to backend
  Signed-off-by: Ryan Wolf <[email protected]>
* Add backend tests
  Signed-off-by: Ryan Wolf <[email protected]>
* Fix tests
  Signed-off-by: Ryan Wolf <[email protected]>
* Add switch backend tests
  Signed-off-by: Ryan Wolf <[email protected]>
* Update modules to use module interface
  Signed-off-by: Ryan Wolf <[email protected]>
* Directly invoke module init
  Signed-off-by: Ryan Wolf <[email protected]>
* Fix call method
  Signed-off-by: Ryan Wolf <[email protected]>
* Fix shuffle call method
  Signed-off-by: Ryan Wolf <[email protected]>
* Add docs and more tests
  Signed-off-by: Ryan Wolf <[email protected]>
* Fix list formatting in docs
  Signed-off-by: Ryan Wolf <[email protected]>
* Address Sarah and Praateek's reviews
  Signed-off-by: Ryan Wolf <[email protected]>
* Fix modifier get_backend to backend
  Signed-off-by: Ryan Wolf <[email protected]>
* Address Ayush's review
  Signed-off-by: Ryan Wolf <[email protected]>
---------
Signed-off-by: Ryan Wolf <[email protected]>
1 parent e7f064d commit 8a113f7

3 files changed: +3 -5 lines changed

nemo_curator/modules/add_id.py

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ class AddId(BaseModule):
     def __init__(
         self, id_field, id_prefix: str = "doc_id", start_index: Optional[int] = None
     ) -> None:
-        super().__init__(input_backend="any")
+        super().__init__(input_backend="pandas")
         self.id_field = id_field
         self.id_prefix = id_prefix
         self.start_index = start_index

nemo_curator/modules/exact_dedup.py

Lines changed: 1 addition & 1 deletion
@@ -146,7 +146,7 @@ def hash_documents(
         # TODO: Generalize ty using self.hash_method
         return df.apply(lambda x: md5(x.encode()).hexdigest())
 
-    def identify_duplicates(self, dataset: DocumentDataset) -> DocumentDataset:
+    def call(self, dataset: DocumentDataset) -> Union[DocumentDataset, str]:
         """
         Find document ID's for exact duplicates in a given DocumentDataset
         Parameters

nemo_curator/modules/fuzzy_dedup/fuzzyduplicates.py

Lines changed: 1 addition & 3 deletions
@@ -131,9 +131,7 @@ def __init__(
             profile_dir=self.config.profile_dir,
         )
 
-    def identify_duplicates(
-        self, dataset: DocumentDataset
-    ) -> Optional[DocumentDataset]:
+    def call(self, dataset: DocumentDataset):
         """
         Parameters
         ----------
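
Illustrative sketch (not part of the commit): the hunks above show modules passing an input_backend to the base-class constructor and exposing their logic through a call method. One way such a check could be wired together is sketched below. Every name here (ToyDataset, ToyBaseModule, ToyAddId, the SUPPORTED tuple, the error messages) is hypothetical and chosen for illustration; the real BaseModule and DocumentDataset in nemo_curator may behave differently.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ToyDataset:
    """Hypothetical stand-in for DocumentDataset; it only records its backend."""

    rows: list
    backend: str  # "pandas" or "cudf" in this sketch


class ToyBaseModule:
    """Hypothetical base class: checks the backend, then delegates to call()."""

    SUPPORTED = ("pandas", "cudf", "any")  # assumed set of backend names

    def __init__(self, input_backend: str) -> None:
        if input_backend not in self.SUPPORTED:
            raise ValueError(f"Unknown input_backend: {input_backend!r}")
        self.input_backend = input_backend

    def call(self, dataset: ToyDataset) -> Any:
        raise NotImplementedError("Subclasses implement call()")

    def __call__(self, dataset: ToyDataset) -> Any:
        # Enforce the declared backend before running the module's own logic.
        if self.input_backend != "any" and dataset.backend != self.input_backend:
            raise ValueError(
                f"{type(self).__name__} expects a {self.input_backend!r} backend, "
                f"got {dataset.backend!r}"
            )
        return self.call(dataset)


class ToyAddId(ToyBaseModule):
    """Toy analogue of AddId: pairs each row with a sequential ID."""

    def __init__(self, id_prefix: str = "doc_id") -> None:
        super().__init__(input_backend="pandas")
        self.id_prefix = id_prefix

    def call(self, dataset: ToyDataset) -> ToyDataset:
        rows = [(f"{self.id_prefix}-{i}", row) for i, row in enumerate(dataset.rows)]
        return ToyDataset(rows=rows, backend=dataset.backend)
```

In this sketch, calling ToyAddId() on a dataset whose backend is "pandas" runs call(), while a "cudf" dataset is rejected with a ValueError before any module logic executes. Keeping the check in __call__ means subclasses only implement call() and declare the backend they need.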
