Description
Several of our text classifiers use the same tokenizer. Instead of re-tokenizing for each classifier, we should allow the output of a single tokenization step to be reused by multiple models in a pipeline.
This is almost possible today through Curator's composite stages, but the model stages currently drop the tokenized columns once classification completes. We should add a drop_tokens boolean to control this behavior here:
- https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/stages/text/classifiers/base.py#L142
- https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/stages/text/classifiers/fineweb_edu.py#L124
- https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/stages/text/classifiers/prompt_task_complexity.py#L266
- https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/stages/text/classifiers/aegis.py#L221
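A minimal sketch of what the flag could look like. The ModelStage class, column names, and the placeholder "inference" below are hypothetical and do not reflect Curator's actual API; the point is only that drop_tokens=True reproduces today's behavior while drop_tokens=False leaves token columns in place for a downstream classifier.

```python
# Hypothetical sketch of a drop_tokens flag on a model stage.
# Batches are modeled as plain dicts of columns for illustration.
from dataclasses import dataclass

TOKEN_COLUMNS = ("input_ids", "attention_mask")  # assumed tokenizer outputs


@dataclass
class ModelStage:
    pred_column: str
    drop_tokens: bool = True  # True matches the current behavior

    def process(self, batch: dict) -> dict:
        # Placeholder "inference": score each document by its token count.
        batch[self.pred_column] = [len(ids) for ids in batch["input_ids"]]
        if self.drop_tokens:
            # Current behavior: token columns are removed after classification.
            for col in TOKEN_COLUMNS:
                batch.pop(col, None)
        return batch


batch = {
    "text": ["a b", "c"],
    "input_ids": [[1, 2], [3]],
    "attention_mask": [[1, 1], [1]],
}
out_keep = ModelStage("quality", drop_tokens=False).process(dict(batch))
out_drop = ModelStage("quality", drop_tokens=True).process(dict(batch))
```

With drop_tokens=False, out_keep still carries input_ids and attention_mask, so a second model stage can consume them without re-tokenizing.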
Then, we can update our documentation to explain how to reuse the same tokens across multiple classifiers (i.e., how to construct a pipeline of the form TokenizerStage -> ModelStage1 -> ModelStage2 -> ...). This documentation should include a list of which text classifiers share a tokenizer.
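The pipeline shape described above can be sketched as follows. The stage functions here are stand-ins (whitespace splitting instead of a real subword tokenizer, token counts instead of real model scores), not Curator's actual classes; they only illustrate that tokenization happens once and every consumer except the last keeps the token column alive.

```python
# Hypothetical TokenizerStage -> ModelStage1 -> ModelStage2 pipeline:
# tokenize once, then let each classifier reuse the shared tokens.
def tokenizer_stage(batch):
    # Whitespace "tokenization" stands in for a real subword tokenizer.
    batch["input_ids"] = [text.split() for text in batch["text"]]
    return batch


def make_model_stage(pred_column, drop_tokens):
    def stage(batch):
        # Placeholder "inference": score each document by its token count.
        batch[pred_column] = [len(ids) for ids in batch["input_ids"]]
        if drop_tokens:
            del batch["input_ids"]
        return batch
    return stage


pipeline = [
    tokenizer_stage,
    make_model_stage("edu_score", drop_tokens=False),  # keep tokens for the next stage
    make_model_stage("domain", drop_tokens=True),      # last consumer drops them
]

batch = {"text": ["hello world", "hi"]}
for stage in pipeline:
    batch = stage(batch)
```

Only the final classifier sets drop_tokens=True, so the token column is cleaned up once no stage needs it anymore.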
As future work, we could handle this logic on the Curator side rather than requiring the user to interact with tokenizer and model stages directly. In other words, Curator could detect when the tokens generated for one classifier can be reused later in the same pipeline by another classifier.
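One way the detection could work, sketched under the assumption that each classifier exposes some tokenizer identifier (the tokenizer_name key below is hypothetical): group classifiers by tokenizer, so Curator could insert a single shared tokenization step per group.

```python
# Hypothetical planning step: group classifier stages by tokenizer so each
# tokenizer runs once and feeds every classifier in its group.
from itertools import groupby

classifiers = [
    {"name": "fineweb_edu", "tokenizer_name": "deberta-v3-base"},
    {"name": "quality", "tokenizer_name": "deberta-v3-base"},
    {"name": "aegis", "tokenizer_name": "llama-guard"},
]


def plan_pipeline(classifiers):
    """Return (tokenizer_name, [classifier names]) pairs, one per shared tokenizer."""
    keyed = sorted(classifiers, key=lambda c: c["tokenizer_name"])
    return [
        (tokenizer, [c["name"] for c in group])
        for tokenizer, group in groupby(keyed, key=lambda c: c["tokenizer_name"])
    ]


plan = plan_pipeline(classifiers)
```

Each resulting group would then expand to one TokenizerStage followed by its classifiers, with only the last classifier in the group dropping the token columns.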