Description
Several of our text classifiers use the same tokenizer. Instead of re-tokenizing for each classifier, we should allow the output of a single tokenization step to be reused by multiple models in a pipeline.
This is almost possible today through Curator's composite stages, but the model stages currently drop the tokenized columns once classification completes. We should add a drop_tokens boolean to control this behavior here:
- https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/stages/text/classifiers/base.py#L142
- https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/stages/text/classifiers/fineweb_edu.py#L124
- https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/stages/text/classifiers/prompt_task_complexity.py#L266
- https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/stages/text/classifiers/aegis.py#L221
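A minimal sketch of what the flag could look like. The ModelStage class, column names, and the placeholder "inference" below are hypothetical and do not reflect Curator's actual API; the point is only that drop_tokens=True reproduces today's behavior while drop_tokens=False leaves token columns in place for a downstream classifier.

```python
# Hypothetical sketch of a drop_tokens flag on a model stage.
# Batches are modeled as plain dicts of columns for illustration.
from dataclasses import dataclass

TOKEN_COLUMNS = ("input_ids", "attention_mask")  # assumed tokenizer outputs


@dataclass
class ModelStage:
    pred_column: str
    drop_tokens: bool = True  # True matches the current behavior

    def process(self, batch: dict) -> dict:
        # Placeholder "inference": score each document by its token count.
        batch[self.pred_column] = [len(ids) for ids in batch["input_ids"]]
        if self.drop_tokens:
            # Current behavior: token columns are removed after classification.
            for col in TOKEN_COLUMNS:
                batch.pop(col, None)
        return batch


batch = {
    "text": ["a b", "c"],
    "input_ids": [[1, 2], [3]],
    "attention_mask": [[1, 1], [1]],
}
out_keep = ModelStage("quality", drop_tokens=False).process(dict(batch))
out_drop = ModelStage("quality", drop_tokens=True).process(dict(batch))
```

With drop_tokens=False, out_keep still carries input_ids and attention_mask, so a second model stage can consume them without re-tokenizing.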
Then, we can update our documentation to explain how to reuse the same tokens across multiple classifiers (i.e., how to construct a pipeline of the form TokenizerStage -> ModelStage1 -> ModelStage2 -> ...). This documentation should include a list of which text classifiers share a tokenizer.
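The pipeline shape described above can be sketched as follows. The stage functions here are stand-ins (whitespace splitting instead of a real subword tokenizer, token counts instead of real model scores), not Curator's actual classes; they only illustrate that tokenization happens once and every consumer except the last keeps the token column alive.

```python
# Hypothetical TokenizerStage -> ModelStage1 -> ModelStage2 pipeline:
# tokenize once, then let each classifier reuse the shared tokens.
def tokenizer_stage(batch):
    # Whitespace "tokenization" stands in for a real subword tokenizer.
    batch["input_ids"] = [text.split() for text in batch["text"]]
    return batch


def make_model_stage(pred_column, drop_tokens):
    def stage(batch):
        # Placeholder "inference": score each document by its token count.
        batch[pred_column] = [len(ids) for ids in batch["input_ids"]]
        if drop_tokens:
            del batch["input_ids"]
        return batch
    return stage


pipeline = [
    tokenizer_stage,
    make_model_stage("edu_score", drop_tokens=False),  # keep tokens for the next stage
    make_model_stage("domain", drop_tokens=True),      # last consumer drops them
]

batch = {"text": ["hello world", "hi"]}
for stage in pipeline:
    batch = stage(batch)
```

Only the final classifier sets drop_tokens=True, so the token column is cleaned up once no stage needs it anymore.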
As future work, we could handle this logic on the Curator side rather than requiring the user to interact with tokenizer and model stages directly. In other words, Curator could detect when the tokens generated for one classifier can be reused later in the same pipeline by another classifier.
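One way the detection could work, sketched under the assumption that each classifier exposes some tokenizer identifier (the tokenizer_name key below is hypothetical): group classifiers by tokenizer, so Curator could insert a single shared tokenization step per group.

```python
# Hypothetical planning step: group classifier stages by tokenizer so each
# tokenizer runs once and feeds every classifier in its group.
from itertools import groupby

classifiers = [
    {"name": "fineweb_edu", "tokenizer_name": "deberta-v3-base"},
    {"name": "quality", "tokenizer_name": "deberta-v3-base"},
    {"name": "aegis", "tokenizer_name": "llama-guard"},
]


def plan_pipeline(classifiers):
    """Return (tokenizer_name, [classifier names]) pairs, one per shared tokenizer."""
    keyed = sorted(classifiers, key=lambda c: c["tokenizer_name"])
    return [
        (tokenizer, [c["name"] for c in group])
        for tokenizer, group in groupby(keyed, key=lambda c: c["tokenizer_name"])
    ]


plan = plan_pipeline(classifiers)
```

Each resulting group would then expand to one TokenizerStage followed by its classifiers, with only the last classifier in the group dropping the token columns.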