Allow the same tokenizer stage to be used for multiple classifiers #1236

@sarahyurick

Several of our text classifiers use the same tokenizer. Instead of tokenizing once per classifier, we should allow the output of a single tokenization step to be reused by multiple models in a pipeline.

This is already almost possible through Curator's use of composite stages, but the model stages currently drop the tokenized columns once classification is complete. We should add a drop_tokens boolean to control this behavior.
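To make the proposal concrete, here is a rough sketch of how such a flag could behave. This is not Curator's model stage code: `ToyModelStage`, the `TOKEN_COLUMNS` names, and the dict-of-lists batch format are all assumptions made for illustration, and the default of `True` is only a guess at what would preserve the current behavior.

```python
from dataclasses import dataclass

# Hypothetical column names; Curator's real tokenized columns may differ.
TOKEN_COLUMNS = ("input_ids", "attention_mask")


@dataclass
class ToyModelStage:
    """Stand-in for a model stage; not Curator's actual class."""

    output_column: str
    drop_tokens: bool = True  # proposed flag; True mimics today's behavior

    def process(self, batch: dict) -> dict:
        out = dict(batch)
        # Fake "classification": score each row by its token count.
        out[self.output_column] = [len(ids) for ids in batch["input_ids"]]
        if self.drop_tokens:
            # Remove the tokenized columns once this stage is done with them.
            for col in TOKEN_COLUMNS:
                out.pop(col, None)
        return out


batch = {
    "text": ["hello world", "hi"],
    "input_ids": [[101, 7592, 2088, 102], [101, 7632, 102]],
    "attention_mask": [[1, 1, 1, 1], [1, 1, 1]],
}
kept = ToyModelStage("pred_a", drop_tokens=False).process(batch)
dropped = ToyModelStage("pred_b").process(batch)
```

With `drop_tokens=False` the tokenized columns survive for a later stage; with the default they are removed, as happens today.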

Then, we can update our documentation for how to use the same tokens for multiple classifiers (i.e., how to construct a pipeline that is TokenizerStage -> ModelStage1 -> ModelStage2 ...). This documentation should include a list of which text classifiers use the same tokenizers.
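A toy version of the TokenizerStage -> ModelStage1 -> ModelStage2 layout might look like the following. Everything here is a stand-in (plain functions, a whitespace "tokenizer", and made-up output names `quality_pred` and `domain_pred`), so it only illustrates the ordering and the role of drop_tokens, not how Curator pipelines are actually constructed.

```python
def tokenizer_stage(batch: dict, calls: list) -> dict:
    """Toy whitespace tokenizer; records each call so reuse can be checked."""
    calls.append(1)
    out = dict(batch)
    out["input_ids"] = [text.split() for text in batch["text"]]
    return out


def make_model_stage(name: str, drop_tokens: bool):
    """Build a toy model stage that reads `input_ids` and writes column `name`."""

    def stage(batch: dict) -> dict:
        out = dict(batch)
        out[name] = [len(ids) for ids in batch["input_ids"]]
        if drop_tokens:
            del out["input_ids"]
        return out

    return stage


calls = []
pipeline = [
    lambda b: tokenizer_stage(b, calls),
    make_model_stage("quality_pred", drop_tokens=False),  # keep tokens for the next model
    make_model_stage("domain_pred", drop_tokens=True),  # last consumer cleans up
]

batch = {"text": ["a b c", "d e"]}
for stage in pipeline:
    batch = stage(batch)
```

The tokenizer runs once, both model stages read the same tokens, and only the final model stage drops them.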

As future work, we can consider handling this logic on the Curator side, instead of the user needing to directly interact with tokenizer and model stages. In other words, we can try to add logic to detect if generated tokens for a classifier can be used later within the same pipeline for another classifier.
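One possible shape for that detection logic, sketched under heavy assumptions: each classifier is described only by a name and a tokenizer identifier (the `clf_*` and `tok-*` values below are invented), and the planner is free to reorder classifiers so that those sharing a tokenizer run back to back. Whether reordering is acceptable in a real pipeline is an open question.

```python
def plan_stages(classifiers):
    """Group classifiers by tokenizer and emit one tokenize step per group.

    `classifiers` is a list of (classifier_name, tokenizer_id) pairs.
    Returns a list of steps: ("tokenize", tokenizer_id) or
    ("classify", classifier_name, drop_tokens).
    """
    groups = {}  # dicts preserve insertion order, so first-seen tokenizers go first
    for name, tokenizer_id in classifiers:
        groups.setdefault(tokenizer_id, []).append(name)

    plan = []
    for tokenizer_id, names in groups.items():
        plan.append(("tokenize", tokenizer_id))
        for i, name in enumerate(names):
            # Only the last classifier in a group drops the shared tokens.
            plan.append(("classify", name, i == len(names) - 1))
    return plan


# Hypothetical classifier/tokenizer names, for illustration only.
plan = plan_stages([("clf_a", "tok-x"), ("clf_b", "tok-y"), ("clf_c", "tok-x")])
```

Here `clf_a` and `clf_c` share one tokenization pass, and `clf_b` gets its own.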

Metadata

Labels

enhancement: New feature or request
