Image + text + audio uniform processors #30511
base: main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
cc @edbeeching who also has quite some experience trying to standardise processors when he ported Llava to TRL
👏 👏 👏
What a beautiful PR. Thank you for handling a very tricky problem and applying such an elegant solution. This is going to be great for standardising our processors.
Only request is that we have some tests for the processors. I realise most of the processors don't have tests, so this is possibly better suited to a follow-up PR. Ideally we'd have e.g. a processor mixin that would check e.g. that `processing_kwargs` is implemented. Most important, I think, is just to make sure default behaviour is as expected and `return_tensors` works.
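For illustration, a rough sketch of what such a mixin could check - the mixin name and `get_processor` helper here are hypothetical, not an existing API:

```python
# Sketch of a shared processor test mixin, intended to be mixed into a
# unittest.TestCase subclass. Names are illustrative only.
class ProcessorTesterMixin:
    def get_processor(self):
        # each concrete test class would build and return its processor
        raise NotImplementedError

    def test_processing_kwargs_is_defined(self):
        processor = self.get_processor()
        # every uniformized processor should expose its default kwargs
        self.assertTrue(hasattr(processor, "processing_kwargs"))

    def test_return_tensors(self):
        import torch

        processor = self.get_processor()
        outputs = processor(text="lower newer", return_tensors="pt")
        # an explicit return_tensors should override the default of None
        self.assertIsInstance(outputs["input_ids"], torch.Tensor)
```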
```
@@ -53,6 +61,77 @@
}


class TextKwargs(TypedDict, total=False):
```
This is honestly so beautiful ❤️
agree! One question: where is the doc for these args? 👀
also FMI (but a comment about that would be nice): what does `total=False` do?
Yes, will transfer some doc - with the inherited class solution we can do that nicely :) `total=False` means that not all the arguments are required. Basically, `total=True` will guarantee you that all elements will be found - which is not what we want here.
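A quick illustration of the difference:

```python
from typing import Optional, TypedDict

class TextKwargs(TypedDict, total=False):
    padding: Optional[bool]
    max_length: Optional[int]

# total=False: every key is optional, so partial (or empty) dicts type-check
partial: TextKwargs = {"max_length": 64}
empty: TextKwargs = {}
# with total=True, a type checker would flag both of the above, because it
# would require every declared key to be present
```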
src/transformers/processing_utils.py (outdated)
```python
image_mean: Optional[Union[float, List[float]]]
image_std: Optional[Union[float, List[float]]]
do_pad: Optional[bool]
pad_and_return_pixel_mask: Optional[bool]
```
How easy is it to deprecate these arguments? I'd really like to see `pad_and_return_pixel_mask` (or any `x_and_y`-style flags) removed eventually.
I think with this framework it's deprecate-able for sure :)
src/transformers/processing_utils.py (outdated)
```python
class AudioKwargs(TypedDict, total=False):
    sampling_rate: Optional[int]
    # raw_speech: Optional[List[List[float]]]
```
Why is this one commented out?
ah, because I thought it was a processing kwarg, but it seems to be a deprecated positional arg that was used before `audio` - will remove it altogether!
```python
input_data_format: Optional[Union[str, ChannelDimension]]


class AudioKwargs(TypedDict, total=False):
```
Interesting - so the audio feature extractors have little to no configuration?
I think a few more will need to be added! For example, the `Wav2Vec2FeatureExtractor` accepts:
```python
raw_speech: Union[np.ndarray, List[float], List[np.ndarray], List[List[float]]],
padding: Union[bool, str, PaddingStrategy] = False,
max_length: Optional[int] = None,
truncation: bool = False,
pad_to_multiple_of: Optional[int] = None,
return_attention_mask: Optional[bool] = None,
return_tensors: Optional[Union[str, TensorType]] = None,
sampling_rate: Optional[int] = None,
**kwargs,
```
But most of these use the same naming as a `tokenizer`, so I wanted to gather a few more before deciding on the default.
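For instance, an expanded `AudioKwargs` mirroring that signature could look like this - just a sketch of the direction, not the final set:

```python
from typing import Optional, TypedDict, Union

from transformers.utils import PaddingStrategy

class AudioKwargs(TypedDict, total=False):
    # return_tensors is left out on purpose: it belongs with the common kwargs
    sampling_rate: Optional[int]
    padding: Optional[Union[bool, str, PaddingStrategy]]
    max_length: Optional[int]
    truncation: Optional[bool]
    pad_to_multiple_of: Optional[int]
    return_attention_mask: Optional[bool]
```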
```python
encoding = self.tokenizer(
    text, padding=padding, max_length=max_length, return_tensors=return_tensors, **kwargs
)
text_kwargs = {
```
🤩
```python
# BC for explicit return_tensors
common_kwargs = {**self.processing_kwargs["common_kwargs"], **kwargs}
if "return_tensors" in common_kwargs:
    return_tensors = kwargs.pop("return_tensors", None)
```
Maybe? It's not clear to me here why we check in `common_kwargs` and pop from `kwargs`:
```diff
- return_tensors = kwargs.pop("return_tensors", None)
+ return_tensors = common_kwargs.pop("return_tensors", None)
```
yes, good catch!
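i.e. merging the defaults with the user kwargs first, then popping from the merged dict so an explicit user value wins:

```python
# BC for explicit return_tensors: merge defaults with user kwargs first,
# then pop once from the merged dict so the user value takes precedence
common_kwargs = {**self.processing_kwargs["common_kwargs"], **kwargs}
return_tensors = common_kwargs.pop("return_tensors", None)
```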
```python
image_features = self.image_processor(images, **images_kwargs)

# BC for explicit return_tensors
common_kwargs = {**self.processing_kwargs["common_kwargs"], **kwargs}
```
Why do we prepare common_kwargs here and not use them? Is it for consistency in the processor patterns?
will correct this part in `align` :)
```
max_length (`int`, *optional*, defaults to `max_length`):
    Maximum padding value to use to pad the input text during tokenization.

return_tensors (`str` or [`~utils.TensorType`], *optional*):
```
`return_tensors` is the only one I'm not sure we want to remove from the docstring, as it's something users have to specify if they want to be able to use the outputs for a model.
that's a good point - in general, anything that is common could be here
```python
if pad_and_return_pixel_mask:
    do_pad = pad_and_return_pixel_mask
```
I think I let this slip through the net when I was reviewing. If we're doing this, then we should really have a deprecation message to 1) avoid this dual-arg use and 2) remove `x_and_y`-style arguments.
I think before, it was just popping the `x_and_y` arg from kwargs and assigning it to `do_pad`. I'm all for removing dual args in general, and deprecation warnings are indeed missing from that one - will add that!
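Roughly along these lines - a sketch, exact wording and removal version TBD:

```python
import warnings

if pad_and_return_pixel_mask is not None:
    warnings.warn(
        "`pad_and_return_pixel_mask` is deprecated and will be removed in a "
        "future version. Please use `do_pad` instead.",
        FutureWarning,
    )
    do_pad = pad_and_return_pixel_mask
```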
Alright, should be a bit better! I added some more typing for audio inputs to protect them a bit more. Worked on a test to enforce this structure as well, should be able to push it after 🌉
Looks great!
There are a few tiny nits. Main comment is about the configurability of the processors, and how we control the behaviour.
Would be good to get @ArthurZucker's view on this too, as it's a big and important piece of work!
```python
encodings = self.tokenizer(
    text,
    **text_kwargs,
)
```
nit - should stay on one line:
```diff
- encodings = self.tokenizer(
-     text,
-     **text_kwargs,
- )
+ encodings = self.tokenizer(text, **text_kwargs)
```
yes, don't know why `make fixup` moved it so!
```python
inputs = self.image_processor(
    images,
    **images_kwargs,
```
```diff
- inputs = self.image_processor(
-     images,
-     **images_kwargs,
+ inputs = self.image_processor(images, **images_kwargs
  )
```
```python
return self.current_processor(
    images,
    **images_kwargs,
)
```
```diff
- return self.current_processor(
-     images,
-     **images_kwargs,
- )
+ return self.current_processor(images, **images_kwargs)
```
```python
self.processing_kwargs: ProcessingKwargs = {
    "common_kwargs": {"return_tensors": None},
    "text_kwargs": {
        "text_pair": None,
        "text_target": None,
        "text_pair_target": None,
        "add_special_tokens": True,
        "padding": "max_length",
        "truncation": True,
        "max_length": 64,
        "stride": 0,
        "is_split_into_words": False,
        "pad_to_multiple_of": None,
        "return_token_type_ids": None,
        "return_attention_mask": None,
        "return_overflowing_tokens": False,
        "return_special_tokens_mask": False,
        "return_offsets_mapping": False,
        "return_length": False,
        "verbose": True,
    },
    "images_kwargs": {
        "do_crop_margin": None,
        "do_resize": None,
        "size": None,
        "resample": None,
        "do_thumbnail": None,
        "do_align_long_axis": None,
        "do_pad": None,
        "do_rescale": None,
        "rescale_factor": None,
        "do_normalize": None,
        "image_mean": None,
        "image_std": None,
        "data_format": "channels_first",
        "input_data_format": None,
    },
    "audio_kwargs": {},
    "videos_kwargs": {},
}
```
The only thing I'd say about the current design is the default behaviour of a processor isn't configurable. As users who use multimodal models just want to use the processor, and processors might share the default processing classes they bundle together e.g. CLIP, we might want to be able to override this in the init, such that I can save and load processors with custom behaviour. For example:

```python
from transformers import AlignProcessor

processor = AlignProcessor.from_pretrained("kakaobrain/align-base")
# I have to pass in max_length for every call
output = processor(images=images, text=text, max_length=128)

processor = AlignProcessor.from_pretrained("kakaobrain/align-base", max_length=128)
processor.save_pretrained("my_new_model")
processor = AlignProcessor.from_pretrained("my_new_model")
# The processor automatically defaults to the max value I want when calling
output = processor(images=images, text=text)
```

The correct way might be to say that e.g. the tokenizer behaviour should be configured instead. At the moment, I think if I change the tokenizer and build the processor, it will use the processor's max_length default rather than the tokenizer's.

```python
from transformers import AlignProcessor, AutoTokenizer, AutoImageProcessor

tokenizer = AutoTokenizer.from_pretrained("kakaobrain/align-base", max_length=128)
image_processor = AutoImageProcessor.from_pretrained("kakaobrain/align-base")
processor = AlignProcessor(tokenizer, image_processor)

# Is max_length 128 or 64 used here?
outputs = processor(text=text, images=images)
```
Very nice indeed!
```python
text, padding=padding, max_length=max_length, return_tensors=return_tensors, **kwargs
)
text_kwargs = {
    **self.processing_kwargs["text_kwargs"],
```
So this is basically creating a dictionary with the defaults, and updating it with the kwargs.
- It's very nice
- It's a tad bit not explicit enough for me (for newbies, it would be hard to guess that this is what's happening IMO)

Thus I am wondering if there is a way to leverage the `ProcessingKwargs` class rather than use the class's keys. What I mean is something along the lines of `self.processing_kwargs.update_with_kwargs("text_kwargs", **kwargs)` - could be more understandable? (I don't know)
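For illustration, a minimal sketch of such a helper - `update_with_kwargs` doesn't exist yet, the name is just from the suggestion above, and this assumes `ProcessingKwargs` becomes a real class rather than a `TypedDict`:

```python
class ProcessingKwargs(dict):
    def update_with_kwargs(self, modality_key: str, **kwargs):
        """Return the default kwargs for one modality, with any matching
        user-passed kwargs taking precedence over the defaults."""
        defaults = self[modality_key]
        return {**defaults, **{k: v for k, v in kwargs.items() if k in defaults}}

# inside a processor's __call__, instead of spelling out the dict merge:
# text_kwargs = self.processing_kwargs.update_with_kwargs("text_kwargs", **kwargs)
```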
```python
truncation: Optional[bool]
pad_to_multiple_of: Optional[int]
return_attention_mask: Optional[bool]
return_tensors: Optional[Union[str, TensorType]]
```
this is a common arg, it should not be repeated, no?
```python
images: ImageInput = None,
audio=None,
videos=None,
**kwargs,
```
My main question is: how are type hints and signature scans done for this? Is there a way to say that the kwargs include text kwargs and image kwargs, and benefit from having the doc as well?
https://typing.readthedocs.io/en/latest/spec/callables.html#unpack-kwargs
We might have to simply define a class for each processor, but it seems to me like it's the best compromise, no?
Agree with you @ArthurZucker - I'll update the PR to have that, looks super nice
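For reference, that would look something like this per processor, using `Unpack` from the typing spec linked above - `AlignProcessorKwargs` is a made-up example class, just a sketch:

```python
from typing import Optional, TypedDict

from typing_extensions import Unpack  # available in typing itself from Python 3.11

class AlignProcessorKwargs(TypedDict, total=False):
    # union of the Text/Images kwargs this particular processor accepts
    padding: Optional[bool]
    max_length: Optional[int]

class AlignProcessor:
    def __call__(self, text=None, images=None, **kwargs: Unpack[AlignProcessorKwargs]):
        # type checkers and IDEs can now see exactly which kwargs are
        # accepted, and the TypedDict doubles as a place for their docs
        ...
```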
What does this PR do?
This PR is a stab at uniformizing the processors across all transformers models. If we are happy with the design, I'll expand it to all existing models. It only touches on some text + audio and text + images models as experiment subjects. Linked with #28711 which has a larger scope, and several previous discussions with team members.
What changes:
- A `TypedDict` that does not need to be total. We can expand this typed dict (in `processing_utils`) as processors get uniformized.
- `kwargs` set by a user will always override default processing kwargs.
- Defaults are merged with user kwargs at `call`.

What that allows:
Limitations:
Tests pass (I think). What's missing:
Who can review?
Models: