
Image + text + audio uniform processors #30511

Open · wants to merge 51 commits into base: main

Conversation

@molbap (Contributor) commented Apr 26, 2024

What does this PR do?

This PR is a stab at making the processors uniform across all transformers models. If we are happy with the design, I'll expand it to all existing models. It only touches a few text + audio and text + image models as experiment subjects. It is linked with #28711, which has a larger scope, and with several previous discussions with team members.

What changes:

  • There is now an attribute set in the constructor that stores the processing kwargs that need to be passed to the various encoders down the line.
  • These nested attributes (one dictionary for text, one for images, one for audio, one for video) are typed with a TypedDict that does not need to be total. We can expand this TypedDict (in processing_utils) as processors get uniformized.
  • The processors are called with the same kwargs signature: text, images, audio, videos. Kwargs set by a user always override the default processing kwargs (see the sketch after this list).
  • Slicing of positional args is removed and replaced by named kwargs corresponding to each modality. As a consequence, the order of kwargs is not constant across models, in order to preserve backwards compatibility.
  • Inputs (text, images, audio, videos) are now always typed in the call.
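For concreteness, here is a minimal sketch of the pattern described above; MyProcessor and the specific default values are illustrative, not the exact code in this PR:

from typing import List, Optional, TypedDict, Union


class TextKwargs(TypedDict, total=False):
    padding: Union[bool, str]
    max_length: Optional[int]


class ImagesKwargs(TypedDict, total=False):
    do_resize: Optional[bool]
    image_mean: Optional[Union[float, List[float]]]


class MyProcessor:
    def __init__(self, image_processor, tokenizer):
        self.image_processor = image_processor
        self.tokenizer = tokenizer
        # The new constructor attribute: per-modality processing defaults.
        self.processing_kwargs = {
            "common_kwargs": {"return_tensors": None},
            "text_kwargs": {"padding": "max_length", "max_length": 64},
            "images_kwargs": {"do_resize": True},
        }

    def __call__(self, text=None, images=None, audio=None, videos=None, **kwargs):
        # User-passed kwargs always take precedence over the stored defaults.
        text_kwargs = {**self.processing_kwargs["text_kwargs"], **kwargs}
        images_kwargs = {**self.processing_kwargs["images_kwargs"], **kwargs}
        encoding = self.tokenizer(text, **text_kwargs)
        encoding.update(self.image_processor(images, **images_kwargs))
        return encoding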

What that allows:

  • We know each model has its own processing logic and way of mixing inputs. Typing both the modalities sent to the processors AND the processing arguments allows faster design of future processors, hence faster review and faster merging.
  • The reason is that the actual mixing of modalities and their model-specific processing is pushed out of the call signature and can be handled explicitly with the kwargs passed in.
  • With TypedDict, type hints are preserved even in the nesting. I hesitated between pydantic/dataclasses and TypedDict and opted for TypedDict because it is less flexible, and we want to enforce a standard.

Limitations:

  • This still relies on kwargs for the processing.
  • Only a few models are touched in this PR - more will follow in other PRs.

Tests pass (I think). What's missing:

  • A consistent way (maybe with logger.warning_once) to warn users that they are using a deprecated call.
  • Typing of the input modalities needs to be improved and made more flexible, especially for audio and videos.
  • A test in the CI to check that the enforced design is respected when a new model is added. This can be done in a separate PR.

Who can review?

Models:

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@lewtun (Member) commented Apr 29, 2024

cc @edbeeching who also has quite some experience trying to standardise processors when he ported Llava to TRL

@amyeroberts (Collaborator) left a comment

👏 👏 👏

What a beautiful PR. Thank you for handling a very tricky problem and applying such an elegant solution. This is going to be great for standardising our processors.

Only request is that we have some tests for the processors. I realise most of the processors don't have tests, so this is possibly better suited to a follow-up PR. Ideally we'd have e.g. a processor mixin that would check that processing_kwargs is implemented. Most important, I think, is just to make sure the default behaviour is as expected and that return_tensors works.
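Sketching what such a follow-up could look like; ProcessorTesterMixin and get_processor are hypothetical names, and the dummy inputs are only illustrative:

import numpy as np


class ProcessorTesterMixin:
    # Concrete test classes (unittest.TestCase subclasses) are expected to
    # provide self.get_processor().

    def test_processing_kwargs_is_implemented(self):
        processor = self.get_processor()
        self.assertTrue(hasattr(processor, "processing_kwargs"))
        for key in ("common_kwargs", "text_kwargs", "images_kwargs", "audio_kwargs", "videos_kwargs"):
            self.assertIn(key, processor.processing_kwargs)

    def test_return_tensors_is_respected(self):
        processor = self.get_processor()
        dummy_image = np.zeros((3, 30, 30), dtype=np.uint8)
        outputs = processor(text="lower newer", images=dummy_image, return_tensors="np")
        # Default behaviour check: every returned field should honour return_tensors.
        for value in outputs.values():
            self.assertIsInstance(value, np.ndarray)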

@@ -53,6 +61,77 @@
}


class TextKwargs(TypedDict, total=False):
Collaborator:

This is honestly so beautiful ❤️

Collaborator:

agree! One question: where is the doc for these args? 👀

Collaborator:

also FMI (but a comment about that would be nice) what does total=False do?

Contributor Author (molbap):

Yes, will transfer some doc - with the inherited class solution we can do that nicely :) total=False means that not all the arguments are required. Basically, total=True would guarantee that all the keys are present, which is not what we want here.
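In other words, a minimal illustration of total=False (standalone example, not code from this PR):

from typing import TypedDict


class TextKwargs(TypedDict, total=False):
    padding: bool
    max_length: int


partial: TextKwargs = {"max_length": 64}  # fine: keys are optional with total=False
empty: TextKwargs = {}                    # also fine

# With total=True (the default), a type checker would flag {"max_length": 64}
# as missing the "padding" key.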

image_mean: Optional[Union[float, List[float]]]
image_std: Optional[Union[float, List[float]]]
do_pad: Optional[bool]
pad_and_return_pixel_mask: Optional[bool]
Collaborator:

How easy is it to deprecate these arguments? I'd really like to see pad_and_return (or any x_and_y style flags) removed eventually

Contributor Author (molbap):

I think with this framework it's deprecate-able for sure :)


class AudioKwargs(TypedDict, total=False):
sampling_rate: Optional[int]
# raw_speech: Optional[List[List[float]]]
Collaborator:

Why is this one commented out?

Contributor Author (molbap):

ah, because I thought it was a processing kwarg but it seems to be a deprecated positional arg that was used before audio, will remove it altogether!

input_data_format: Optional[Union[str, ChannelDimension]]


class AudioKwargs(TypedDict, total=False):
Collaborator:

Interesting - so the audio feature extractors have little to no configuration?

Contributor Author (molbap):

I think a few more will need to be added! For example, the Wav2Vec2FeatureExtractor accepts:

        raw_speech: Union[np.ndarray, List[float], List[np.ndarray], List[List[float]]],
        padding: Union[bool, str, PaddingStrategy] = False,
        max_length: Optional[int] = None,
        truncation: bool = False,
        pad_to_multiple_of: Optional[int] = None,
        return_attention_mask: Optional[bool] = None,
        return_tensors: Optional[Union[str, TensorType]] = None,
        sampling_rate: Optional[int] = None,
        **kwargs,
But most of these use the same naming as the `tokenizer`, so I wanted to gather a few more before deciding on the defaults.
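For illustration only, those shared names could eventually be gathered into an AudioKwargs along these lines (a guess at the shape, not the final definition in this PR):

from typing import Optional, TypedDict, Union


class AudioKwargs(TypedDict, total=False):
    sampling_rate: Optional[int]
    padding: Union[bool, str]          # mirrors the tokenizer-style naming above
    max_length: Optional[int]
    truncation: Optional[bool]
    pad_to_multiple_of: Optional[int]
    return_attention_mask: Optional[bool]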

encoding = self.tokenizer(
text, padding=padding, max_length=max_length, return_tensors=return_tensors, **kwargs
)
text_kwargs = {
Collaborator:

🤩

# BC for explicit return_tensors
common_kwargs = {**self.processing_kwargs["common_kwargs"], **kwargs}
if "return_tensors" in common_kwargs:
return_tensors = kwargs.pop("return_tensors", None)
Collaborator:

Maybe? It's not clear to me here why we check in common_kwargs and pop from kwargs

Suggested change
return_tensors = kwargs.pop("return_tensors", None)
return_tensors = common_kwargs.pop("return_tensors", None)

Contributor Author (molbap):

yes, good catch!

image_features = self.image_processor(images, **images_kwargs)

# BC for explicit return_tensors
common_kwargs = {**self.processing_kwargs["common_kwargs"], **kwargs}
Collaborator:

Why do we prepare common_kwargs here and not use them? Is it for consistency in the processor patterns?

Contributor Author (molbap):

will correct this part in align :)

max_length (`int`, *optional*, defaults to `max_length`):
Maximum padding value to use to pad the input text during tokenization.

return_tensors (`str` or [`~utils.TensorType`], *optional*):
Collaborator:

return_tensors is the only one I'm not sure we want to remove from the docstring, as it's something users have to specify if they want to be able to use the outputs for a model

Contributor Author (molbap):

that's a good point, in general anything that is common could be here

Comment on lines 214 to 215
if pad_and_return_pixel_mask:
do_pad = pad_and_return_pixel_mask
Collaborator:

I think I let this slip through the net when I was reviewing. If we're doing this, then we should really have a deprecation message to 1) avoid this dual-arg use and 2) eventually remove x_and_y kinds of arguments

Contributor Author (molbap):

I think before, it was just popping the x_and_y arg from kwargs and assigning it to do_pad. I'm all for removing dual args in general, and deprecation warnings are indeed missing from that one - will add that!
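A possible shape for that deprecation path, using the logger.warning_once idea mentioned in the PR description (sketch only, not the code added in this PR):

from transformers.utils import logging

logger = logging.get_logger(__name__)


def resolve_do_pad(do_pad=None, pad_and_return_pixel_mask=None):
    # Map the deprecated x_and_y flag onto do_pad, warning only once per process.
    if pad_and_return_pixel_mask is not None:
        logger.warning_once(
            "`pad_and_return_pixel_mask` is deprecated and will be removed in a future "
            "version, use `do_pad` instead."
        )
        if do_pad is None:
            do_pad = pad_and_return_pixel_mask
    return do_pad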

@molbap (Contributor Author) commented May 7, 2024

Alright, should be a bit better! I added some more typing for audio inputs to protect them a bit more. Worked on a test to enforce this structure as well, should be able to push it after 🌉

@amyeroberts (Collaborator) left a comment

Looks great!

There are a few tiny nits. Main comment is about the configurability of the processors, and how we control the behaviour.

Would be good to get @ArthurZucker's view on this too, as it's a big and important piece of work!

Comment on lines +159 to +162
encodings = self.tokenizer(
text,
**text_kwargs,
)
Collaborator:

nit - should stay on one line

Suggested change
encodings = self.tokenizer(
text,
**text_kwargs,
)
encodings = self.tokenizer(text, **text_kwargs)

Contributor Author (molbap):

yes, I don't know why make fixup moved it like that!

Comment on lines +154 to +156
inputs = self.image_processor(
images,
**images_kwargs,
Collaborator:

Suggested change
inputs = self.image_processor(
images,
**images_kwargs,
inputs = self.image_processor(images, **images_kwargs

Comment on lines +145 to +148
return self.current_processor(
images,
**images_kwargs,
)
Collaborator:

Suggested change
return self.current_processor(
images,
**images_kwargs,
)
return self.current_processor(images, **images_kwargs)

Comment on lines +48 to +87
self.processing_kwargs: ProcessingKwargs = {
    "common_kwargs": {"return_tensors": None},
    "text_kwargs": {
        "text_pair": None,
        "text_target": None,
        "text_pair_target": None,
        "add_special_tokens": True,
        "padding": "max_length",
        "truncation": True,
        "max_length": 64,
        "stride": 0,
        "is_split_into_words": False,
        "pad_to_multiple_of": None,
        "return_token_type_ids": None,
        "return_attention_mask": None,
        "return_overflowing_tokens": False,
        "return_special_tokens_mask": False,
        "return_offsets_mapping": False,
        "return_length": False,
        "verbose": True,
    },
    "images_kwargs": {
        "do_crop_margin": None,
        "do_resize": None,
        "size": None,
        "resample": None,
        "do_thumbnail": None,
        "do_align_long_axis": None,
        "do_pad": None,
        "do_rescale": None,
        "rescale_factor": None,
        "do_normalize": None,
        "image_mean": None,
        "image_std": None,
        "data_format": "channels_first",
        "input_data_format": None,
    },
    "audio_kwargs": {},
    "videos_kwargs": {},
}
Collaborator:

The only thing I'd say about the current design is that the default behaviour of a processor isn't configurable. As users of multimodal models just want to use the processor, and processors might share the default processing classes they bundle together (e.g. CLIP), we might want to be able to override this in the init, such that I can save and load processors with custom behaviour. For example:

from transformers import AlignProcessor

processor = AlignProcessor.from_pretrained("kakaobrain/align-base")
# I have to pass in max_length in for every call
output = processor(images=images, text=text, max_length=128)

processor = AlignProcessor.from_pretrained("kakaobrain/align-base", max_length=128)
processor.save_pretrained("my_new_model")

processor = AlignProcessor.from_pretrained("my_new_model")
# The processor automatically defaults to the max value I want when calling
output = processor(images=images, text=text)

The correct way might be to say that the behaviour of e.g. the tokenizer should be configured instead. At the moment, I think that if I change the tokenizer and build the processor, it will use the processor's max_length default rather than the tokenizer's.

from transformers import AlignProcessor, AutoTokenizer, AutoImageProcessor

tokenizer = AutoTokenizer.from_pretrained("kakaobrain/align-base", max_length=128)
image_processor = AutoImageProcessor.from_pretrained("kakaobrain/align-base")
processor = AlignProcessor(tokenizer, image_processor)

# Is max_length 128 or 64 used here? 
outputs = processor(text=text, images=images)
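One way to answer that question would be an explicit precedence order when the kwargs are merged, for example call-time kwargs over the tokenizer's own init kwargs over the processor defaults. The helper below is only a sketch of that idea, and init_kwargs as the place where the tokenizer stores its configuration is an assumption:

def merged_text_kwargs(processor, **call_kwargs):
    # Lowest priority: the processor's hard-coded defaults.
    merged = dict(processor.processing_kwargs["text_kwargs"])
    # Middle priority: values the tokenizer was configured with at init time
    # (assumed to be available as tokenizer.init_kwargs).
    tokenizer_init = getattr(processor.tokenizer, "init_kwargs", {})
    merged.update({k: v for k, v in tokenizer_init.items() if k in merged})
    # Highest priority: kwargs passed explicitly at call time.
    merged.update(call_kwargs)
    return merged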

Contributor Author (molbap):

That's a good point - I tried the second example with model_max_length and it seems to work fine:
[screenshot]

@ArthurZucker (Collaborator) left a comment

Very nice indeed!

text, padding=padding, max_length=max_length, return_tensors=return_tensors, **kwargs
)
text_kwargs = {
**self.processing_kwargs["text_kwargs"],
Collaborator:

So this is basically creating a dictionary with the defaults and updating it with the kwargs.

  1. It's very nice.
  2. It's a tad bit not explicit enough for me (for newbies, it would be hard to guess that this is what's happening IMO).
    Thus I am wondering if there is a way to leverage the ProcessingKwargs class rather than use the class's keys.
    What I mean is that something along the lines of self.processing_kwargs.update_with_kwargs("text_kwargs", **kwargs) could be more understandable? (I don't know)
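For reference, such a helper could be as small as the sketch below; update_with_kwargs is only the name floated above, and this would require ProcessingKwargs to be a real class (e.g. a dict subclass) rather than a bare TypedDict:

class ProcessingKwargs(dict):
    def update_with_kwargs(self, modality_key, **kwargs):
        # Merge the stored defaults for e.g. "text_kwargs" with user overrides,
        # without mutating the stored defaults themselves.
        merged = dict(self.get(modality_key, {}))
        merged.update(kwargs)
        return merged


# Inside a processor's __call__:
# text_kwargs = self.processing_kwargs.update_with_kwargs("text_kwargs", **kwargs)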

truncation: Optional[bool]
pad_to_multiple_of: Optional[int]
return_attention_mask: Optional[bool]
return_tensors: Optional[Union[str, TensorType]]
Collaborator:

this is a common arg, it should not be repeated, no?

images: ImageInput = None,
audio=None,
videos=None,
**kwargs,
Collaborator:

My main question is: how are type hints and signature scans done for this? Is there a way to say that the kwargs include the text kwargs and image kwargs, and to benefit from having the doc as well?

@ArthurZucker (Collaborator) left a comment

[screenshot] Seems like if you unpack the kwargs using the class, it should work to have a good __call__ signature?

https://typing.readthedocs.io/en/latest/spec/callables.html#unpack-kwargs
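Concretely, the linked spec (PEP 692) allows typing **kwargs with a TypedDict via Unpack, along the lines of the sketch below (Unpack in **kwargs needs Python 3.12+ or typing_extensions; the names are illustrative):

from typing import Optional, TypedDict

from typing_extensions import Unpack


class TextKwargs(TypedDict, total=False):
    padding: bool
    max_length: Optional[int]


class MyProcessor:
    def __call__(self, text=None, images=None, **kwargs: Unpack[TextKwargs]):
        # Type checkers and IDEs now know which keyword names are allowed
        # and what their types are, straight from the __call__ signature.
        ...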

@ArthurZucker (Collaborator) commented

We might have to simply define a class for each processor, but it seems to me like it's the best compromise, no?
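i.e. something like one kwargs class per processor, combining the modality TypedDicts it accepts; AlignProcessorKwargs below is purely illustrative, not code from this PR:

from typing import Optional, TypedDict

from typing_extensions import Unpack


class TextKwargs(TypedDict, total=False):
    padding: bool
    max_length: Optional[int]


class ImagesKwargs(TypedDict, total=False):
    do_resize: Optional[bool]


class AlignProcessorKwargs(TextKwargs, ImagesKwargs, total=False):
    # One class per processor, combining the kwargs of its sub-processors.
    pass


class AlignProcessor:
    def __call__(self, text=None, images=None, **kwargs: Unpack[AlignProcessorKwargs]):
        ...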

@molbap (Contributor Author) commented May 14, 2024

Agree with you @ArthurZucker - I'll update the PR to have that, looks super nice
