
Image + text + audio uniform processors #30511

Open · wants to merge 51 commits into base: main

Conversation

@molbap (Contributor) commented Apr 26, 2024

What does this PR do?

This PR is a stab at making the processors uniform across all transformers models. If we are happy with the design, I'll expand it to all existing models. It only touches a few text + audio and text + image models as experiment subjects. It is linked with #28711, which has a larger scope, and with several previous discussions with team members.

What changes:

  • There is now an attribute set in the constructor that stores the processing kwargs that need to be passed to the various encoders down the line.
  • These nested attributes (one dictionary for text, one for images, one for audio, one for video) are typed with a TypedDict that does not need to be total. We can expand this TypedDict (in processing_utils) as processors get uniformized.
  • The processors are called with the same kwargs signature: text, images, audio, videos. Kwargs set by a user always override the default processing kwargs (see the sketch after this list).
  • Slicing of positional args is removed and replaced by named kwargs corresponding to each modality. As a consequence, the order of kwargs is not constant across models, in order to preserve backwards compatibility.
  • Inputs (text, images, audio, videos) are now always typed in the call.
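For concreteness, here is a minimal sketch of the pattern described above; MyProcessor and the specific default values are illustrative, not the exact code in this PR:

from typing import List, Optional, TypedDict, Union


class TextKwargs(TypedDict, total=False):
    padding: Union[bool, str]
    max_length: Optional[int]


class ImagesKwargs(TypedDict, total=False):
    do_resize: Optional[bool]
    image_mean: Optional[Union[float, List[float]]]


class MyProcessor:
    def __init__(self, image_processor, tokenizer):
        self.image_processor = image_processor
        self.tokenizer = tokenizer
        # The new constructor attribute: per-modality processing defaults.
        self.processing_kwargs = {
            "common_kwargs": {"return_tensors": None},
            "text_kwargs": {"padding": "max_length", "max_length": 64},
            "images_kwargs": {"do_resize": True},
        }

    def __call__(self, text=None, images=None, audio=None, videos=None, **kwargs):
        # User-passed kwargs always take precedence over the stored defaults.
        text_kwargs = {**self.processing_kwargs["text_kwargs"], **kwargs}
        images_kwargs = {**self.processing_kwargs["images_kwargs"], **kwargs}
        encoding = self.tokenizer(text, **text_kwargs)
        encoding.update(self.image_processor(images, **images_kwargs))
        return encoding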

What that allows:

  • We know each model has its own processing logic and way of mixing inputs. Typing both the modalities sent to the processors AND the processing arguments allows faster design of future processors, hence faster review and faster merging.
  • The reason is that the actual mixing of modalities and their model-specific processing is pushed out of the call signature and can be handled explicitly with the kwargs passed in.
  • With TypedDict, type hints are preserved even in the nesting. I hesitated between pydantic/dataclasses and TypedDict and opted for TypedDict because it is less flexible, and we want to enforce a standard.

Limitations:

  • This still relies on kwargs for the processing.
  • Only a few models are touched in this PR - more will follow in other PRs.

Tests pass (I think). What's missing:

  • A consistent way (maybe with logger.warning_once) to warn users that they are using a deprecated call.
  • Typing of the input modalities needs to be improved and made more flexible, especially for audio and videos.
  • A test in the CI to check that the enforced design is respected when a new model is added. This can be done in a separate PR.

Who can review?

Models:

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@lewtun (Member) commented Apr 29, 2024

cc @edbeeching who also has quite some experience trying to standardise processors when he ported Llava to TRL

@amyeroberts (Collaborator) left a comment

👏 👏 👏

What a beautiful PR. Thank you for handling a very tricky problem and applying such an elegant solution. This is going to be great for standardising our processors.

Only request is that we have some tests for the processors. I realise most of the processors don't have tests, so this is possibly better suited to a follow-up PR. Ideally we'd have e.g. a processor mixin that would check that processing_kwargs is implemented. Most important, I think, is just to make sure the default behaviour is as expected and that return_tensors works.
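Sketching what such a follow-up could look like; ProcessorTesterMixin and get_processor are hypothetical names, and the dummy inputs are only illustrative:

import numpy as np


class ProcessorTesterMixin:
    # Concrete test classes (unittest.TestCase subclasses) are expected to
    # provide self.get_processor().

    def test_processing_kwargs_is_implemented(self):
        processor = self.get_processor()
        self.assertTrue(hasattr(processor, "processing_kwargs"))
        for key in ("common_kwargs", "text_kwargs", "images_kwargs", "audio_kwargs", "videos_kwargs"):
            self.assertIn(key, processor.processing_kwargs)

    def test_return_tensors_is_respected(self):
        processor = self.get_processor()
        dummy_image = np.zeros((3, 30, 30), dtype=np.uint8)
        outputs = processor(text="lower newer", images=dummy_image, return_tensors="np")
        # Default behaviour check: every returned field should honour return_tensors.
        for value in outputs.values():
            self.assertIsInstance(value, np.ndarray)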

@@ -53,6 +61,77 @@
}


class TextKwargs(TypedDict, total=False):
Collaborator:

This is honestly so beautiful ❤️

Collaborator:

agree! One question: where is the doc for these args? 👀

Collaborator:

also FMI (but a comment about that would be nice) what does total=False do?

Contributor Author (molbap):

Yes, will transfer some doc - with the inherited class solution we can do that nicely :) total=False means that not all the arguments are required. Basically, total=True would guarantee that all the keys are present, which is not what we want here.
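In other words, a minimal illustration of total=False (standalone example, not code from this PR):

from typing import TypedDict


class TextKwargs(TypedDict, total=False):
    padding: bool
    max_length: int


partial: TextKwargs = {"max_length": 64}  # fine: keys are optional with total=False
empty: TextKwargs = {}                    # also fine

# With total=True (the default), a type checker would flag {"max_length": 64}
# as missing the "padding" key.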

image_mean: Optional[Union[float, List[float]]]
image_std: Optional[Union[float, List[float]]]
do_pad: Optional[bool]
pad_and_return_pixel_mask: Optional[bool]
Collaborator:

How easy is it to deprecate these arguments? I'd really like to see pad_and_return (or any x_and_y style flags) removed eventually

Contributor Author (molbap):

I think with this framework it's deprecate-able for sure :)


class AudioKwargs(TypedDict, total=False):
sampling_rate: Optional[int]
# raw_speech: Optional[List[List[float]]]
Collaborator:

Why is this one commented out?

Contributor Author (molbap):

ah, because I thought it was a processing kwarg but it seems to be a deprecated positional arg that was used before audio, will remove it altogether!

input_data_format: Optional[Union[str, ChannelDimension]]


class AudioKwargs(TypedDict, total=False):
Collaborator:

Interesting - so the audio feature extractors have little to no configuration?

Contributor Author (molbap):

I think a few more will need to be added! For example, the Wav2Vec2FeatureExtractor accepts:

        raw_speech: Union[np.ndarray, List[float], List[np.ndarray], List[List[float]]],
        padding: Union[bool, str, PaddingStrategy] = False,
        max_length: Optional[int] = None,
        truncation: bool = False,
        pad_to_multiple_of: Optional[int] = None,
        return_attention_mask: Optional[bool] = None,
        return_tensors: Optional[Union[str, TensorType]] = None,
        sampling_rate: Optional[int] = None,
        **kwargs,
But most of these use the same naming as the `tokenizer`, so I wanted to gather a few more before deciding on the defaults.
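For illustration only, those shared names could eventually be gathered into an AudioKwargs along these lines (a guess at the shape, not the final definition in this PR):

from typing import Optional, TypedDict, Union


class AudioKwargs(TypedDict, total=False):
    sampling_rate: Optional[int]
    padding: Union[bool, str]          # mirrors the tokenizer-style naming above
    max_length: Optional[int]
    truncation: Optional[bool]
    pad_to_multiple_of: Optional[int]
    return_attention_mask: Optional[bool]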

encoding = self.tokenizer(
text, padding=padding, max_length=max_length, return_tensors=return_tensors, **kwargs
)
text_kwargs = {
Collaborator:

🤩

# BC for explicit return_tensors
common_kwargs = {**self.processing_kwargs["common_kwargs"], **kwargs}
if "return_tensors" in common_kwargs:
return_tensors = kwargs.pop("return_tensors", None)
Collaborator:

Maybe? It's not clear to me here why we check in common_kwargs and pop from kwargs

Suggested change
return_tensors = kwargs.pop("return_tensors", None)
return_tensors = common_kwargs.pop("return_tensors", None)

Contributor Author (molbap):

yes, good catch!

image_features = self.image_processor(images, **images_kwargs)

# BC for explicit return_tensors
common_kwargs = {**self.processing_kwargs["common_kwargs"], **kwargs}
Collaborator:

Why do we prepare common_kwargs here and not use them? Is it for consistency in the processor patterns?

Contributor Author (molbap):

will correct this part in align :)

max_length (`int`, *optional*, defaults to `max_length`):
Maximum padding value to use to pad the input text during tokenization.

return_tensors (`str` or [`~utils.TensorType`], *optional*):
Collaborator:

return_tensors is the only one I'm not sure we want to remove from the docstring, as it's something users have to specify if they want to be able to use the outputs for a model

Contributor Author (molbap):

that's a good point, in general anything that is common could be here

Comment on lines 214 to 215
if pad_and_return_pixel_mask:
do_pad = pad_and_return_pixel_mask
Collaborator:

I think I let this slip through the net when I was reviewing. If we're doing this, then we should really have a deprecation message to 1) avoid this dual-arg use and 2) eventually remove x_and_y kinds of arguments

Contributor Author (molbap):

I think before, it was just popping the x_and_y arg from kwargs and assigning it to do_pad. I'm all for removing dual args in general, and deprecation warnings are indeed missing from that one - will add that!
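A possible shape for that deprecation path, using the logger.warning_once idea mentioned in the PR description (sketch only, not the code added in this PR):

from transformers.utils import logging

logger = logging.get_logger(__name__)


def resolve_do_pad(do_pad=None, pad_and_return_pixel_mask=None):
    # Map the deprecated x_and_y flag onto do_pad, warning only once per process.
    if pad_and_return_pixel_mask is not None:
        logger.warning_once(
            "`pad_and_return_pixel_mask` is deprecated and will be removed in a future "
            "version, use `do_pad` instead."
        )
        if do_pad is None:
            do_pad = pad_and_return_pixel_mask
    return do_pad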

@molbap (Contributor Author) commented May 7, 2024

Alright, should be a bit better! I added some more typing for audio inputs to protect them a bit more. Worked on a test to enforce this structure as well, should be able to push it after 🌉

@amyeroberts (Collaborator) left a comment

Looks great!

There are a few tiny nits. Main comment is about the configurability of the processors, and how we control the behaviour.

Would be good to get @ArthurZucker's view on this too, as it's a big and important piece of work!

Comment on lines +159 to +162
encodings = self.tokenizer(
text,
**text_kwargs,
)
Collaborator:

nit - should stay on one line

Suggested change
encodings = self.tokenizer(
text,
**text_kwargs,
)
encodings = self.tokenizer(text, **text_kwargs)

Contributor Author (molbap):

yes, I don't know why make fixup moved it like that!

Comment on lines +154 to +156
inputs = self.image_processor(
images,
**images_kwargs,
Collaborator:

Suggested change
inputs = self.image_processor(
images,
**images_kwargs,
inputs = self.image_processor(images, **images_kwargs

Comment on lines +145 to +148
return self.current_processor(
images,
**images_kwargs,
)
Collaborator:

Suggested change
return self.current_processor(
images,
**images_kwargs,
)
return self.current_processor(images, **images_kwargs)

Comment on lines +48 to +87
self.processing_kwargs: ProcessingKwargs = {
    "common_kwargs": {"return_tensors": None},
    "text_kwargs": {
        "text_pair": None,
        "text_target": None,
        "text_pair_target": None,
        "add_special_tokens": True,
        "padding": "max_length",
        "truncation": True,
        "max_length": 64,
        "stride": 0,
        "is_split_into_words": False,
        "pad_to_multiple_of": None,
        "return_token_type_ids": None,
        "return_attention_mask": None,
        "return_overflowing_tokens": False,
        "return_special_tokens_mask": False,
        "return_offsets_mapping": False,
        "return_length": False,
        "verbose": True,
    },
    "images_kwargs": {
        "do_crop_margin": None,
        "do_resize": None,
        "size": None,
        "resample": None,
        "do_thumbnail": None,
        "do_align_long_axis": None,
        "do_pad": None,
        "do_rescale": None,
        "rescale_factor": None,
        "do_normalize": None,
        "image_mean": None,
        "image_std": None,
        "data_format": "channels_first",
        "input_data_format": None,
    },
    "audio_kwargs": {},
    "videos_kwargs": {},
}
Collaborator:

The only thing I'd say about the current design is that the default behaviour of a processor isn't configurable. As users of multimodal models just want to use the processor, and processors might share the default processing classes they bundle together (e.g. CLIP), we might want to be able to override this in the init, such that I can save and load processors with custom behaviour. For example:

from transformers import AlignProcessor

processor = AlignProcessor.from_pretrained("kakaobrain/align-base")
# I have to pass in max_length in for every call
output = processor(images=images, text=text, max_length=128)

processor = AlignProcessor.from_pretrained("kakaobrain/align-base", max_length=128)
processor.save_pretrained("my_new_model")

processor = AlignProcessor.from_pretrained("my_new_model")
# The processor automatically defaults to the max value I want when calling
output = processor(images=images, text=text)

The correct way might be to say that the behaviour of e.g. the tokenizer should be configured instead. At the moment, I think that if I change the tokenizer and build the processor, it will use the processor's max_length default rather than the tokenizer's.

from transformers import AlignProcessor, AutoTokenizer, AutoImageProcessor

tokenizer = AutoTokenizer.from_pretrained("kakaobrain/align-base", max_length=128)
image_processor = AutoImageProcessor.from_pretrained("kakaobrain/align-base")
processor = AlignProcessor(tokenizer, image_processor)

# Is max_length 128 or 64 used here? 
outputs = processor(text=text, images=images)
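One way to answer that question would be an explicit precedence order when the kwargs are merged, for example call-time kwargs over the tokenizer's own init kwargs over the processor defaults. The helper below is only a sketch of that idea, and init_kwargs as the place where the tokenizer stores its configuration is an assumption:

def merged_text_kwargs(processor, **call_kwargs):
    # Lowest priority: the processor's hard-coded defaults.
    merged = dict(processor.processing_kwargs["text_kwargs"])
    # Middle priority: values the tokenizer was configured with at init time
    # (assumed to be available as tokenizer.init_kwargs).
    tokenizer_init = getattr(processor.tokenizer, "init_kwargs", {})
    merged.update({k: v for k, v in tokenizer_init.items() if k in merged})
    # Highest priority: kwargs passed explicitly at call time.
    merged.update(call_kwargs)
    return merged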

Contributor Author (molbap):

That's a good point - I tried the second example with model_max_length and it seems to work fine:
[screenshot]

@ArthurZucker (Collaborator) left a comment

Very nice indeed!

text, padding=padding, max_length=max_length, return_tensors=return_tensors, **kwargs
)
text_kwargs = {
**self.processing_kwargs["text_kwargs"],
Collaborator:

So this is basically creating a dictionary with the defaults and updating it with the kwargs.

  1. It's very nice.
  2. It's a tad bit not explicit enough for me (for newbies, it would be hard to guess that this is what's happening IMO).
    Thus I am wondering if there is a way to leverage the ProcessingKwargs class rather than use the class's keys.
    What I mean is that something along the lines of self.processing_kwargs.update_with_kwargs("text_kwargs", **kwargs) could be more understandable? (I don't know)
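For reference, such a helper could be as small as the sketch below; update_with_kwargs is only the name floated above, and this would require ProcessingKwargs to be a real class (e.g. a dict subclass) rather than a bare TypedDict:

class ProcessingKwargs(dict):
    def update_with_kwargs(self, modality_key, **kwargs):
        # Merge the stored defaults for e.g. "text_kwargs" with user overrides,
        # without mutating the stored defaults themselves.
        merged = dict(self.get(modality_key, {}))
        merged.update(kwargs)
        return merged


# Inside a processor's __call__:
# text_kwargs = self.processing_kwargs.update_with_kwargs("text_kwargs", **kwargs)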

truncation: Optional[bool]
pad_to_multiple_of: Optional[int]
return_attention_mask: Optional[bool]
return_tensors: Optional[Union[str, TensorType]]
Collaborator:

this is a common arg, it should not be repeated, no?

images: ImageInput = None,
audio=None,
videos=None,
**kwargs,
Collaborator:

My main question is: how are type hints and signature scans done for this? Is there a way to say that the kwargs include the text kwargs and image kwargs, and to benefit from having the doc as well?

@ArthurZucker (Collaborator) left a comment

[screenshot] Seems like if you unpack the kwargs using the class, it should work to have a good __call__ signature?

https://typing.readthedocs.io/en/latest/spec/callables.html#unpack-kwargs
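Concretely, the linked spec (PEP 692) allows typing **kwargs with a TypedDict via Unpack, along the lines of the sketch below (Unpack in **kwargs needs Python 3.12+ or typing_extensions; the names are illustrative):

from typing import Optional, TypedDict

from typing_extensions import Unpack


class TextKwargs(TypedDict, total=False):
    padding: bool
    max_length: Optional[int]


class MyProcessor:
    def __call__(self, text=None, images=None, **kwargs: Unpack[TextKwargs]):
        # Type checkers and IDEs now know which keyword names are allowed
        # and what their types are, straight from the __call__ signature.
        ...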

@ArthurZucker (Collaborator) commented

We might have to simply define a class for each processor, but it seems to me like it's the best compromise, no?
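i.e. something like one kwargs class per processor, combining the modality TypedDicts it accepts; AlignProcessorKwargs below is purely illustrative, not code from this PR:

from typing import Optional, TypedDict

from typing_extensions import Unpack


class TextKwargs(TypedDict, total=False):
    padding: bool
    max_length: Optional[int]


class ImagesKwargs(TypedDict, total=False):
    do_resize: Optional[bool]


class AlignProcessorKwargs(TextKwargs, ImagesKwargs, total=False):
    # One class per processor, combining the kwargs of its sub-processors.
    pass


class AlignProcessor:
    def __call__(self, text=None, images=None, **kwargs: Unpack[AlignProcessorKwargs]):
        ...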

@molbap (Contributor Author) commented May 14, 2024

Agree with you @ArthurZucker - I'll update the PR to have that, looks super nice
