Closed
Description
System Info
- `transformers` version: 4.51.0.dev0 (commit 0d6a60f)
- Platform: Linux-5.14.0-503.22.1.el9_5.x86_64-x86_64-with-glibc2.35
- Python version: 3.11.11
- Huggingface_hub version: 0.29.3
- Safetensors version: 0.5.3
- Accelerate version: 1.5.1
- Accelerate config: not found
- DeepSpeed version: not installed
- PyTorch version (GPU?): 2.6.0+cu126 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
Who can help?
How does `Phi4MultimodalProcessor` configure the `image_token` and `audio_token` on the tokenizer? The processor's `__call__` (shown in the traceback below) tries to retrieve these token attributes from the tokenizer, but I cannot find where they are initialized. As a result, when I run the following code, the processor cannot properly generate an input:
```python
processor = Phi4MultimodalProcessor.from_pretrained("microsoft/Phi-4-multimodal-instruct")
image = generate_random_image(resolution=(720, 480))
inputs = processor(text="<|image_1|>", images=image, return_tensors="pt").to(dtype=torch.bfloat16, device="cuda")

# The audio path fails the same way:
inputs = processor(text="<|audio_1|>", audios=audio, return_tensors="pt").to(dtype=torch.bfloat16, device="cuda")  # ---> raises

inputs["labels"] = inputs["input_ids"].clone()
outputs = model(**inputs)
```
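For context, the placeholder substitution the processor attempts can be sketched as follows. The pattern and replacement token here are assumed values for illustration; the real ones come from the processor/tokenizer configuration, which is exactly what fails to resolve in the traceback:

```python
import re

# Assumed values for illustration only -- not the ones the model repo defines.
fake_image_token_pattern = r"<\|image_\d+\|>"
image_token = "<|image|>"

text = ["Describe this picture: <|image_1|>"]
# Mirrors the substitution step in Phi4MultimodalProcessor.__call__:
processed_text = [re.sub(fake_image_token_pattern, image_token, t) for t in text]
print(processed_text[0])  # Describe this picture: <|image|>
```

The substitution itself is straightforward; the failure happens one line earlier, when `self.tokenizer.image_token` is looked up.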
```
File /opt/conda/lib/python3.11/site-packages/transformers/models/phi4_multimodal/processing_phi4_multimodal.py:135, in Phi4MultimodalProcessor.__call__(self, text, images, audios, **kwargs)
    elif not isinstance(text, list) and not isinstance(text[0], str):
        raise ValueError("Invalid input text. Please provide a string, or a list of strings")
--> image_token = self.tokenizer.image_token
    audio_token = self.tokenizer.audio_token
    processed_text = [re.sub(self.fake_image_token_pattern, image_token, t) for t in text]

File /opt/conda/lib/python3.11/site-packages/transformers/tokenization_utils_base.py:1108, in SpecialTokensMixin.__getattr__(self, key)
    return self.convert_tokens_to_ids(attr_as_tokens) if attr_as_tokens is not None else None
    if key not in self.__dict__:
-->     raise AttributeError(f"{self.__class__.__name__} has no attribute {key}")
    else:
        return super().__getattr__(key)

AttributeError: GPT2TokenizerFast has no attribute image_token
```
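The failure mode can be reproduced with a plain-Python sketch of the attribute lookup. This is a deliberately simplified model of `SpecialTokensMixin.__getattr__`, not the real implementation: the attribute only resolves if the token was registered on the tokenizer, which does not appear to happen for this checkpoint:

```python
class TokenizerSketch:
    """Simplified stand-in for a tokenizer with optional extra special tokens."""

    def __init__(self, extra_special_tokens=None):
        self._extra_special_tokens = dict(extra_special_tokens or {})

    def __getattr__(self, key):
        # Only invoked on attribute misses; mirrors the AttributeError in the traceback.
        extra = self.__dict__.get("_extra_special_tokens", {})
        if key in extra:
            return extra[key]
        raise AttributeError(f"{type(self).__name__} has no attribute {key}")


bare = TokenizerSketch()
try:
    bare.image_token
except AttributeError as e:
    print(e)  # TokenizerSketch has no attribute image_token

configured = TokenizerSketch(extra_special_tokens={"image_token": "<|image|>"})
print(configured.image_token)  # <|image|>
```

So the question reduces to: where is `image_token` supposed to be registered on the `GPT2TokenizerFast` that this checkpoint loads?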
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
- Use the following function to generate a fake image:
```python
import numpy as np
from PIL import Image

def generate_random_image(resolution: tuple[int, int]) -> Image.Image:
    width, height = resolution
    image = Image.fromarray(np.random.randint(0, 256, size=(height, width, 3), dtype=np.uint8))
    return image
```
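A quick sanity check of the helper (assuming Pillow and NumPy are installed). Note that NumPy arrays are indexed `(height, width, channels)` while PIL reports size as `(width, height)`:

```python
import numpy as np
from PIL import Image

def generate_random_image(resolution: tuple[int, int]) -> Image.Image:
    width, height = resolution
    # NumPy expects (height, width, channels); PIL's .size is (width, height).
    return Image.fromarray(np.random.randint(0, 256, size=(height, width, 3), dtype=np.uint8))

img = generate_random_image(resolution=(720, 480))
print(img.size)  # (720, 480)
print(img.mode)  # RGB
```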
- Call the processor:

```python
import torch
from transformers import Phi4MultimodalProcessor

processor = Phi4MultimodalProcessor.from_pretrained("microsoft/Phi-4-multimodal-instruct")
image = generate_random_image(resolution=(720, 480))
inputs = processor(text="<|image_1|>", images=image, return_tensors="pt").to(dtype=torch.bfloat16, device="cuda")
```
Expected behavior
Return an output containing `input_ids`, `attention_mask`, `pixel_values`, etc., without raising an error.