# ByT5

The ByT5 model was presented in [ByT5: Towards a token-free future with pre-trained byte-to-byte models](https://arxiv.org/abs/2105.13626) by Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel.

The abstract from the paper is the following:

*Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units. Encoding text as a sequence of tokens requires a tokenizer, which is typically created as an independent artifact from the model. Token-free models that instead operate directly on raw text (bytes or characters) have many benefits: they can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by removing complex and error-prone text preprocessing pipelines. Since byte or character sequences are longer than token sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of operating directly on raw text. In this paper, we show that a standard Transformer architecture can be used with minimal modifications to process byte sequences. We carefully characterize the trade-offs in terms of parameter count, training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level counterparts. We also demonstrate that byte-level models are significantly more robust to noise and perform better on tasks that are sensitive to spelling and pronunciation. As part of our contribution, we release a new set of pre-trained byte-level Transformer models based on the T5 architecture, as well as all code and data used in our experiments.*

This model was contributed by [patrickvonplaten](https://huggingface.co/patrickvonplaten). The original code can be found [here](https://github.com/google-research/byt5).
[ByT5](https://huggingface.co/papers/2105.13626) is a tokenizer-free version of the [T5](./t5) model designed to work directly on raw UTF-8 bytes. This means it can process any language, is more robust to noise like typos, and is simpler to use because it doesn't require a text preprocessing pipeline.

ByT5's architecture is based on the T5v1.1 model. Refer to [T5v1.1's documentation page](t5v1.1) for the API reference; the two models only differ in how inputs should be prepared, as shown in the code examples below.

You can find all the original ByT5 checkpoints under the [Google](https://huggingface.co/google?search_models=byt5) organization.

> [!TIP]
> Refer to the [T5](./t5) docs for more examples of how to apply ByT5 to different language tasks.

Since ByT5 was pretrained in an unsupervised manner, there's no real advantage to using a task prefix during single-task fine-tuning. If you are doing multi-task fine-tuning, you should use a prefix.
## Usage example

The example below demonstrates how to generate text with [`Pipeline`], [`AutoModel`], and from the command line.

<hfoptions id="usage">
<hfoption id="Pipeline">

```python
import torch
from transformers import pipeline

pipeline = pipeline(
    task="text2text-generation",
    model="google/byt5-small",
    torch_dtype=torch.float16,
    device=0
)
pipeline("translate English to French: The weather is nice today")
```

</hfoption>
<hfoption id="AutoModel">

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "google/byt5-small"
)
model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/byt5-small",
    torch_dtype=torch.float16,
    device_map="auto"
)

input_ids = tokenizer("summarize: Photosynthesis is the process by which plants, algae, and some bacteria convert light energy into chemical energy.", return_tensors="pt").to("cuda")

output = model.generate(**input_ids)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

</hfoption>
<hfoption id="transformers-cli">

```bash
echo -e "translate English to French: Life is beautiful." | transformers-cli run --task text2text-generation --model google/byt5-small --device 0
```

</hfoption>
</hfoptions>

ByT5 works on raw UTF-8 bytes, so it can be used without a tokenizer:

```python
>>> import torch
>>> from transformers import T5ForConditionalGeneration

>>> model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

>>> num_special_tokens = 3
>>> # Model has 3 special tokens which take up the input ids 0, 1, 2 of ByT5.
>>> # => Need to shift utf-8 character encodings by 3 before passing ids to model.

>>> input_ids = torch.tensor([list("Life is like a box of chocolates.".encode("utf-8"))]) + num_special_tokens

>>> labels = torch.tensor([list("La vie est comme une boîte de chocolat.".encode("utf-8"))]) + num_special_tokens

>>> loss = model(input_ids, labels=labels).loss
>>> loss.item()
2.66
```
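Generated ids can be mapped back to text without a tokenizer as well. The sketch below reuses `model`, `input_ids`, and `num_special_tokens` from the example above; `max_new_tokens` is an illustrative choice:

```python
# Generate a continuation and decode it by hand.
generated = model.generate(input_ids, max_new_tokens=20)

# Drop the special token ids (0, 1, 2) and shift the remaining ids back into byte values.
output_bytes = bytes(t - num_special_tokens for t in generated[0].tolist() if t >= num_special_tokens)
print(output_bytes.decode("utf-8", errors="ignore"))
```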
For batched inference and training, however, it is recommended to use the tokenizer:
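Here is a minimal sketch of batched training inputs prepared with the tokenizer, which pads every byte sequence to the longest example in the batch (the sentence pairs are illustrative):

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

# The tokenizer pads the byte sequences to the longest example in the batch.
model_inputs = tokenizer(
    ["Life is like a box of chocolates.", "Today is Monday."],
    padding="longest",
    return_tensors="pt",
)
labels = tokenizer(
    ["La vie est comme une boîte de chocolat.", "Aujourd'hui c'est lundi."],
    padding="longest",
    return_tensors="pt",
).input_ids

# For real training, replace padding ids in `labels` with -100 so they are ignored by the loss.
loss = model(**model_inputs, labels=labels).loss
```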
## Quantization

Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the [Quantization](../quantization/overview) overview for the available quantization backends.

The example below uses [torchao](../quantization/torchao) to only quantize the weights to int4.
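A minimal sketch, assuming a recent Transformers release with the torchao backend installed (`pip install torchao`); the checkpoint and `group_size` below are illustrative choices:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, TorchAoConfig

# Quantize the linear weights to int4 when the model is loaded.
quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/byt5-xl",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quantization_config
)

tokenizer = AutoTokenizer.from_pretrained("google/byt5-xl")
input_ids = tokenizer("translate English to French: The weather is nice today", return_tensors="pt").to(model.device)
output = model.generate(**input_ids)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```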