From 685e7f284235e55458c5d4f7848b053926b26b70 Mon Sep 17 00:00:00 2001 From: eduard-balamatiuc Date: Sun, 22 Sep 2024 18:18:05 +0300 Subject: [PATCH 1/3] create rum folder --- chapters/rum/_toctree.yml | 201 ++++++ chapters/rum/chapter0/1.mdx | 110 +++ chapters/rum/chapter1/1.mdx | 109 +++ chapters/rum/chapter1/10.mdx | 258 +++++++ chapters/rum/chapter1/2.mdx | 26 + chapters/rum/chapter1/3.mdx | 329 +++++++++ chapters/rum/chapter1/4.mdx | 178 +++++ chapters/rum/chapter1/5.mdx | 22 + chapters/rum/chapter1/6.mdx | 21 + chapters/rum/chapter1/7.mdx | 21 + chapters/rum/chapter1/8.mdx | 32 + chapters/rum/chapter1/9.mdx | 16 + chapters/rum/chapter2/1.mdx | 25 + chapters/rum/chapter2/2.mdx | 353 ++++++++++ chapters/rum/chapter2/3.mdx | 228 ++++++ chapters/rum/chapter2/4.mdx | 240 +++++++ chapters/rum/chapter2/5.mdx | 338 +++++++++ chapters/rum/chapter2/6.mdx | 164 +++++ chapters/rum/chapter2/7.mdx | 18 + chapters/rum/chapter2/8.mdx | 310 ++++++++ chapters/rum/chapter3/1.mdx | 26 + chapters/rum/chapter3/2.mdx | 385 ++++++++++ chapters/rum/chapter3/3.mdx | 172 +++++ chapters/rum/chapter3/3_tf.mdx | 199 ++++++ chapters/rum/chapter3/4.mdx | 359 ++++++++++ chapters/rum/chapter3/5.mdx | 25 + chapters/rum/chapter3/6.mdx | 301 ++++++++ chapters/rum/chapter4/1.mdx | 22 + chapters/rum/chapter4/2.mdx | 96 +++ chapters/rum/chapter4/3.mdx | 641 +++++++++++++++++ chapters/rum/chapter4/4.mdx | 87 +++ chapters/rum/chapter4/5.mdx | 12 + chapters/rum/chapter4/6.mdx | 228 ++++++ chapters/rum/chapter5/1.mdx | 22 + chapters/rum/chapter5/2.mdx | 167 +++++ chapters/rum/chapter5/3.mdx | 744 ++++++++++++++++++++ chapters/rum/chapter5/4.mdx | 287 ++++++++ chapters/rum/chapter5/5.mdx | 406 +++++++++++ chapters/rum/chapter5/6.mdx | 518 ++++++++++++++ chapters/rum/chapter5/7.mdx | 16 + chapters/rum/chapter5/8.mdx | 231 ++++++ chapters/rum/chapter6/1.mdx | 19 + chapters/rum/chapter6/10.mdx | 283 ++++++++ chapters/rum/chapter6/2.mdx | 257 +++++++ chapters/rum/chapter6/3.mdx | 473 +++++++++++++ chapters/rum/chapter6/3b.mdx | 642 +++++++++++++++++ chapters/rum/chapter6/4.mdx | 123 ++++ chapters/rum/chapter6/5.mdx | 360 ++++++++++ chapters/rum/chapter6/6.mdx | 374 ++++++++++ chapters/rum/chapter6/7.mdx | 381 ++++++++++ chapters/rum/chapter6/8.mdx | 565 +++++++++++++++ chapters/rum/chapter6/9.mdx | 16 + chapters/rum/chapter7/1.mdx | 38 + chapters/rum/chapter7/2.mdx | 981 ++++++++++++++++++++++++++ chapters/rum/chapter7/3.mdx | 1044 +++++++++++++++++++++++++++ chapters/rum/chapter7/4.mdx | 1002 ++++++++++++++++++++++++++ chapters/rum/chapter7/5.mdx | 1072 ++++++++++++++++++++++++++++ chapters/rum/chapter7/6.mdx | 914 ++++++++++++++++++++++++ chapters/rum/chapter7/7.mdx | 1203 ++++++++++++++++++++++++++++++++ chapters/rum/chapter7/8.mdx | 22 + chapters/rum/chapter7/9.mdx | 329 +++++++++ chapters/rum/chapter8/1.mdx | 17 + chapters/rum/chapter8/2.mdx | 364 ++++++++++ chapters/rum/chapter8/3.mdx | 164 +++++ chapters/rum/chapter8/4.mdx | 792 +++++++++++++++++++++ chapters/rum/chapter8/4_tf.mdx | 486 +++++++++++++ chapters/rum/chapter8/5.mdx | 92 +++ chapters/rum/chapter8/6.mdx | 12 + chapters/rum/chapter8/7.mdx | 204 ++++++ chapters/rum/chapter9/1.mdx | 37 + chapters/rum/chapter9/2.mdx | 118 ++++ chapters/rum/chapter9/3.mdx | 186 +++++ chapters/rum/chapter9/4.mdx | 147 ++++ chapters/rum/chapter9/5.mdx | 67 ++ chapters/rum/chapter9/6.mdx | 102 +++ chapters/rum/chapter9/7.mdx | 236 +++++++ chapters/rum/chapter9/8.mdx | 24 + chapters/rum/chapter9/9.mdx | 239 +++++++ chapters/rum/events/1.mdx | 49 ++ chapters/rum/events/2.mdx | 165 +++++ 
chapters/rum/events/3.mdx | 9 + 81 files changed, 21551 insertions(+) create mode 100644 chapters/rum/_toctree.yml create mode 100644 chapters/rum/chapter0/1.mdx create mode 100644 chapters/rum/chapter1/1.mdx create mode 100644 chapters/rum/chapter1/10.mdx create mode 100644 chapters/rum/chapter1/2.mdx create mode 100644 chapters/rum/chapter1/3.mdx create mode 100644 chapters/rum/chapter1/4.mdx create mode 100644 chapters/rum/chapter1/5.mdx create mode 100644 chapters/rum/chapter1/6.mdx create mode 100644 chapters/rum/chapter1/7.mdx create mode 100644 chapters/rum/chapter1/8.mdx create mode 100644 chapters/rum/chapter1/9.mdx create mode 100644 chapters/rum/chapter2/1.mdx create mode 100644 chapters/rum/chapter2/2.mdx create mode 100644 chapters/rum/chapter2/3.mdx create mode 100644 chapters/rum/chapter2/4.mdx create mode 100644 chapters/rum/chapter2/5.mdx create mode 100644 chapters/rum/chapter2/6.mdx create mode 100644 chapters/rum/chapter2/7.mdx create mode 100644 chapters/rum/chapter2/8.mdx create mode 100644 chapters/rum/chapter3/1.mdx create mode 100644 chapters/rum/chapter3/2.mdx create mode 100644 chapters/rum/chapter3/3.mdx create mode 100644 chapters/rum/chapter3/3_tf.mdx create mode 100644 chapters/rum/chapter3/4.mdx create mode 100644 chapters/rum/chapter3/5.mdx create mode 100644 chapters/rum/chapter3/6.mdx create mode 100644 chapters/rum/chapter4/1.mdx create mode 100644 chapters/rum/chapter4/2.mdx create mode 100644 chapters/rum/chapter4/3.mdx create mode 100644 chapters/rum/chapter4/4.mdx create mode 100644 chapters/rum/chapter4/5.mdx create mode 100644 chapters/rum/chapter4/6.mdx create mode 100644 chapters/rum/chapter5/1.mdx create mode 100644 chapters/rum/chapter5/2.mdx create mode 100644 chapters/rum/chapter5/3.mdx create mode 100644 chapters/rum/chapter5/4.mdx create mode 100644 chapters/rum/chapter5/5.mdx create mode 100644 chapters/rum/chapter5/6.mdx create mode 100644 chapters/rum/chapter5/7.mdx create mode 100644 chapters/rum/chapter5/8.mdx create mode 100644 chapters/rum/chapter6/1.mdx create mode 100644 chapters/rum/chapter6/10.mdx create mode 100644 chapters/rum/chapter6/2.mdx create mode 100644 chapters/rum/chapter6/3.mdx create mode 100644 chapters/rum/chapter6/3b.mdx create mode 100644 chapters/rum/chapter6/4.mdx create mode 100644 chapters/rum/chapter6/5.mdx create mode 100644 chapters/rum/chapter6/6.mdx create mode 100644 chapters/rum/chapter6/7.mdx create mode 100644 chapters/rum/chapter6/8.mdx create mode 100644 chapters/rum/chapter6/9.mdx create mode 100644 chapters/rum/chapter7/1.mdx create mode 100644 chapters/rum/chapter7/2.mdx create mode 100644 chapters/rum/chapter7/3.mdx create mode 100644 chapters/rum/chapter7/4.mdx create mode 100644 chapters/rum/chapter7/5.mdx create mode 100644 chapters/rum/chapter7/6.mdx create mode 100644 chapters/rum/chapter7/7.mdx create mode 100644 chapters/rum/chapter7/8.mdx create mode 100644 chapters/rum/chapter7/9.mdx create mode 100644 chapters/rum/chapter8/1.mdx create mode 100644 chapters/rum/chapter8/2.mdx create mode 100644 chapters/rum/chapter8/3.mdx create mode 100644 chapters/rum/chapter8/4.mdx create mode 100644 chapters/rum/chapter8/4_tf.mdx create mode 100644 chapters/rum/chapter8/5.mdx create mode 100644 chapters/rum/chapter8/6.mdx create mode 100644 chapters/rum/chapter8/7.mdx create mode 100644 chapters/rum/chapter9/1.mdx create mode 100644 chapters/rum/chapter9/2.mdx create mode 100644 chapters/rum/chapter9/3.mdx create mode 100644 chapters/rum/chapter9/4.mdx create mode 100644 chapters/rum/chapter9/5.mdx 
create mode 100644 chapters/rum/chapter9/6.mdx create mode 100644 chapters/rum/chapter9/7.mdx create mode 100644 chapters/rum/chapter9/8.mdx create mode 100644 chapters/rum/chapter9/9.mdx create mode 100644 chapters/rum/events/1.mdx create mode 100644 chapters/rum/events/2.mdx create mode 100644 chapters/rum/events/3.mdx diff --git a/chapters/rum/_toctree.yml b/chapters/rum/_toctree.yml new file mode 100644 index 000000000..c8364cc6d --- /dev/null +++ b/chapters/rum/_toctree.yml @@ -0,0 +1,201 @@ +- title: 0. Setup + sections: + - local: chapter0/1 + title: Introduction + +- title: 1. Transformer models + sections: + - local: chapter1/1 + title: Introduction + - local: chapter1/2 + title: Natural Language Processing + - local: chapter1/3 + title: Transformers, what can they do? + - local: chapter1/4 + title: How do Transformers work? + - local: chapter1/5 + title: Encoder models + - local: chapter1/6 + title: Decoder models + - local: chapter1/7 + title: Sequence-to-sequence models + - local: chapter1/8 + title: Bias and limitations + - local: chapter1/9 + title: Summary + - local: chapter1/10 + title: End-of-chapter quiz + quiz: 1 + +- title: 2. Using 🤗 Transformers + sections: + - local: chapter2/1 + title: Introduction + - local: chapter2/2 + title: Behind the pipeline + - local: chapter2/3 + title: Models + - local: chapter2/4 + title: Tokenizers + - local: chapter2/5 + title: Handling multiple sequences + - local: chapter2/6 + title: Putting it all together + - local: chapter2/7 + title: Basic usage completed! + - local: chapter2/8 + title: End-of-chapter quiz + quiz: 2 + +- title: 3. Fine-tuning a pretrained model + sections: + - local: chapter3/1 + title: Introduction + - local: chapter3/2 + title: Processing the data + - local: chapter3/3 + title: Fine-tuning a model with the Trainer API or Keras + local_fw: { pt: chapter3/3, tf: chapter3/3_tf } + - local: chapter3/4 + title: A full training + - local: chapter3/5 + title: Fine-tuning, Check! + - local: chapter3/6 + title: End-of-chapter quiz + quiz: 3 + +- title: 4. Sharing models and tokenizers + sections: + - local: chapter4/1 + title: The Hugging Face Hub + - local: chapter4/2 + title: Using pretrained models + - local: chapter4/3 + title: Sharing pretrained models + - local: chapter4/4 + title: Building a model card + - local: chapter4/5 + title: Part 1 completed! + - local: chapter4/6 + title: End-of-chapter quiz + quiz: 4 + +- title: 5. The 🤗 Datasets library + sections: + - local: chapter5/1 + title: Introduction + - local: chapter5/2 + title: What if my dataset isn't on the Hub? + - local: chapter5/3 + title: Time to slice and dice + - local: chapter5/4 + title: Big data? 🤗 Datasets to the rescue! + - local: chapter5/5 + title: Creating your own dataset + - local: chapter5/6 + title: Semantic search with FAISS + - local: chapter5/7 + title: 🤗 Datasets, check! + - local: chapter5/8 + title: End-of-chapter quiz + quiz: 5 + +- title: 6. 
The 🤗 Tokenizers library + sections: + - local: chapter6/1 + title: Introduction + - local: chapter6/2 + title: Training a new tokenizer from an old one + - local: chapter6/3 + title: Fast tokenizers' special powers + - local: chapter6/3b + title: Fast tokenizers in the QA pipeline + - local: chapter6/4 + title: Normalization and pre-tokenization + - local: chapter6/5 + title: Byte-Pair Encoding tokenization + - local: chapter6/6 + title: WordPiece tokenization + - local: chapter6/7 + title: Unigram tokenization + - local: chapter6/8 + title: Building a tokenizer, block by block + - local: chapter6/9 + title: Tokenizers, check! + - local: chapter6/10 + title: End-of-chapter quiz + quiz: 6 + +- title: 7. Main NLP tasks + sections: + - local: chapter7/1 + title: Introduction + - local: chapter7/2 + title: Token classification + - local: chapter7/3 + title: Fine-tuning a masked language model + - local: chapter7/4 + title: Translation + - local: chapter7/5 + title: Summarization + - local: chapter7/6 + title: Training a causal language model from scratch + - local: chapter7/7 + title: Question answering + - local: chapter7/8 + title: Mastering NLP + - local: chapter7/9 + title: End-of-chapter quiz + quiz: 7 + +- title: 8. How to ask for help + sections: + - local: chapter8/1 + title: Introduction + - local: chapter8/2 + title: What to do when you get an error + - local: chapter8/3 + title: Asking for help on the forums + - local: chapter8/4 + title: Debugging the training pipeline + local_fw: { pt: chapter8/4, tf: chapter8/4_tf } + - local: chapter8/5 + title: How to write a good issue + - local: chapter8/6 + title: Part 2 completed! + - local: chapter8/7 + title: End-of-chapter quiz + quiz: 8 + +- title: 9. Building and sharing demos + new: true + subtitle: I trained a model, but how can I show it off? + sections: + - local: chapter9/1 + title: Introduction to Gradio + - local: chapter9/2 + title: Building your first demo + - local: chapter9/3 + title: Understanding the Interface class + - local: chapter9/4 + title: Sharing demos with others + - local: chapter9/5 + title: Integrations with the Hugging Face Hub + - local: chapter9/6 + title: Advanced Interface features + - local: chapter9/7 + title: Introduction to Blocks + - local: chapter9/8 + title: Gradio, check! + - local: chapter9/9 + title: End-of-chapter quiz + quiz: 9 + +- title: Course Events + sections: + - local: events/1 + title: Live sessions and workshops + - local: events/2 + title: Part 2 release event + - local: events/3 + title: Gradio Blocks party diff --git a/chapters/rum/chapter0/1.mdx b/chapters/rum/chapter0/1.mdx new file mode 100644 index 000000000..40e21bf91 --- /dev/null +++ b/chapters/rum/chapter0/1.mdx @@ -0,0 +1,110 @@ +# Introduction[[introduction]] + +Welcome to the Hugging Face course! This introduction will guide you through setting up a working environment. If you're just starting the course, we recommend you first take a look at [Chapter 1](/course/chapter1), then come back and set up your environment so you can try the code yourself. + +All the libraries that we'll be using in this course are available as Python packages, so here we'll show you how to set up a Python environment and install the specific libraries you'll need. + +We'll cover two ways of setting up your working environment, using a Colab notebook or a Python virtual environment. Feel free to choose the one that resonates with you the most. For beginners, we strongly recommend that you get started by using a Colab notebook. 
+ +Note that we will not be covering the Windows system. If you're running on Windows, we recommend following along using a Colab notebook. If you're using a Linux distribution or macOS, you can use either approach described here. + +Most of the course relies on you having a Hugging Face account. We recommend creating one now: [create an account](https://huggingface.co/join). + +## Using a Google Colab notebook[[using-a-google-colab-notebook]] + +Using a Colab notebook is the simplest possible setup; boot up a notebook in your browser and get straight to coding! + +If you're not familiar with Colab, we recommend you start by following the [introduction](https://colab.research.google.com/notebooks/intro.ipynb). Colab allows you to use some accelerating hardware, like GPUs or TPUs, and it is free for smaller workloads. + +Once you're comfortable moving around in Colab, create a new notebook and get started with the setup: + +
+An empty Colab notebook +
+ +The next step is to install the libraries that we'll be using in this course. We'll use `pip` for the installation, which is the package manager for Python. In notebooks, you can run system commands by preceding them with the `!` character, so you can install the 🤗 Transformers library as follows: + +``` +!pip install transformers +``` + +You can make sure the package was correctly installed by importing it within your Python runtime: + +``` +import transformers +``` + +
+A gif showing the result of the two commands above: installation and import +
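If you want to double-check which version of the library ended up installed (an optional sanity check, assuming the cells above ran without errors), you can print its version string:

```
import transformers

# The version attribute is set at import time by the package itself
print(transformers.__version__)
```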
+ +This installs a very light version of 🤗 Transformers. In particular, no specific machine learning frameworks (like PyTorch or TensorFlow) are installed. Since we'll be using a lot of different features of the library, we recommend installing the development version, which comes with all the required dependencies for pretty much any imaginable use case: + +``` +!pip install transformers[sentencepiece] +``` + +This will take a bit of time, but then you'll be ready to go for the rest of the course! + +## Using a Python virtual environment[[using-a-python-virtual-environment]] + +If you prefer to use a Python virtual environment, the first step is to install Python on your system. We recommend following [this guide](https://realpython.com/installing-python/) to get started. + +Once you have Python installed, you should be able to run Python commands in your terminal. You can start by running the following command to ensure that it is correctly installed before proceeding to the next steps: `python --version`. This should print out the Python version now available on your system. + +When running a Python command in your terminal, such as `python --version`, you should think of the program running your command as the "main" Python on your system. We recommend keeping this main installation free of any packages, and using it to create separate environments for each application you work on — this way, each application can have its own dependencies and packages, and you won't need to worry about potential compatibility issues with other applications. + +In Python this is done with [*virtual environments*](https://docs.python.org/3/tutorial/venv.html), which are self-contained directory trees that each contain a Python installation with a particular Python version alongside all the packages the application needs. Creating such a virtual environment can be done with a number of different tools, but we'll use the official Python package for that purpose, which is called [`venv`](https://docs.python.org/3/library/venv.html#module-venv). + +First, create the directory you'd like your application to live in — for example, you might want to make a new directory called *transformers-course* at the root of your home directory: + +``` +mkdir ~/transformers-course +cd ~/transformers-course +``` + +From inside this directory, create a virtual environment using the Python `venv` module: + +``` +python -m venv .env +``` + +You should now have a directory called *.env* in your otherwise empty folder: + +``` +ls -a +``` + +```out +. .. .env +``` + +You can jump in and out of your virtual environment with the `activate` and `deactivate` scripts: + +``` +# Activate the virtual environment +source .env/bin/activate + +# Deactivate the virtual environment +deactivate +``` + +You can make sure that the environment is activated by running the `which python` command: if it points to the virtual environment, then you have successfully activated it! + +``` +which python +``` + +```out +/home//transformers-course/.env/bin/python +``` + +### Installing dependencies[[installing-dependencies]] + +As in the previous section on using Google Colab instances, you'll now need to install the packages required to continue. Again, you can install the development version of 🤗 Transformers using the `pip` package manager: + +``` +pip install "transformers[sentencepiece]" +``` + +You're now all set up and ready to go! 
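As a final, optional check, you can verify that everything works end to end by running a tiny pipeline. This is only a sketch: it assumes an internet connection, and the first call downloads a small default sentiment-analysis checkpoint into your local cache.

```
from transformers import pipeline

# Downloads a default sentiment-analysis checkpoint on first use, then runs one prediction
classifier = pipeline("sentiment-analysis")
print(classifier("Setting up my working environment was painless!"))
```

If this prints a label and a score, your environment is ready for the rest of the course.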
diff --git a/chapters/rum/chapter1/1.mdx b/chapters/rum/chapter1/1.mdx new file mode 100644 index 000000000..30c992371 --- /dev/null +++ b/chapters/rum/chapter1/1.mdx @@ -0,0 +1,109 @@ +# Introduction[[introduction]] + + + +## Welcome to the 🤗 Course![[welcome-to-the-course]] + + + +This course will teach you about natural language processing (NLP) using libraries from the [Hugging Face](https://huggingface.co/) ecosystem — [🤗 Transformers](https://github.com/huggingface/transformers), [🤗 Datasets](https://github.com/huggingface/datasets), [🤗 Tokenizers](https://github.com/huggingface/tokenizers), and [🤗 Accelerate](https://github.com/huggingface/accelerate) — as well as the [Hugging Face Hub](https://huggingface.co/models). It's completely free and without ads. + + +## What to expect?[[what-to-expect]] + +Here is a brief overview of the course: + +
+Brief overview of the chapters of the course. + +
+ +- Chapters 1 to 4 provide an introduction to the main concepts of the 🤗 Transformers library. By the end of this part of the course, you will be familiar with how Transformer models work and will know how to use a model from the [Hugging Face Hub](https://huggingface.co/models), fine-tune it on a dataset, and share your results on the Hub! +- Chapters 5 to 8 teach the basics of 🤗 Datasets and 🤗 Tokenizers before diving into classic NLP tasks. By the end of this part, you will be able to tackle the most common NLP problems by yourself. +- Chapters 9 to 12 go beyond NLP, and explore how Transformer models can be used to tackle tasks in speech processing and computer vision. Along the way, you'll learn how to build and share demos of your models, and optimize them for production environments. By the end of this part, you will be ready to apply 🤗 Transformers to (almost) any machine learning problem! + +This course: + +* Requires a good knowledge of Python +* Is better taken after an introductory deep learning course, such as [fast.ai's](https://www.fast.ai/) [Practical Deep Learning for Coders](https://course.fast.ai/) or one of the programs developed by [DeepLearning.AI](https://www.deeplearning.ai/) +* Does not expect prior [PyTorch](https://pytorch.org/) or [TensorFlow](https://www.tensorflow.org/) knowledge, though some familiarity with either of those will help + +After you've completed this course, we recommend checking out DeepLearning.AI's [Natural Language Processing Specialization](https://www.coursera.org/specializations/natural-language-processing?utm_source=deeplearning-ai&utm_medium=institutions&utm_campaign=20211011-nlp-2-hugging_face-page-nlp-refresh), which covers a wide range of traditional NLP models like naive Bayes and LSTMs that are well worth knowing about! + +## Who are we?[[who-are-we]] + +About the authors: + +[**Abubakar Abid**](https://huggingface.co/abidlabs) completed his PhD at Stanford in applied machine learning. During his PhD, he founded [Gradio](https://github.com/gradio-app/gradio), an open-source Python library that has been used to build over 600,000 machine learning demos. Gradio was acquired by Hugging Face, which is where Abubakar now serves as a machine learning team lead. + +[**Matthew Carrigan**](https://huggingface.co/Rocketknight1) is a Machine Learning Engineer at Hugging Face. He lives in Dublin, Ireland and previously worked as an ML engineer at Parse.ly and before that as a post-doctoral researcher at Trinity College Dublin. He does not believe we're going to get to AGI by scaling existing architectures, but has high hopes for robot immortality regardless. + +[**Lysandre Debut**](https://huggingface.co/lysandre) is a Machine Learning Engineer at Hugging Face and has been working on the 🤗 Transformers library since the very early development stages. His aim is to make NLP accessible for everyone by developing tools with a very simple API. + +[**Sylvain Gugger**](https://huggingface.co/sgugger) is a Research Engineer at Hugging Face and one of the core maintainers of the 🤗 Transformers library. Previously he was a Research Scientist at fast.ai, and he co-wrote _[Deep Learning for Coders with fastai and PyTorch](https://learning.oreilly.com/library/view/deep-learning-for/9781492045519/)_ with Jeremy Howard. The main focus of his research is on making deep learning more accessible, by designing and improving techniques that allow models to train fast on limited resources. 
+ +[**Dawood Khan**](https://huggingface.co/dawoodkhan82) is a Machine Learning Engineer at Hugging Face. He's from NYC and graduated from New York University studying Computer Science. After working as an iOS Engineer for a few years, Dawood quit to start Gradio with his fellow co-founders. Gradio was eventually acquired by Hugging Face. + +[**Merve Noyan**](https://huggingface.co/merve) is a developer advocate at Hugging Face, working on developing tools and building content around them to democratize machine learning for everyone. + +[**Lucile Saulnier**](https://huggingface.co/SaulLu) is a machine learning engineer at Hugging Face, developing and supporting the use of open source tools. She is also actively involved in many research projects in the field of Natural Language Processing such as collaborative training and BigScience. + +[**Lewis Tunstall**](https://huggingface.co/lewtun) is a machine learning engineer at Hugging Face, focused on developing open-source tools and making them accessible to the wider community. He is also a co-author of the O’Reilly book [Natural Language Processing with Transformers](https://www.oreilly.com/library/view/natural-language-processing/9781098136789/). + +[**Leandro von Werra**](https://huggingface.co/lvwerra) is a machine learning engineer in the open-source team at Hugging Face and also a co-author of the O’Reilly book [Natural Language Processing with Transformers](https://www.oreilly.com/library/view/natural-language-processing/9781098136789/). He has several years of industry experience bringing NLP projects to production by working across the whole machine learning stack.. + +## FAQ[[faq]] + +Here are some answers to frequently asked questions: + +- **Does taking this course lead to a certification?** +Currently we do not have any certification for this course. However, we are working on a certification program for the Hugging Face ecosystem -- stay tuned! + +- **How much time should I spend on this course?** +Each chapter in this course is designed to be completed in 1 week, with approximately 6-8 hours of work per week. However, you can take as much time as you need to complete the course. + +- **Where can I ask a question if I have one?** +If you have a question about any section of the course, just click on the "*Ask a question*" banner at the top of the page to be automatically redirected to the right section of the [Hugging Face forums](https://discuss.huggingface.co/): + +Link to the Hugging Face forums + +Note that a list of [project ideas](https://discuss.huggingface.co/c/course/course-event/25) is also available on the forums if you wish to practice more once you have completed the course. + +- **Where can I get the code for the course?** +For each section, click on the banner at the top of the page to run the code in either Google Colab or Amazon SageMaker Studio Lab: + +Link to the Hugging Face course notebooks + +The Jupyter notebooks containing all the code from the course are hosted on the [`huggingface/notebooks`](https://github.com/huggingface/notebooks) repo. If you wish to generate them locally, check out the instructions in the [`course`](https://github.com/huggingface/course#-jupyter-notebooks) repo on GitHub. + + +- **How can I contribute to the course?** +There are many ways to contribute to the course! If you find a typo or a bug, please open an issue on the [`course`](https://github.com/huggingface/course) repo. 
If you would like to help translate the course into your native language, check out the instructions [here](https://github.com/huggingface/course#translating-the-course-into-your-language). + +- ** What were the choices made for each translation?** +Each translation has a glossary and `TRANSLATING.txt` file that details the choices that were made for machine learning jargon etc. You can find an example for German [here](https://github.com/huggingface/course/blob/main/chapters/de/TRANSLATING.txt). + + +- **Can I reuse this course?** +Of course! The course is released under the permissive [Apache 2 license](https://www.apache.org/licenses/LICENSE-2.0.html). This means that you must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use. If you would like to cite the course, please use the following BibTeX: + +``` +@misc{huggingfacecourse, + author = {Hugging Face}, + title = {The Hugging Face Course, 2022}, + howpublished = "\url{https://huggingface.co/course}", + year = {2022}, + note = "[Online; accessed ]" +} +``` + +## Let's Go +Are you ready to roll? In this chapter, you will learn: + +* How to use the `pipeline()` function to solve NLP tasks such as text generation and classification +* About the Transformer architecture +* How to distinguish between encoder, decoder, and encoder-decoder architectures and use cases + diff --git a/chapters/rum/chapter1/10.mdx b/chapters/rum/chapter1/10.mdx new file mode 100644 index 000000000..1e14a5c95 --- /dev/null +++ b/chapters/rum/chapter1/10.mdx @@ -0,0 +1,258 @@ + + +# End-of-chapter quiz[[end-of-chapter-quiz]] + + + +This chapter covered a lot of ground! Don't worry if you didn't grasp all the details; the next chapters will help you understand how things work under the hood. + +First, though, let's test what you learned in this chapter! + + +### 1. Explore the Hub and look for the `roberta-large-mnli` checkpoint. What task does it perform? + + +roberta-large-mnli page." + }, + { + text: "Text classification", + explain: "More precisely, it classifies if two sentences are logically linked across three labels (contradiction, neutral, entailment) — a task also called natural language inference.", + correct: true + }, + { + text: "Text generation", + explain: "Look again on the roberta-large-mnli page." + } + ]} +/> + +### 2. What will the following code return? + +```py +from transformers import pipeline + +ner = pipeline("ner", grouped_entities=True) +ner("My name is Sylvain and I work at Hugging Face in Brooklyn.") +``` + +sentiment-analysis pipeline." + }, + { + text: "It will return a generated text completing this sentence.", + explain: "This is incorrect — it would be a text-generation pipeline.", + }, + { + text: "It will return the words representing persons, organizations or locations.", + explain: "Furthermore, with grouped_entities=True, it will group together the words belonging to the same entity, like \"Hugging Face\".", + correct: true + } + ]} +/> + +### 3. What should replace ... in this code sample? + +```py +from transformers import pipeline + +filler = pipeline("fill-mask", model="bert-base-cased") +result = filler("...") +``` + + has been waiting for you.", + explain: "This is incorrect. Check out the bert-base-cased model card and try to spot your mistake." + }, + { + text: "This [MASK] has been waiting for you.", + explain: "Correct! 
This model's mask token is [MASK].", + correct: true + }, + { + text: "This man has been waiting for you.", + explain: "This is incorrect. This pipeline fills in masked words, so it needs a mask token somewhere." + } + ]} +/> + +### 4. Why will this code fail? + +```py +from transformers import pipeline + +classifier = pipeline("zero-shot-classification") +result = classifier("This is a course about the Transformers library") +``` + +candidate_labels=[...].", + correct: true + }, + { + text: "This pipeline requires several sentences, not just one.", + explain: "This is incorrect, though when properly used, this pipeline can take a list of sentences to process (like all other pipelines)." + }, + { + text: "The 🤗 Transformers library is broken, as usual.", + explain: "We won't dignify this answer with a comment!" + }, + { + text: "This pipeline requires longer inputs; this one is too short.", + explain: "This is incorrect. Note that a very long text will be truncated when processed by this pipeline." + } + ]} +/> + +### 5. What does "transfer learning" mean? + + + +### 6. True or false? A language model usually does not need labels for its pretraining. + +self-supervised, which means the labels are created automatically from the inputs (like predicting the next word or filling in some masked words).", + correct: true + }, + { + text: "False", + explain: "This is not the correct answer." + } + ]} +/> + +### 7. Select the sentence that best describes the terms "model", "architecture", and "weights". + + + + +### 8. Which of these types of models would you use for completing prompts with generated text? + + + +### 9. Which of those types of models would you use for summarizing texts? + + + +### 10. Which of these types of models would you use for classifying text inputs according to certain labels? + + + +### 11. What possible source can the bias observed in a model have? + + diff --git a/chapters/rum/chapter1/2.mdx b/chapters/rum/chapter1/2.mdx new file mode 100644 index 000000000..eb84c4be5 --- /dev/null +++ b/chapters/rum/chapter1/2.mdx @@ -0,0 +1,26 @@ +# Natural Language Processing[[natural-language-processing]] + + + +Before jumping into Transformer models, let's do a quick overview of what natural language processing is and why we care about it. + +## What is NLP?[[what-is-nlp]] + +NLP is a field of linguistics and machine learning focused on understanding everything related to human language. The aim of NLP tasks is not only to understand single words individually, but to be able to understand the context of those words. + +The following is a list of common NLP tasks, with some examples of each: + +- **Classifying whole sentences**: Getting the sentiment of a review, detecting if an email is spam, determining if a sentence is grammatically correct or whether two sentences are logically related or not +- **Classifying each word in a sentence**: Identifying the grammatical components of a sentence (noun, verb, adjective), or the named entities (person, location, organization) +- **Generating text content**: Completing a prompt with auto-generated text, filling in the blanks in a text with masked words +- **Extracting an answer from a text**: Given a question and a context, extracting the answer to the question based on the information provided in the context +- **Generating a new sentence from an input text**: Translating a text into another language, summarizing a text + +NLP isn't limited to written text though. 
It also tackles complex challenges in speech recognition and computer vision, such as generating a transcript of an audio sample or a description of an image. + +## Why is it challenging?[[why-is-it-challenging]] + +Computers don't process information in the same way as humans. For example, when we read the sentence "I am hungry," we can easily understand its meaning. Similarly, given two sentences such as "I am hungry" and "I am sad," we're able to easily determine how similar they are. For machine learning (ML) models, such tasks are more difficult. The text needs to be processed in a way that enables the model to learn from it. And because language is complex, we need to think carefully about how this processing must be done. There has been a lot of research done on how to represent text, and we will look at some methods in the next chapter. diff --git a/chapters/rum/chapter1/3.mdx b/chapters/rum/chapter1/3.mdx new file mode 100644 index 000000000..a31638e9e --- /dev/null +++ b/chapters/rum/chapter1/3.mdx @@ -0,0 +1,329 @@ +# Transformers, what can they do?[[transformers-what-can-they-do]] + + + +In this section, we will look at what Transformer models can do and use our first tool from the 🤗 Transformers library: the `pipeline()` function. + + +👀 See that Open in Colab button on the top right? Click on it to open a Google Colab notebook with all the code samples of this section. This button will be present in any section containing code examples. + +If you want to run the examples locally, we recommend taking a look at the setup. + + +## Transformers are everywhere![[transformers-are-everywhere]] + +Transformer models are used to solve all kinds of NLP tasks, like the ones mentioned in the previous section. Here are some of the companies and organizations using Hugging Face and Transformer models, who also contribute back to the community by sharing their models: + +Companies using Hugging Face + +The [🤗 Transformers library](https://github.com/huggingface/transformers) provides the functionality to create and use those shared models. The [Model Hub](https://huggingface.co/models) contains thousands of pretrained models that anyone can download and use. You can also upload your own models to the Hub! + + +⚠️ The Hugging Face Hub is not limited to Transformer models. Anyone can share any kind of models or datasets they want! Create a huggingface.co account to benefit from all available features! + + +Before diving into how Transformer models work under the hood, let's look at a few examples of how they can be used to solve some interesting NLP problems. + +## Working with pipelines[[working-with-pipelines]] + + + +The most basic object in the 🤗 Transformers library is the `pipeline()` function. It connects a model with its necessary preprocessing and postprocessing steps, allowing us to directly input any text and get an intelligible answer: + +```python +from transformers import pipeline + +classifier = pipeline("sentiment-analysis") +classifier("I've been waiting for a HuggingFace course my whole life.") +``` + +```python out +[{'label': 'POSITIVE', 'score': 0.9598047137260437}] +``` + +We can even pass several sentences! 
+ +```python +classifier( + ["I've been waiting for a HuggingFace course my whole life.", "I hate this so much!"] +) +``` + +```python out +[{'label': 'POSITIVE', 'score': 0.9598047137260437}, + {'label': 'NEGATIVE', 'score': 0.9994558095932007}] +``` + +By default, this pipeline selects a particular pretrained model that has been fine-tuned for sentiment analysis in English. The model is downloaded and cached when you create the `classifier` object. If you rerun the command, the cached model will be used instead and there is no need to download the model again. + +There are three main steps involved when you pass some text to a pipeline: + +1. The text is preprocessed into a format the model can understand. +2. The preprocessed inputs are passed to the model. +3. The predictions of the model are post-processed, so you can make sense of them. + + +Some of the currently [available pipelines](https://huggingface.co/transformers/main_classes/pipelines) are: + +- `feature-extraction` (get the vector representation of a text) +- `fill-mask` +- `ner` (named entity recognition) +- `question-answering` +- `sentiment-analysis` +- `summarization` +- `text-generation` +- `translation` +- `zero-shot-classification` + +Let's have a look at a few of these! + +## Zero-shot classification[[zero-shot-classification]] + +We'll start by tackling a more challenging task where we need to classify texts that haven't been labelled. This is a common scenario in real-world projects because annotating text is usually time-consuming and requires domain expertise. For this use case, the `zero-shot-classification` pipeline is very powerful: it allows you to specify which labels to use for the classification, so you don't have to rely on the labels of the pretrained model. You've already seen how the model can classify a sentence as positive or negative using those two labels — but it can also classify the text using any other set of labels you like. + +```python +from transformers import pipeline + +classifier = pipeline("zero-shot-classification") +classifier( + "This is a course about the Transformers library", + candidate_labels=["education", "politics", "business"], +) +``` + +```python out +{'sequence': 'This is a course about the Transformers library', + 'labels': ['education', 'business', 'politics'], + 'scores': [0.8445963859558105, 0.111976258456707, 0.043427448719739914]} +``` + +This pipeline is called _zero-shot_ because you don't need to fine-tune the model on your data to use it. It can directly return probability scores for any list of labels you want! + + + +✏️ **Try it out!** Play around with your own sequences and labels and see how the model behaves. + + + + +## Text generation[[text-generation]] + +Now let's see how to use a pipeline to generate some text. The main idea here is that you provide a prompt and the model will auto-complete it by generating the remaining text. This is similar to the predictive text feature that is found on many phones. Text generation involves randomness, so it's normal if you don't get the same results as shown below. + +```python +from transformers import pipeline + +generator = pipeline("text-generation") +generator("In this course, we will teach you how to") +``` + +```python out +[{'generated_text': 'In this course, we will teach you how to understand and use ' + 'data flow and data interchange when handling user data. 
We ' + 'will be working with one or more of the most commonly used ' + 'data flows — data flows of various types, as seen by the ' + 'HTTP'}] +``` + +You can control how many different sequences are generated with the argument `num_return_sequences` and the total length of the output text with the argument `max_length`. + + + +✏️ **Try it out!** Use the `num_return_sequences` and `max_length` arguments to generate two sentences of 15 words each. + + + + +## Using any model from the Hub in a pipeline[[using-any-model-from-the-hub-in-a-pipeline]] + +The previous examples used the default model for the task at hand, but you can also choose a particular model from the Hub to use in a pipeline for a specific task — say, text generation. Go to the [Model Hub](https://huggingface.co/models) and click on the corresponding tag on the left to display only the supported models for that task. You should get to a page like [this one](https://huggingface.co/models?pipeline_tag=text-generation). + +Let's try the [`distilgpt2`](https://huggingface.co/distilgpt2) model! Here's how to load it in the same pipeline as before: + +```python +from transformers import pipeline + +generator = pipeline("text-generation", model="distilgpt2") +generator( + "In this course, we will teach you how to", + max_length=30, + num_return_sequences=2, +) +``` + +```python out +[{'generated_text': 'In this course, we will teach you how to manipulate the world and ' + 'move your mental and physical capabilities to your advantage.'}, + {'generated_text': 'In this course, we will teach you how to become an expert and ' + 'practice realtime, and with a hands on experience on both real ' + 'time and real'}] +``` + +You can refine your search for a model by clicking on the language tags, and pick a model that will generate text in another language. The Model Hub even contains checkpoints for multilingual models that support several languages. + +Once you select a model by clicking on it, you'll see that there is a widget enabling you to try it directly online. This way you can quickly test the model's capabilities before downloading it. + + + +✏️ **Try it out!** Use the filters to find a text generation model for another language. Feel free to play with the widget and use it in a pipeline! + + + +### The Inference API[[the-inference-api]] + +All the models can be tested directly through your browser using the Inference API, which is available on the Hugging Face [website](https://huggingface.co/). You can play with the model directly on this page by inputting custom text and watching the model process the input data. + +The Inference API that powers the widget is also available as a paid product, which comes in handy if you need it for your workflows. See the [pricing page](https://huggingface.co/pricing) for more details. + +## Mask filling[[mask-filling]] + +The next pipeline you'll try is `fill-mask`. 
The idea of this task is to fill in the blanks in a given text:
+
+```python
+from transformers import pipeline
+
+unmasker = pipeline("fill-mask")
+unmasker("This course will teach you all about <mask> models.", top_k=2)
+```
+
+```python out
+[{'sequence': 'This course will teach you all about mathematical models.',
+  'score': 0.19619831442832947,
+  'token': 30412,
+  'token_str': ' mathematical'},
+ {'sequence': 'This course will teach you all about computational models.',
+  'score': 0.04052725434303284,
+  'token': 38163,
+  'token_str': ' computational'}]
+```
+
+The `top_k` argument controls how many possibilities you want to be displayed. Note that here the model fills in the special `<mask>` word, which is often referred to as a *mask token*. Other mask-filling models might have different mask tokens, so it's always good to verify the proper mask word when exploring other models. One way to check it is by looking at the mask word used in the widget.
+
+✏️ **Try it out!** Search for the `bert-base-cased` model on the Hub and identify its mask word in the Inference API widget. What does this model predict for the sentence in our `pipeline` example above?
+
+## Named entity recognition[[named-entity-recognition]]
+
+Named entity recognition (NER) is a task where the model has to find which parts of the input text correspond to entities such as persons, locations, or organizations. Let's look at an example:
+
+```python
+from transformers import pipeline
+
+ner = pipeline("ner", grouped_entities=True)
+ner("My name is Sylvain and I work at Hugging Face in Brooklyn.")
+```
+
+```python out
+[{'entity_group': 'PER', 'score': 0.99816, 'word': 'Sylvain', 'start': 11, 'end': 18},
+ {'entity_group': 'ORG', 'score': 0.97960, 'word': 'Hugging Face', 'start': 33, 'end': 45},
+ {'entity_group': 'LOC', 'score': 0.99321, 'word': 'Brooklyn', 'start': 49, 'end': 57}
+]
+```
+
+Here the model correctly identified that Sylvain is a person (PER), Hugging Face an organization (ORG), and Brooklyn a location (LOC).
+
+We pass the option `grouped_entities=True` in the pipeline creation function to tell the pipeline to regroup together the parts of the sentence that correspond to the same entity: here the model correctly grouped "Hugging" and "Face" as a single organization, even though the name consists of multiple words. In fact, as we will see in the next chapter, the preprocessing even splits some words into smaller parts. For instance, `Sylvain` is split into four pieces: `S`, `##yl`, `##va`, and `##in`. In the post-processing step, the pipeline successfully regrouped those pieces.
+
+✏️ **Try it out!** Search the Model Hub for a model able to do part-of-speech tagging (usually abbreviated as POS) in English. What does this model predict for the sentence in the example above?
+
+## Question answering[[question-answering]]
+
+The `question-answering` pipeline answers questions using information from a given context:
+
+```python
+from transformers import pipeline
+
+question_answerer = pipeline("question-answering")
+question_answerer(
+    question="Where do I work?",
+    context="My name is Sylvain and I work at Hugging Face in Brooklyn",
+)
+```
+
+```python out
+{'score': 0.6385916471481323, 'start': 33, 'end': 45, 'answer': 'Hugging Face'}
+```
+
+Note that this pipeline works by extracting information from the provided context; it does not generate the answer.
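To see concretely that the answer is extracted rather than generated, note that the `start` and `end` values in the output above are character offsets into the context you provided. A minimal sketch reusing the same example (the variable names are just for illustration):

```python
from transformers import pipeline

question_answerer = pipeline("question-answering")

context = "My name is Sylvain and I work at Hugging Face in Brooklyn"
result = question_answerer(question="Where do I work?", context=context)

# The answer is a literal span of the context: slicing with the returned
# offsets gives back exactly the same string as result["answer"]
print(context[result["start"] : result["end"]])  # 'Hugging Face'
print(result["answer"])                          # 'Hugging Face'
```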
+ +## Summarization[[summarization]] + +Summarization is the task of reducing a text into a shorter text while keeping all (or most) of the important aspects referenced in the text. Here's an example: + +```python +from transformers import pipeline + +summarizer = pipeline("summarization") +summarizer( + """ + America has changed dramatically during recent years. Not only has the number of + graduates in traditional engineering disciplines such as mechanical, civil, + electrical, chemical, and aeronautical engineering declined, but in most of + the premier American universities engineering curricula now concentrate on + and encourage largely the study of engineering science. As a result, there + are declining offerings in engineering subjects dealing with infrastructure, + the environment, and related issues, and greater concentration on high + technology subjects, largely supporting increasingly complex scientific + developments. While the latter is important, it should not be at the expense + of more traditional engineering. + + Rapidly developing economies such as China and India, as well as other + industrial countries in Europe and Asia, continue to encourage and advance + the teaching of engineering. Both China and India, respectively, graduate + six and eight times as many traditional engineers as does the United States. + Other industrial countries at minimum maintain their output, while America + suffers an increasingly serious decline in the number of engineering graduates + and a lack of well-educated engineers. +""" +) +``` + +```python out +[{'summary_text': ' America has changed dramatically during recent years . The ' + 'number of engineering graduates in the U.S. has declined in ' + 'traditional engineering disciplines such as mechanical, civil ' + ', electrical, chemical, and aeronautical engineering . Rapidly ' + 'developing economies such as China and India, as well as other ' + 'industrial countries in Europe and Asia, continue to encourage ' + 'and advance engineering .'}] +``` + +Like with text generation, you can specify a `max_length` or a `min_length` for the result. + + +## Translation[[translation]] + +For translation, you can use a default model if you provide a language pair in the task name (such as `"translation_en_to_fr"`), but the easiest way is to pick the model you want to use on the [Model Hub](https://huggingface.co/models). Here we'll try translating from French to English: + +```python +from transformers import pipeline + +translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en") +translator("Ce cours est produit par Hugging Face.") +``` + +```python out +[{'translation_text': 'This course is produced by Hugging Face.'}] +``` + +Like with text generation and summarization, you can specify a `max_length` or a `min_length` for the result. + + + +✏️ **Try it out!** Search for translation models in other languages and try to translate the previous sentence into a few different languages. + + + +The pipelines shown so far are mostly for demonstrative purposes. They were programmed for specific tasks and cannot perform variations of them. In the next chapter, you'll learn what's inside a `pipeline()` function and how to customize its behavior. 
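One form of customization you can already apply is choosing which checkpoint a pipeline uses and where it runs. A small sketch: the checkpoint name below is the usual default for sentiment analysis, and `device=0` assumes a GPU is available, so treat both as assumptions rather than requirements.

```python
from transformers import pipeline

# Pick an explicit checkpoint instead of the task default, and place it on
# the first GPU (use device=-1 to stay on the CPU)
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=0,
)
print(classifier("I can't wait to see what's inside the pipeline function!"))
```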
diff --git a/chapters/rum/chapter1/4.mdx b/chapters/rum/chapter1/4.mdx new file mode 100644 index 000000000..a44b4a1b1 --- /dev/null +++ b/chapters/rum/chapter1/4.mdx @@ -0,0 +1,178 @@ +# How do Transformers work?[[how-do-transformers-work]] + + + +In this section, we will take a high-level look at the architecture of Transformer models. + +## A bit of Transformer history[[a-bit-of-transformer-history]] + +Here are some reference points in the (short) history of Transformer models: + +
+A brief chronology of Transformer models. + +
+ +The [Transformer architecture](https://arxiv.org/abs/1706.03762) was introduced in June 2017. The focus of the original research was on translation tasks. This was followed by the introduction of several influential models, including: + +- **June 2018**: [GPT](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf), the first pretrained Transformer model, used for fine-tuning on various NLP tasks and obtained state-of-the-art results + +- **October 2018**: [BERT](https://arxiv.org/abs/1810.04805), another large pretrained model, this one designed to produce better summaries of sentences (more on this in the next chapter!) + +- **February 2019**: [GPT-2](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf), an improved (and bigger) version of GPT that was not immediately publicly released due to ethical concerns + +- **October 2019**: [DistilBERT](https://arxiv.org/abs/1910.01108), a distilled version of BERT that is 60% faster, 40% lighter in memory, and still retains 97% of BERT's performance + +- **October 2019**: [BART](https://arxiv.org/abs/1910.13461) and [T5](https://arxiv.org/abs/1910.10683), two large pretrained models using the same architecture as the original Transformer model (the first to do so) + +- **May 2020**, [GPT-3](https://arxiv.org/abs/2005.14165), an even bigger version of GPT-2 that is able to perform well on a variety of tasks without the need for fine-tuning (called _zero-shot learning_) + +This list is far from comprehensive, and is just meant to highlight a few of the different kinds of Transformer models. Broadly, they can be grouped into three categories: + +- GPT-like (also called _auto-regressive_ Transformer models) +- BERT-like (also called _auto-encoding_ Transformer models) +- BART/T5-like (also called _sequence-to-sequence_ Transformer models) + +We will dive into these families in more depth later on. + +## Transformers are language models[[transformers-are-language-models]] + +All the Transformer models mentioned above (GPT, BERT, BART, T5, etc.) have been trained as *language models*. This means they have been trained on large amounts of raw text in a self-supervised fashion. Self-supervised learning is a type of training in which the objective is automatically computed from the inputs of the model. That means that humans are not needed to label the data! + +This type of model develops a statistical understanding of the language it has been trained on, but it's not very useful for specific practical tasks. Because of this, the general pretrained model then goes through a process called *transfer learning*. During this process, the model is fine-tuned in a supervised way -- that is, using human-annotated labels -- on a given task. + +An example of a task is predicting the next word in a sentence having read the *n* previous words. This is called *causal language modeling* because the output depends on the past and present inputs, but not the future ones. + +
+Example of causal language modeling in which the next word from a sentence is predicted. + +
+ +Another example is *masked language modeling*, in which the model predicts a masked word in the sentence. + +
+Example of masked language modeling in which a masked word from a sentence is predicted. + +
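Both objectives map directly onto pipelines you used in the previous section, so you can try them out yourself. A quick sketch (the default checkpoints are downloaded automatically, `<mask>` is the mask token of the default fill-mask checkpoint, and the exact outputs will vary):

```python
from transformers import pipeline

# Causal language modeling: continue a prompt from left to right
generator = pipeline("text-generation")
print(generator("Language models are trained to"))

# Masked language modeling: predict a word hidden anywhere in the sentence
unmasker = pipeline("fill-mask")
print(unmasker("Language models are trained on large amounts of <mask> text.", top_k=2))
```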
+ +## Transformers are big models[[transformers-are-big-models]] + +Apart from a few outliers (like DistilBERT), the general strategy to achieve better performance is by increasing the models' sizes as well as the amount of data they are pretrained on. + +
+Number of parameters of recent Transformer models +
+ +Unfortunately, training a model, especially a large one, requires a large amount of data. This becomes very costly in terms of time and compute resources. It even translates to environmental impact, as can be seen in the following graph. + +
+The carbon footprint of a large language model. + +
+ + + +And this is showing a project for a (very big) model led by a team consciously trying to reduce the environmental impact of pretraining. The footprint of running lots of trials to get the best hyperparameters would be even higher. + +Imagine if each time a research team, a student organization, or a company wanted to train a model, it did so from scratch. This would lead to huge, unnecessary global costs! + +This is why sharing language models is paramount: sharing the trained weights and building on top of already trained weights reduces the overall compute cost and carbon footprint of the community. + +By the way, you can evaluate the carbon footprint of your models' training through several tools. For example [ML CO2 Impact](https://mlco2.github.io/impact/) or [Code Carbon]( https://codecarbon.io/) which is integrated in 🤗 Transformers. To learn more about this, you can read this [blog post](https://huggingface.co/blog/carbon-emissions-on-the-hub) which will show you how to generate an `emissions.csv` file with an estimate of the footprint of your training, as well as the [documentation](https://huggingface.co/docs/hub/model-cards-co2) of 🤗 Transformers addressing this topic. + + +## Transfer Learning[[transfer-learning]] + + + +*Pretraining* is the act of training a model from scratch: the weights are randomly initialized, and the training starts without any prior knowledge. + +
+The pretraining of a language model is costly in both time and money. + +
+ +This pretraining is usually done on very large amounts of data. Therefore, it requires a very large corpus of data, and training can take up to several weeks. + +*Fine-tuning*, on the other hand, is the training done **after** a model has been pretrained. To perform fine-tuning, you first acquire a pretrained language model, then perform additional training with a dataset specific to your task. Wait -- why not simply train the model for your final use case from the start (**scratch**)? There are a couple of reasons: + +* The pretrained model was already trained on a dataset that has some similarities with the fine-tuning dataset. The fine-tuning process is thus able to take advantage of knowledge acquired by the initial model during pretraining (for instance, with NLP problems, the pretrained model will have some kind of statistical understanding of the language you are using for your task). +* Since the pretrained model was already trained on lots of data, the fine-tuning requires way less data to get decent results. +* For the same reason, the amount of time and resources needed to get good results are much lower. + +For example, one could leverage a pretrained model trained on the English language and then fine-tune it on an arXiv corpus, resulting in a science/research-based model. The fine-tuning will only require a limited amount of data: the knowledge the pretrained model has acquired is "transferred," hence the term *transfer learning*. + +
+The fine-tuning of a language model is cheaper than pretraining in both time and money.
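+
+In code, the whole idea of transfer learning boils down to starting from saved weights instead of random ones. As a purely illustrative sketch (shown with the PyTorch `Auto*` classes that later chapters cover in detail; the checkpoint name and label count here are just examples), fine-tuning starts like this:
+
+```python
+from transformers import AutoModelForSequenceClassification
+
+# The pretrained body of the model is loaded from the checkpoint; only the
+# small classification head on top is newly (randomly) initialized for our task.
+model = AutoModelForSequenceClassification.from_pretrained(
+    "bert-base-cased", num_labels=2
+)
+```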
+ +Fine-tuning a model therefore has lower time, data, financial, and environmental costs. It is also quicker and easier to iterate over different fine-tuning schemes, as the training is less constraining than a full pretraining. + +This process will also achieve better results than training from scratch (unless you have lots of data), which is why you should always try to leverage a pretrained model -- one as close as possible to the task you have at hand -- and fine-tune it. + +## General architecture[[general-architecture]] + +In this section, we'll go over the general architecture of the Transformer model. Don't worry if you don't understand some of the concepts; there are detailed sections later covering each of the components. + + + +## Introduction[[introduction]] + +The model is primarily composed of two blocks: + +* **Encoder (left)**: The encoder receives an input and builds a representation of it (its features). This means that the model is optimized to acquire understanding from the input. +* **Decoder (right)**: The decoder uses the encoder's representation (features) along with other inputs to generate a target sequence. This means that the model is optimized for generating outputs. + +
+Architecture of a Transformer model
+ +Each of these parts can be used independently, depending on the task: + +* **Encoder-only models**: Good for tasks that require understanding of the input, such as sentence classification and named entity recognition. +* **Decoder-only models**: Good for generative tasks such as text generation. +* **Encoder-decoder models** or **sequence-to-sequence models**: Good for generative tasks that require an input, such as translation or summarization. + +We will dive into those architectures independently in later sections. + +## Attention layers[[attention-layers]] + +A key feature of Transformer models is that they are built with special layers called *attention layers*. In fact, the title of the paper introducing the Transformer architecture was ["Attention Is All You Need"](https://arxiv.org/abs/1706.03762)! We will explore the details of attention layers later in the course; for now, all you need to know is that this layer will tell the model to pay specific attention to certain words in the sentence you passed it (and more or less ignore the others) when dealing with the representation of each word. + +To put this into context, consider the task of translating text from English to French. Given the input "You like this course", a translation model will need to also attend to the adjacent word "You" to get the proper translation for the word "like", because in French the verb "like" is conjugated differently depending on the subject. The rest of the sentence, however, is not useful for the translation of that word. In the same vein, when translating "this" the model will also need to pay attention to the word "course", because "this" translates differently depending on whether the associated noun is masculine or feminine. Again, the other words in the sentence will not matter for the translation of "course". With more complex sentences (and more complex grammar rules), the model would need to pay special attention to words that might appear farther away in the sentence to properly translate each word. + +The same concept applies to any task associated with natural language: a word by itself has a meaning, but that meaning is deeply affected by the context, which can be any other word (or words) before or after the word being studied. + +Now that you have an idea of what attention layers are all about, let's take a closer look at the Transformer architecture. + +## The original architecture[[the-original-architecture]] + +The Transformer architecture was originally designed for translation. During training, the encoder receives inputs (sentences) in a certain language, while the decoder receives the same sentences in the desired target language. In the encoder, the attention layers can use all the words in a sentence (since, as we just saw, the translation of a given word can be dependent on what is after as well as before it in the sentence). The decoder, however, works sequentially and can only pay attention to the words in the sentence that it has already translated (so, only the words before the word currently being generated). For example, when we have predicted the first three words of the translated target, we give them to the decoder which then uses all the inputs of the encoder to try to predict the fourth word. 
+ +To speed things up during training (when the model has access to target sentences), the decoder is fed the whole target, but it is not allowed to use future words (if it had access to the word at position 2 when trying to predict the word at position 2, the problem would not be very hard!). For instance, when trying to predict the fourth word, the attention layer will only have access to the words in positions 1 to 3. + +The original Transformer architecture looked like this, with the encoder on the left and the decoder on the right: + +
+Architecture of a Transformer model
+ +Note that the first attention layer in a decoder block pays attention to all (past) inputs to the decoder, but the second attention layer uses the output of the encoder. It can thus access the whole input sentence to best predict the current word. This is very useful as different languages can have grammatical rules that put the words in different orders, or some context provided later in the sentence may be helpful to determine the best translation of a given word. + +The *attention mask* can also be used in the encoder/decoder to prevent the model from paying attention to some special words -- for instance, the special padding word used to make all the inputs the same length when batching together sentences. + +## Architectures vs. checkpoints[[architecture-vs-checkpoints]] + +As we dive into Transformer models in this course, you'll see mentions of *architectures* and *checkpoints* as well as *models*. These terms all have slightly different meanings: + +* **Architecture**: This is the skeleton of the model -- the definition of each layer and each operation that happens within the model. +* **Checkpoints**: These are the weights that will be loaded in a given architecture. +* **Model**: This is an umbrella term that isn't as precise as "architecture" or "checkpoint": it can mean both. This course will specify *architecture* or *checkpoint* when it matters to reduce ambiguity. + +For example, BERT is an architecture while `bert-base-cased`, a set of weights trained by the Google team for the first release of BERT, is a checkpoint. However, one can say "the BERT model" and "the `bert-base-cased` model." diff --git a/chapters/rum/chapter1/5.mdx b/chapters/rum/chapter1/5.mdx new file mode 100644 index 000000000..89694ee83 --- /dev/null +++ b/chapters/rum/chapter1/5.mdx @@ -0,0 +1,22 @@ +# Encoder models[[encoder-models]] + + + + + +Encoder models use only the encoder of a Transformer model. At each stage, the attention layers can access all the words in the initial sentence. These models are often characterized as having "bi-directional" attention, and are often called *auto-encoding models*. + +The pretraining of these models usually revolves around somehow corrupting a given sentence (for instance, by masking random words in it) and tasking the model with finding or reconstructing the initial sentence. + +Encoder models are best suited for tasks requiring an understanding of the full sentence, such as sentence classification, named entity recognition (and more generally word classification), and extractive question answering. + +Representatives of this family of models include: + +- [ALBERT](https://huggingface.co/docs/transformers/model_doc/albert) +- [BERT](https://huggingface.co/docs/transformers/model_doc/bert) +- [DistilBERT](https://huggingface.co/docs/transformers/model_doc/distilbert) +- [ELECTRA](https://huggingface.co/docs/transformers/model_doc/electra) +- [RoBERTa](https://huggingface.co/docs/transformers/model_doc/roberta) diff --git a/chapters/rum/chapter1/6.mdx b/chapters/rum/chapter1/6.mdx new file mode 100644 index 000000000..b0f4ba09c --- /dev/null +++ b/chapters/rum/chapter1/6.mdx @@ -0,0 +1,21 @@ +# Decoder models[[decoder-models]] + + + + + +Decoder models use only the decoder of a Transformer model. At each stage, for a given word the attention layers can only access the words positioned before it in the sentence. These models are often called *auto-regressive models*. 
+ +The pretraining of decoder models usually revolves around predicting the next word in the sentence. + +These models are best suited for tasks involving text generation. + +Representatives of this family of models include: + +- [CTRL](https://huggingface.co/transformers/model_doc/ctrl) +- [GPT](https://huggingface.co/docs/transformers/model_doc/openai-gpt) +- [GPT-2](https://huggingface.co/transformers/model_doc/gpt2) +- [Transformer XL](https://huggingface.co/transformers/model_doc/transfo-xl) diff --git a/chapters/rum/chapter1/7.mdx b/chapters/rum/chapter1/7.mdx new file mode 100644 index 000000000..e39c5ca8e --- /dev/null +++ b/chapters/rum/chapter1/7.mdx @@ -0,0 +1,21 @@ +# Sequence-to-sequence models[sequence-to-sequence-models] + + + + + +Encoder-decoder models (also called *sequence-to-sequence models*) use both parts of the Transformer architecture. At each stage, the attention layers of the encoder can access all the words in the initial sentence, whereas the attention layers of the decoder can only access the words positioned before a given word in the input. + +The pretraining of these models can be done using the objectives of encoder or decoder models, but usually involves something a bit more complex. For instance, [T5](https://huggingface.co/t5-base) is pretrained by replacing random spans of text (that can contain several words) with a single mask special word, and the objective is then to predict the text that this mask word replaces. + +Sequence-to-sequence models are best suited for tasks revolving around generating new sentences depending on a given input, such as summarization, translation, or generative question answering. + +Representatives of this family of models include: + +- [BART](https://huggingface.co/transformers/model_doc/bart) +- [mBART](https://huggingface.co/transformers/model_doc/mbart) +- [Marian](https://huggingface.co/transformers/model_doc/marian) +- [T5](https://huggingface.co/transformers/model_doc/t5) diff --git a/chapters/rum/chapter1/8.mdx b/chapters/rum/chapter1/8.mdx new file mode 100644 index 000000000..b5082b85e --- /dev/null +++ b/chapters/rum/chapter1/8.mdx @@ -0,0 +1,32 @@ +# Bias and limitations[[bias-and-limitations]] + + + +If your intent is to use a pretrained model or a fine-tuned version in production, please be aware that, while these models are powerful tools, they come with limitations. The biggest of these is that, to enable pretraining on large amounts of data, researchers often scrape all the content they can find, taking the best as well as the worst of what is available on the internet. + +To give a quick illustration, let's go back the example of a `fill-mask` pipeline with the BERT model: + +```python +from transformers import pipeline + +unmasker = pipeline("fill-mask", model="bert-base-uncased") +result = unmasker("This man works as a [MASK].") +print([r["token_str"] for r in result]) + +result = unmasker("This woman works as a [MASK].") +print([r["token_str"] for r in result]) +``` + +```python out +['lawyer', 'carpenter', 'doctor', 'waiter', 'mechanic'] +['nurse', 'waitress', 'teacher', 'maid', 'prostitute'] +``` + +When asked to fill in the missing word in these two sentences, the model gives only one gender-free answer (waiter/waitress). The others are work occupations usually associated with one specific gender -- and yes, prostitute ended up in the top 5 possibilities the model associates with "woman" and "work." 
This happens even though BERT is one of the rare Transformer models not built by scraping data from all over the internet, but rather using apparently neutral data (it's trained on the [English Wikipedia](https://huggingface.co/datasets/wikipedia) and [BookCorpus](https://huggingface.co/datasets/bookcorpus) datasets). + +When you use these tools, you therefore need to keep in the back of your mind that the original model you are using could very easily generate sexist, racist, or homophobic content. Fine-tuning the model on your data won't make this intrinsic bias disappear. diff --git a/chapters/rum/chapter1/9.mdx b/chapters/rum/chapter1/9.mdx new file mode 100644 index 000000000..a49dad953 --- /dev/null +++ b/chapters/rum/chapter1/9.mdx @@ -0,0 +1,16 @@ +# Summary[[summary]] + + + +In this chapter, you saw how to approach different NLP tasks using the high-level `pipeline()` function from 🤗 Transformers. You also saw how to search for and use models in the Hub, as well as how to use the Inference API to test the models directly in your browser. + +We discussed how Transformer models work at a high level, and talked about the importance of transfer learning and fine-tuning. A key aspect is that you can use the full architecture or only the encoder or decoder, depending on what kind of task you aim to solve. The following table summarizes this: + +| Model | Examples | Tasks | +|-----------------|--------------------------------------------|----------------------------------------------------------------------------------| +| Encoder | ALBERT, BERT, DistilBERT, ELECTRA, RoBERTa | Sentence classification, named entity recognition, extractive question answering | +| Decoder | CTRL, GPT, GPT-2, Transformer XL | Text generation | +| Encoder-decoder | BART, T5, Marian, mBART | Summarization, translation, generative question answering | diff --git a/chapters/rum/chapter2/1.mdx b/chapters/rum/chapter2/1.mdx new file mode 100644 index 000000000..16347ca94 --- /dev/null +++ b/chapters/rum/chapter2/1.mdx @@ -0,0 +1,25 @@ +# Introduction[[introduction]] + + + +As you saw in [Chapter 1](/course/chapter1), Transformer models are usually very large. With millions to tens of *billions* of parameters, training and deploying these models is a complicated undertaking. Furthermore, with new models being released on a near-daily basis and each having its own implementation, trying them all out is no easy task. + +The 🤗 Transformers library was created to solve this problem. Its goal is to provide a single API through which any Transformer model can be loaded, trained, and saved. The library's main features are: + +- **Ease of use**: Downloading, loading, and using a state-of-the-art NLP model for inference can be done in just two lines of code. +- **Flexibility**: At their core, all models are simple PyTorch `nn.Module` or TensorFlow `tf.keras.Model` classes and can be handled like any other models in their respective machine learning (ML) frameworks. +- **Simplicity**: Hardly any abstractions are made across the library. The "All in one file" is a core concept: a model's forward pass is entirely defined in a single file, so that the code itself is understandable and hackable. + +This last feature makes 🤗 Transformers quite different from other ML libraries. The models are not built on modules +that are shared across files; instead, each model has its own layers. In addition to making the models more approachable and understandable, this allows you to easily experiment on one model without affecting others. 
+ +This chapter will begin with an end-to-end example where we use a model and a tokenizer together to replicate the `pipeline()` function introduced in [Chapter 1](/course/chapter1). Next, we'll discuss the model API: we'll dive into the model and configuration classes, and show you how to load a model and how it processes numerical inputs to output predictions. + +Then we'll look at the tokenizer API, which is the other main component of the `pipeline()` function. Tokenizers take care of the first and last processing steps, handling the conversion from text to numerical inputs for the neural network, and the conversion back to text when it is needed. Finally, we'll show you how to handle sending multiple sentences through a model in a prepared batch, then wrap it all up with a closer look at the high-level `tokenizer()` function. + + +⚠️ In order to benefit from all features available with the Model Hub and 🤗 Transformers, we recommend creating an account. + \ No newline at end of file diff --git a/chapters/rum/chapter2/2.mdx b/chapters/rum/chapter2/2.mdx new file mode 100644 index 000000000..2a35669d7 --- /dev/null +++ b/chapters/rum/chapter2/2.mdx @@ -0,0 +1,353 @@ + + +# Behind the pipeline[[behind-the-pipeline]] + +{#if fw === 'pt'} + + + +{:else} + + + +{/if} + + +This is the first section where the content is slightly different depending on whether you use PyTorch or TensorFlow. Toggle the switch on top of the title to select the platform you prefer! + + +{#if fw === 'pt'} + +{:else} + +{/if} + +Let's start with a complete example, taking a look at what happened behind the scenes when we executed the following code in [Chapter 1](/course/chapter1): + +```python +from transformers import pipeline + +classifier = pipeline("sentiment-analysis") +classifier( + [ + "I've been waiting for a HuggingFace course my whole life.", + "I hate this so much!", + ] +) +``` + +and obtained: + +```python out +[{'label': 'POSITIVE', 'score': 0.9598047137260437}, + {'label': 'NEGATIVE', 'score': 0.9994558095932007}] +``` + +As we saw in [Chapter 1](/course/chapter1), this pipeline groups together three steps: preprocessing, passing the inputs through the model, and postprocessing: + +
+The full NLP pipeline: tokenization of text, conversion to IDs, and inference through the Transformer model and the model head.
+ +Let's quickly go over each of these. + +## Preprocessing with a tokenizer[[preprocessing-with-a-tokenizer]] + +Like other neural networks, Transformer models can't process raw text directly, so the first step of our pipeline is to convert the text inputs into numbers that the model can make sense of. To do this we use a *tokenizer*, which will be responsible for: + +- Splitting the input into words, subwords, or symbols (like punctuation) that are called *tokens* +- Mapping each token to an integer +- Adding additional inputs that may be useful to the model + +All this preprocessing needs to be done in exactly the same way as when the model was pretrained, so we first need to download that information from the [Model Hub](https://huggingface.co/models). To do this, we use the `AutoTokenizer` class and its `from_pretrained()` method. Using the checkpoint name of our model, it will automatically fetch the data associated with the model's tokenizer and cache it (so it's only downloaded the first time you run the code below). + +Since the default checkpoint of the `sentiment-analysis` pipeline is `distilbert-base-uncased-finetuned-sst-2-english` (you can see its model card [here](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)), we run the following: + +```python +from transformers import AutoTokenizer + +checkpoint = "distilbert-base-uncased-finetuned-sst-2-english" +tokenizer = AutoTokenizer.from_pretrained(checkpoint) +``` + +Once we have the tokenizer, we can directly pass our sentences to it and we'll get back a dictionary that's ready to feed to our model! The only thing left to do is to convert the list of input IDs to tensors. + +You can use 🤗 Transformers without having to worry about which ML framework is used as a backend; it might be PyTorch or TensorFlow, or Flax for some models. However, Transformer models only accept *tensors* as input. If this is your first time hearing about tensors, you can think of them as NumPy arrays instead. A NumPy array can be a scalar (0D), a vector (1D), a matrix (2D), or have more dimensions. It's effectively a tensor; other ML frameworks' tensors behave similarly, and are usually as simple to instantiate as NumPy arrays. + +To specify the type of tensors we want to get back (PyTorch, TensorFlow, or plain NumPy), we use the `return_tensors` argument: + +{#if fw === 'pt'} +```python +raw_inputs = [ + "I've been waiting for a HuggingFace course my whole life.", + "I hate this so much!", +] +inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt") +print(inputs) +``` +{:else} +```python +raw_inputs = [ + "I've been waiting for a HuggingFace course my whole life.", + "I hate this so much!", +] +inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="tf") +print(inputs) +``` +{/if} + +Don't worry about padding and truncation just yet; we'll explain those later. The main things to remember here are that you can pass one sentence or a list of sentences, as well as specifying the type of tensors you want to get back (if no type is passed, you will get a list of lists as a result). 
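+
+For instance, as a quick illustrative check (reusing the `tokenizer` and `raw_inputs` defined above; the variable name below is just for this example), calling the tokenizer without `return_tensors` hands back plain Python lists:
+
+```python
+# Without return_tensors, the values are plain lists of lists rather than tensors
+inputs_as_lists = tokenizer(raw_inputs, padding=True, truncation=True)
+print(type(inputs_as_lists["input_ids"]))
+```
+
+```python out
+<class 'list'>
+```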
+ +{#if fw === 'pt'} + +Here's what the results look like as PyTorch tensors: + +```python out +{ + 'input_ids': tensor([ + [ 101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102], + [ 101, 1045, 5223, 2023, 2061, 2172, 999, 102, 0, 0, 0, 0, 0, 0, 0, 0] + ]), + 'attention_mask': tensor([ + [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], + [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0] + ]) +} +``` +{:else} + +Here's what the results look like as TensorFlow tensors: + +```python out +{ + 'input_ids': , + 'attention_mask': +} +``` +{/if} + +The output itself is a dictionary containing two keys, `input_ids` and `attention_mask`. `input_ids` contains two rows of integers (one for each sentence) that are the unique identifiers of the tokens in each sentence. We'll explain what the `attention_mask` is later in this chapter. + +## Going through the model[[going-through-the-model]] + +{#if fw === 'pt'} +We can download our pretrained model the same way we did with our tokenizer. 🤗 Transformers provides an `AutoModel` class which also has a `from_pretrained()` method: + +```python +from transformers import AutoModel + +checkpoint = "distilbert-base-uncased-finetuned-sst-2-english" +model = AutoModel.from_pretrained(checkpoint) +``` +{:else} +We can download our pretrained model the same way we did with our tokenizer. 🤗 Transformers provides an `TFAutoModel` class which also has a `from_pretrained` method: + +```python +from transformers import TFAutoModel + +checkpoint = "distilbert-base-uncased-finetuned-sst-2-english" +model = TFAutoModel.from_pretrained(checkpoint) +``` +{/if} + +In this code snippet, we have downloaded the same checkpoint we used in our pipeline before (it should actually have been cached already) and instantiated a model with it. + +This architecture contains only the base Transformer module: given some inputs, it outputs what we'll call *hidden states*, also known as *features*. For each model input, we'll retrieve a high-dimensional vector representing the **contextual understanding of that input by the Transformer model**. + +If this doesn't make sense, don't worry about it. We'll explain it all later. + +While these hidden states can be useful on their own, they're usually inputs to another part of the model, known as the *head*. In [Chapter 1](/course/chapter1), the different tasks could have been performed with the same architecture, but each of these tasks will have a different head associated with it. + +### A high-dimensional vector?[[a-high-dimensional-vector]] + +The vector output by the Transformer module is usually large. It generally has three dimensions: + +- **Batch size**: The number of sequences processed at a time (2 in our example). +- **Sequence length**: The length of the numerical representation of the sequence (16 in our example). +- **Hidden size**: The vector dimension of each model input. + +It is said to be "high dimensional" because of the last value. The hidden size can be very large (768 is common for smaller models, and in larger models this can reach 3072 or more). 
+ +We can see this if we feed the inputs we preprocessed to our model: + +{#if fw === 'pt'} +```python +outputs = model(**inputs) +print(outputs.last_hidden_state.shape) +``` + +```python out +torch.Size([2, 16, 768]) +``` +{:else} +```py +outputs = model(inputs) +print(outputs.last_hidden_state.shape) +``` + +```python out +(2, 16, 768) +``` +{/if} + +Note that the outputs of 🤗 Transformers models behave like `namedtuple`s or dictionaries. You can access the elements by attributes (like we did) or by key (`outputs["last_hidden_state"]`), or even by index if you know exactly where the thing you are looking for is (`outputs[0]`). + +### Model heads: Making sense out of numbers[[model-heads-making-sense-out-of-numbers]] + +The model heads take the high-dimensional vector of hidden states as input and project them onto a different dimension. They are usually composed of one or a few linear layers: + +
+A Transformer network alongside its head.
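+
+To make "one or a few linear layers" concrete, here is a minimal, purely illustrative sketch of what a sequence classification head can look like (the layer sizes assume a hidden size of 768 and two labels; the heads shipped in the library may add an extra projection or different dropout):
+
+{#if fw === 'pt'}
+```python
+import torch.nn as nn
+
+# Project the 768-dimensional hidden state down to 2 label scores
+classification_head = nn.Sequential(
+    nn.Dropout(0.1),
+    nn.Linear(768, 2),
+)
+```
+{:else}
+```python
+import tensorflow as tf
+
+# Project the 768-dimensional hidden state down to 2 label scores
+classification_head = tf.keras.Sequential(
+    [
+        tf.keras.layers.Dropout(0.1),
+        tf.keras.layers.Dense(2),
+    ]
+)
+```
+{/if}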
+ +The output of the Transformer model is sent directly to the model head to be processed. + +In this diagram, the model is represented by its embeddings layer and the subsequent layers. The embeddings layer converts each input ID in the tokenized input into a vector that represents the associated token. The subsequent layers manipulate those vectors using the attention mechanism to produce the final representation of the sentences. + +There are many different architectures available in 🤗 Transformers, with each one designed around tackling a specific task. Here is a non-exhaustive list: + +- `*Model` (retrieve the hidden states) +- `*ForCausalLM` +- `*ForMaskedLM` +- `*ForMultipleChoice` +- `*ForQuestionAnswering` +- `*ForSequenceClassification` +- `*ForTokenClassification` +- and others 🤗 + +{#if fw === 'pt'} +For our example, we will need a model with a sequence classification head (to be able to classify the sentences as positive or negative). So, we won't actually use the `AutoModel` class, but `AutoModelForSequenceClassification`: + +```python +from transformers import AutoModelForSequenceClassification + +checkpoint = "distilbert-base-uncased-finetuned-sst-2-english" +model = AutoModelForSequenceClassification.from_pretrained(checkpoint) +outputs = model(**inputs) +``` +{:else} +For our example, we will need a model with a sequence classification head (to be able to classify the sentences as positive or negative). So, we won't actually use the `TFAutoModel` class, but `TFAutoModelForSequenceClassification`: + +```python +from transformers import TFAutoModelForSequenceClassification + +checkpoint = "distilbert-base-uncased-finetuned-sst-2-english" +model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint) +outputs = model(inputs) +``` +{/if} + +Now if we look at the shape of our outputs, the dimensionality will be much lower: the model head takes as input the high-dimensional vectors we saw before, and outputs vectors containing two values (one per label): + +```python +print(outputs.logits.shape) +``` + +{#if fw === 'pt'} +```python out +torch.Size([2, 2]) +``` +{:else} +```python out +(2, 2) +``` +{/if} + +Since we have just two sentences and two labels, the result we get from our model is of shape 2 x 2. + +## Postprocessing the output[[postprocessing-the-output]] + +The values we get as output from our model don't necessarily make sense by themselves. Let's take a look: + +```python +print(outputs.logits) +``` + +{#if fw === 'pt'} +```python out +tensor([[-1.5607, 1.6123], + [ 4.1692, -3.3464]], grad_fn=) +``` +{:else} +```python out + +``` +{/if} + +Our model predicted `[-1.5607, 1.6123]` for the first sentence and `[ 4.1692, -3.3464]` for the second one. Those are not probabilities but *logits*, the raw, unnormalized scores outputted by the last layer of the model. 
To be converted to probabilities, they need to go through a [SoftMax](https://en.wikipedia.org/wiki/Softmax_function) layer (all 🤗 Transformers models output the logits, as the loss function for training will generally fuse the last activation function, such as SoftMax, with the actual loss function, such as cross entropy): + +{#if fw === 'pt'} +```py +import torch + +predictions = torch.nn.functional.softmax(outputs.logits, dim=-1) +print(predictions) +``` +{:else} +```py +import tensorflow as tf + +predictions = tf.math.softmax(outputs.logits, axis=-1) +print(predictions) +``` +{/if} + +{#if fw === 'pt'} +```python out +tensor([[4.0195e-02, 9.5980e-01], + [9.9946e-01, 5.4418e-04]], grad_fn=) +``` +{:else} +```python out +tf.Tensor( +[[4.01951671e-02 9.59804833e-01] + [9.9945587e-01 5.4418424e-04]], shape=(2, 2), dtype=float32) +``` +{/if} + +Now we can see that the model predicted `[0.0402, 0.9598]` for the first sentence and `[0.9995, 0.0005]` for the second one. These are recognizable probability scores. + +To get the labels corresponding to each position, we can inspect the `id2label` attribute of the model config (more on this in the next section): + +```python +model.config.id2label +``` + +```python out +{0: 'NEGATIVE', 1: 'POSITIVE'} +``` + +Now we can conclude that the model predicted the following: + +- First sentence: NEGATIVE: 0.0402, POSITIVE: 0.9598 +- Second sentence: NEGATIVE: 0.9995, POSITIVE: 0.0005 + +We have successfully reproduced the three steps of the pipeline: preprocessing with tokenizers, passing the inputs through the model, and postprocessing! Now let's take some time to dive deeper into each of those steps. + + + +✏️ **Try it out!** Choose two (or more) texts of your own and run them through the `sentiment-analysis` pipeline. Then replicate the steps you saw here yourself and check that you obtain the same results! + + diff --git a/chapters/rum/chapter2/3.mdx b/chapters/rum/chapter2/3.mdx new file mode 100644 index 000000000..acc653704 --- /dev/null +++ b/chapters/rum/chapter2/3.mdx @@ -0,0 +1,228 @@ + + +# Models[[models]] + +{#if fw === 'pt'} + + + +{:else} + + + +{/if} + +{#if fw === 'pt'} + +{:else} + +{/if} + +{#if fw === 'pt'} +In this section we'll take a closer look at creating and using a model. We'll use the `AutoModel` class, which is handy when you want to instantiate any model from a checkpoint. + +The `AutoModel` class and all of its relatives are actually simple wrappers over the wide variety of models available in the library. It's a clever wrapper as it can automatically guess the appropriate model architecture for your checkpoint, and then instantiates a model with this architecture. + +{:else} +In this section we'll take a closer look at creating and using a model. We'll use the `TFAutoModel` class, which is handy when you want to instantiate any model from a checkpoint. + +The `TFAutoModel` class and all of its relatives are actually simple wrappers over the wide variety of models available in the library. It's a clever wrapper as it can automatically guess the appropriate model architecture for your checkpoint, and then instantiates a model with this architecture. + +{/if} + +However, if you know the type of model you want to use, you can use the class that defines its architecture directly. Let's take a look at how this works with a BERT model. 
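+
+As a quick illustration of that "clever wrapper" behavior before we dive in, here is a small sketch (using the `bert-base-cased` checkpoint that appears later in this section): the `Auto*` class reads the checkpoint's configuration and hands back an instance of the matching architecture class.
+
+{#if fw === 'pt'}
+```python
+from transformers import AutoModel
+
+# AutoModel inspects the checkpoint's configuration and picks the right class
+model = AutoModel.from_pretrained("bert-base-cased")
+print(type(model).__name__)  # BertModel
+```
+{:else}
+```python
+from transformers import TFAutoModel
+
+# TFAutoModel inspects the checkpoint's configuration and picks the right class
+model = TFAutoModel.from_pretrained("bert-base-cased")
+print(type(model).__name__)  # TFBertModel
+```
+{/if}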
+ +## Creating a Transformer[[creating-a-transformer]] + +The first thing we'll need to do to initialize a BERT model is load a configuration object: + +{#if fw === 'pt'} +```py +from transformers import BertConfig, BertModel + +# Building the config +config = BertConfig() + +# Building the model from the config +model = BertModel(config) +``` +{:else} +```py +from transformers import BertConfig, TFBertModel + +# Building the config +config = BertConfig() + +# Building the model from the config +model = TFBertModel(config) +``` +{/if} + +The configuration contains many attributes that are used to build the model: + +```py +print(config) +``` + +```python out +BertConfig { + [...] + "hidden_size": 768, + "intermediate_size": 3072, + "max_position_embeddings": 512, + "num_attention_heads": 12, + "num_hidden_layers": 12, + [...] +} +``` + +While you haven't seen what all of these attributes do yet, you should recognize some of them: the `hidden_size` attribute defines the size of the `hidden_states` vector, and `num_hidden_layers` defines the number of layers the Transformer model has. + +### Different loading methods[[different-loading-methods]] + +Creating a model from the default configuration initializes it with random values: + +{#if fw === 'pt'} +```py +from transformers import BertConfig, BertModel + +config = BertConfig() +model = BertModel(config) + +# Model is randomly initialized! +``` +{:else} +```py +from transformers import BertConfig, TFBertModel + +config = BertConfig() +model = TFBertModel(config) + +# Model is randomly initialized! +``` +{/if} + +The model can be used in this state, but it will output gibberish; it needs to be trained first. We could train the model from scratch on the task at hand, but as you saw in [Chapter 1](/course/chapter1), this would require a long time and a lot of data, and it would have a non-negligible environmental impact. To avoid unnecessary and duplicated effort, it's imperative to be able to share and reuse models that have already been trained. + +Loading a Transformer model that is already trained is simple — we can do this using the `from_pretrained()` method: + +{#if fw === 'pt'} +```py +from transformers import BertModel + +model = BertModel.from_pretrained("bert-base-cased") +``` + +As you saw earlier, we could replace `BertModel` with the equivalent `AutoModel` class. We'll do this from now on as this produces checkpoint-agnostic code; if your code works for one checkpoint, it should work seamlessly with another. This applies even if the architecture is different, as long as the checkpoint was trained for a similar task (for example, a sentiment analysis task). + +{:else} +```py +from transformers import TFBertModel + +model = TFBertModel.from_pretrained("bert-base-cased") +``` + +As you saw earlier, we could replace `TFBertModel` with the equivalent `TFAutoModel` class. We'll do this from now on as this produces checkpoint-agnostic code; if your code works for one checkpoint, it should work seamlessly with another. This applies even if the architecture is different, as long as the checkpoint was trained for a similar task (for example, a sentiment analysis task). + +{/if} + +In the code sample above we didn't use `BertConfig`, and instead loaded a pretrained model via the `bert-base-cased` identifier. This is a model checkpoint that was trained by the authors of BERT themselves; you can find more details about it in its [model card](https://huggingface.co/bert-base-cased). 
+ +This model is now initialized with all the weights of the checkpoint. It can be used directly for inference on the tasks it was trained on, and it can also be fine-tuned on a new task. By training with pretrained weights rather than from scratch, we can quickly achieve good results. + +The weights have been downloaded and cached (so future calls to the `from_pretrained()` method won't re-download them) in the cache folder, which defaults to *~/.cache/huggingface/transformers*. You can customize your cache folder by setting the `HF_HOME` environment variable. + +The identifier used to load the model can be the identifier of any model on the Model Hub, as long as it is compatible with the BERT architecture. The entire list of available BERT checkpoints can be found [here](https://huggingface.co/models?filter=bert). + +### Saving methods[[saving-methods]] + +Saving a model is as easy as loading one — we use the `save_pretrained()` method, which is analogous to the `from_pretrained()` method: + +```py +model.save_pretrained("directory_on_my_computer") +``` + +This saves two files to your disk: + +{#if fw === 'pt'} +``` +ls directory_on_my_computer + +config.json pytorch_model.bin +``` +{:else} +``` +ls directory_on_my_computer + +config.json tf_model.h5 +``` +{/if} + +If you take a look at the *config.json* file, you'll recognize the attributes necessary to build the model architecture. This file also contains some metadata, such as where the checkpoint originated and what 🤗 Transformers version you were using when you last saved the checkpoint. + +{#if fw === 'pt'} +The *pytorch_model.bin* file is known as the *state dictionary*; it contains all your model's weights. The two files go hand in hand; the configuration is necessary to know your model's architecture, while the model weights are your model's parameters. + +{:else} +The *tf_model.h5* file is known as the *state dictionary*; it contains all your model's weights. The two files go hand in hand; the configuration is necessary to know your model's architecture, while the model weights are your model's parameters. + +{/if} + +## Using a Transformer model for inference[[using-a-transformer-model-for-inference]] + +Now that you know how to load and save a model, let's try using it to make some predictions. Transformer models can only process numbers — numbers that the tokenizer generates. But before we discuss tokenizers, let's explore what inputs the model accepts. + +Tokenizers can take care of casting the inputs to the appropriate framework's tensors, but to help you understand what's going on, we'll take a quick look at what must be done before sending the inputs to the model. + +Let's say we have a couple of sequences: + +```py +sequences = ["Hello!", "Cool.", "Nice!"] +``` + +The tokenizer converts these to vocabulary indices which are typically called *input IDs*. Each sequence is now a list of numbers! The resulting output is: + +```py no-format +encoded_sequences = [ + [101, 7592, 999, 102], + [101, 4658, 1012, 102], + [101, 3835, 999, 102], +] +``` + +This is a list of encoded sequences: a list of lists. Tensors only accept rectangular shapes (think matrices). 
This "array" is already of rectangular shape, so converting it to a tensor is easy: + +{#if fw === 'pt'} +```py +import torch + +model_inputs = torch.tensor(encoded_sequences) +``` +{:else} +```py +import tensorflow as tf + +model_inputs = tf.constant(encoded_sequences) +``` +{/if} + +### Using the tensors as inputs to the model[[using-the-tensors-as-inputs-to-the-model]] + +Making use of the tensors with the model is extremely simple — we just call the model with the inputs: + +```py +output = model(model_inputs) +``` + +While the model accepts a lot of different arguments, only the input IDs are necessary. We'll explain what the other arguments do and when they are required later, +but first we need to take a closer look at the tokenizers that build the inputs that a Transformer model can understand. diff --git a/chapters/rum/chapter2/4.mdx b/chapters/rum/chapter2/4.mdx new file mode 100644 index 000000000..30167ddbd --- /dev/null +++ b/chapters/rum/chapter2/4.mdx @@ -0,0 +1,240 @@ + + +# Tokenizers[[tokenizers]] + +{#if fw === 'pt'} + + + +{:else} + + + +{/if} + + + +Tokenizers are one of the core components of the NLP pipeline. They serve one purpose: to translate text into data that can be processed by the model. Models can only process numbers, so tokenizers need to convert our text inputs to numerical data. In this section, we'll explore exactly what happens in the tokenization pipeline. + +In NLP tasks, the data that is generally processed is raw text. Here's an example of such text: + +``` +Jim Henson was a puppeteer +``` + +However, models can only process numbers, so we need to find a way to convert the raw text to numbers. That's what the tokenizers do, and there are a lot of ways to go about this. The goal is to find the most meaningful representation — that is, the one that makes the most sense to the model — and, if possible, the smallest representation. + +Let's take a look at some examples of tokenization algorithms, and try to answer some of the questions you may have about tokenization. + +## Word-based[[word-based]] + + + +The first type of tokenizer that comes to mind is _word-based_. It's generally very easy to set up and use with only a few rules, and it often yields decent results. For example, in the image below, the goal is to split the raw text into words and find a numerical representation for each of them: + +
+ An example of word-based tokenization.
+ +There are different ways to split the text. For example, we could use whitespace to tokenize the text into words by applying Python's `split()` function: + +```py +tokenized_text = "Jim Henson was a puppeteer".split() +print(tokenized_text) +``` + +```python out +['Jim', 'Henson', 'was', 'a', 'puppeteer'] +``` + +There are also variations of word tokenizers that have extra rules for punctuation. With this kind of tokenizer, we can end up with some pretty large "vocabularies," where a vocabulary is defined by the total number of independent tokens that we have in our corpus. + +Each word gets assigned an ID, starting from 0 and going up to the size of the vocabulary. The model uses these IDs to identify each word. + +If we want to completely cover a language with a word-based tokenizer, we'll need to have an identifier for each word in the language, which will generate a huge amount of tokens. For example, there are over 500,000 words in the English language, so to build a map from each word to an input ID we'd need to keep track of that many IDs. Furthermore, words like "dog" are represented differently from words like "dogs", and the model will initially have no way of knowing that "dog" and "dogs" are similar: it will identify the two words as unrelated. The same applies to other similar words, like "run" and "running", which the model will not see as being similar initially. + +Finally, we need a custom token to represent words that are not in our vocabulary. This is known as the "unknown" token, often represented as "[UNK]" or "<unk>". It's generally a bad sign if you see that the tokenizer is producing a lot of these tokens, as it wasn't able to retrieve a sensible representation of a word and you're losing information along the way. The goal when crafting the vocabulary is to do it in such a way that the tokenizer tokenizes as few words as possible into the unknown token. + +One way to reduce the amount of unknown tokens is to go one level deeper, using a _character-based_ tokenizer. + +## Character-based[[character-based]] + + + +Character-based tokenizers split the text into characters, rather than words. This has two primary benefits: + +- The vocabulary is much smaller. +- There are much fewer out-of-vocabulary (unknown) tokens, since every word can be built from characters. + +But here too some questions arise concerning spaces and punctuation: + +
+ An example of character-based tokenization.
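+
+To make this concrete, here is a minimal sketch of character-level splitting in plain Python (an illustration only, not a tokenizer you would use in practice):
+
+```py
+text = "Jim Henson was a puppeteer"
+
+# Character-based tokenization: every character, including spaces, becomes a token
+tokenized_text = list(text)
+print(tokenized_text[:10])
+```
+
+```python out
+['J', 'i', 'm', ' ', 'H', 'e', 'n', 's', 'o', 'n']
+```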
+ +This approach isn't perfect either. Since the representation is now based on characters rather than words, one could argue that, intuitively, it's less meaningful: each character doesn't mean a lot on its own, whereas that is the case with words. However, this again differs according to the language; in Chinese, for example, each character carries more information than a character in a Latin language. + +Another thing to consider is that we'll end up with a very large amount of tokens to be processed by our model: whereas a word would only be a single token with a word-based tokenizer, it can easily turn into 10 or more tokens when converted into characters. + +To get the best of both worlds, we can use a third technique that combines the two approaches: *subword tokenization*. + +## Subword tokenization[[subword-tokenization]] + + + +Subword tokenization algorithms rely on the principle that frequently used words should not be split into smaller subwords, but rare words should be decomposed into meaningful subwords. + +For instance, "annoyingly" might be considered a rare word and could be decomposed into "annoying" and "ly". These are both likely to appear more frequently as standalone subwords, while at the same time the meaning of "annoyingly" is kept by the composite meaning of "annoying" and "ly". + +Here is an example showing how a subword tokenization algorithm would tokenize the sequence "Let's do tokenization!": + +
+ A subword tokenization algorithm.
+ +These subwords end up providing a lot of semantic meaning: for instance, in the example above "tokenization" was split into "token" and "ization", two tokens that have a semantic meaning while being space-efficient (only two tokens are needed to represent a long word). This allows us to have relatively good coverage with small vocabularies, and close to no unknown tokens. + +This approach is especially useful in agglutinative languages such as Turkish, where you can form (almost) arbitrarily long complex words by stringing together subwords. + +### And more![[and-more]] + +Unsurprisingly, there are many more techniques out there. To name a few: + +- Byte-level BPE, as used in GPT-2 +- WordPiece, as used in BERT +- SentencePiece or Unigram, as used in several multilingual models + +You should now have sufficient knowledge of how tokenizers work to get started with the API. + +## Loading and saving[[loading-and-saving]] + +Loading and saving tokenizers is as simple as it is with models. Actually, it's based on the same two methods: `from_pretrained()` and `save_pretrained()`. These methods will load or save the algorithm used by the tokenizer (a bit like the *architecture* of the model) as well as its vocabulary (a bit like the *weights* of the model). + +Loading the BERT tokenizer trained with the same checkpoint as BERT is done the same way as loading the model, except we use the `BertTokenizer` class: + +```py +from transformers import BertTokenizer + +tokenizer = BertTokenizer.from_pretrained("bert-base-cased") +``` + +{#if fw === 'pt'} +Similar to `AutoModel`, the `AutoTokenizer` class will grab the proper tokenizer class in the library based on the checkpoint name, and can be used directly with any checkpoint: + +{:else} +Similar to `TFAutoModel`, the `AutoTokenizer` class will grab the proper tokenizer class in the library based on the checkpoint name, and can be used directly with any checkpoint: + +{/if} + +```py +from transformers import AutoTokenizer + +tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") +``` + +We can now use the tokenizer as shown in the previous section: + +```python +tokenizer("Using a Transformer network is simple") +``` + +```python out +{'input_ids': [101, 7993, 170, 11303, 1200, 2443, 1110, 3014, 102], + 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0], + 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]} +``` + +Saving a tokenizer is identical to saving a model: + +```py +tokenizer.save_pretrained("directory_on_my_computer") +``` + +We'll talk more about `token_type_ids` in [Chapter 3](/course/chapter3), and we'll explain the `attention_mask` key a little later. First, let's see how the `input_ids` are generated. To do this, we'll need to look at the intermediate methods of the tokenizer. + +## Encoding[[encoding]] + + + +Translating text to numbers is known as _encoding_. Encoding is done in a two-step process: the tokenization, followed by the conversion to input IDs. + +As we've seen, the first step is to split the text into words (or parts of words, punctuation symbols, etc.), usually called *tokens*. There are multiple rules that can govern that process, which is why we need to instantiate the tokenizer using the name of the model, to make sure we use the same rules that were used when the model was pretrained. + +The second step is to convert those tokens into numbers, so we can build a tensor out of them and feed them to the model. 
To do this, the tokenizer has a *vocabulary*, which is the part we download when we instantiate it with the `from_pretrained()` method. Again, we need to use the same vocabulary used when the model was pretrained. + +To get a better understanding of the two steps, we'll explore them separately. Note that we will use some methods that perform parts of the tokenization pipeline separately to show you the intermediate results of those steps, but in practice, you should call the tokenizer directly on your inputs (as shown in the section 2). + +### Tokenization[[tokenization]] + +The tokenization process is done by the `tokenize()` method of the tokenizer: + +```py +from transformers import AutoTokenizer + +tokenizer = AutoTokenizer.from_pretrained("bert-base-cased") + +sequence = "Using a Transformer network is simple" +tokens = tokenizer.tokenize(sequence) + +print(tokens) +``` + +The output of this method is a list of strings, or tokens: + +```python out +['Using', 'a', 'transform', '##er', 'network', 'is', 'simple'] +``` + +This tokenizer is a subword tokenizer: it splits the words until it obtains tokens that can be represented by its vocabulary. That's the case here with `transformer`, which is split into two tokens: `transform` and `##er`. + +### From tokens to input IDs[[from-tokens-to-input-ids]] + +The conversion to input IDs is handled by the `convert_tokens_to_ids()` tokenizer method: + +```py +ids = tokenizer.convert_tokens_to_ids(tokens) + +print(ids) +``` + +```python out +[7993, 170, 11303, 1200, 2443, 1110, 3014] +``` + +These outputs, once converted to the appropriate framework tensor, can then be used as inputs to a model as seen earlier in this chapter. + + + +✏️ **Try it out!** Replicate the two last steps (tokenization and conversion to input IDs) on the input sentences we used in section 2 ("I've been waiting for a HuggingFace course my whole life." and "I hate this so much!"). Check that you get the same input IDs we got earlier! + + + +## Decoding[[decoding]] + +*Decoding* is going the other way around: from vocabulary indices, we want to get a string. This can be done with the `decode()` method as follows: + +```py +decoded_string = tokenizer.decode([7993, 170, 11303, 1200, 2443, 1110, 3014]) +print(decoded_string) +``` + +```python out +'Using a Transformer network is simple' +``` + +Note that the `decode` method not only converts the indices back to tokens, but also groups together the tokens that were part of the same words to produce a readable sentence. This behavior will be extremely useful when we use models that predict new text (either text generated from a prompt, or for sequence-to-sequence problems like translation or summarization). + +By now you should understand the atomic operations a tokenizer can handle: tokenization, conversion to IDs, and converting IDs back to a string. However, we've just scraped the tip of the iceberg. In the following section, we'll take our approach to its limits and take a look at how to overcome them. diff --git a/chapters/rum/chapter2/5.mdx b/chapters/rum/chapter2/5.mdx new file mode 100644 index 000000000..33060505b --- /dev/null +++ b/chapters/rum/chapter2/5.mdx @@ -0,0 +1,338 @@ + + +# Handling multiple sequences[[handling-multiple-sequences]] + +{#if fw === 'pt'} + + + +{:else} + + + +{/if} + +{#if fw === 'pt'} + +{:else} + +{/if} + +In the previous section, we explored the simplest of use cases: doing inference on a single sequence of a small length. 
However, some questions emerge already: + +- How do we handle multiple sequences? +- How do we handle multiple sequences *of different lengths*? +- Are vocabulary indices the only inputs that allow a model to work well? +- Is there such a thing as too long a sequence? + +Let's see what kinds of problems these questions pose, and how we can solve them using the 🤗 Transformers API. + +## Models expect a batch of inputs[[models-expect-a-batch-of-inputs]] + +In the previous exercise you saw how sequences get translated into lists of numbers. Let's convert this list of numbers to a tensor and send it to the model: + +{#if fw === 'pt'} +```py +import torch +from transformers import AutoTokenizer, AutoModelForSequenceClassification + +checkpoint = "distilbert-base-uncased-finetuned-sst-2-english" +tokenizer = AutoTokenizer.from_pretrained(checkpoint) +model = AutoModelForSequenceClassification.from_pretrained(checkpoint) + +sequence = "I've been waiting for a HuggingFace course my whole life." + +tokens = tokenizer.tokenize(sequence) +ids = tokenizer.convert_tokens_to_ids(tokens) +input_ids = torch.tensor(ids) +# This line will fail. +model(input_ids) +``` + +```python out +IndexError: Dimension out of range (expected to be in range of [-1, 0], but got 1) +``` +{:else} +```py +import tensorflow as tf +from transformers import AutoTokenizer, TFAutoModelForSequenceClassification + +checkpoint = "distilbert-base-uncased-finetuned-sst-2-english" +tokenizer = AutoTokenizer.from_pretrained(checkpoint) +model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint) + +sequence = "I've been waiting for a HuggingFace course my whole life." + +tokens = tokenizer.tokenize(sequence) +ids = tokenizer.convert_tokens_to_ids(tokens) +input_ids = tf.constant(ids) +# This line will fail. +model(input_ids) +``` + +```py out +InvalidArgumentError: Input to reshape is a tensor with 14 values, but the requested shape has 196 [Op:Reshape] +``` +{/if} + +Oh no! Why did this fail? We followed the steps from the pipeline in section 2. + +The problem is that we sent a single sequence to the model, whereas 🤗 Transformers models expect multiple sentences by default. Here we tried to do everything the tokenizer did behind the scenes when we applied it to a `sequence`. But if you look closely, you'll see that the tokenizer didn't just convert the list of input IDs into a tensor, it added a dimension on top of it: + +{#if fw === 'pt'} +```py +tokenized_inputs = tokenizer(sequence, return_tensors="pt") +print(tokenized_inputs["input_ids"]) +``` + +```python out +tensor([[ 101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, + 2607, 2026, 2878, 2166, 1012, 102]]) +``` +{:else} +```py +tokenized_inputs = tokenizer(sequence, return_tensors="tf") +print(tokenized_inputs["input_ids"]) +``` + +```py out + +``` +{/if} + +Let's try again and add a new dimension: + +{#if fw === 'pt'} +```py +import torch +from transformers import AutoTokenizer, AutoModelForSequenceClassification + +checkpoint = "distilbert-base-uncased-finetuned-sst-2-english" +tokenizer = AutoTokenizer.from_pretrained(checkpoint) +model = AutoModelForSequenceClassification.from_pretrained(checkpoint) + +sequence = "I've been waiting for a HuggingFace course my whole life." 
+ +tokens = tokenizer.tokenize(sequence) +ids = tokenizer.convert_tokens_to_ids(tokens) + +input_ids = torch.tensor([ids]) +print("Input IDs:", input_ids) + +output = model(input_ids) +print("Logits:", output.logits) +``` +{:else} +```py +import tensorflow as tf +from transformers import AutoTokenizer, TFAutoModelForSequenceClassification + +checkpoint = "distilbert-base-uncased-finetuned-sst-2-english" +tokenizer = AutoTokenizer.from_pretrained(checkpoint) +model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint) + +sequence = "I've been waiting for a HuggingFace course my whole life." + +tokens = tokenizer.tokenize(sequence) +ids = tokenizer.convert_tokens_to_ids(tokens) + +input_ids = tf.constant([ids]) +print("Input IDs:", input_ids) + +output = model(input_ids) +print("Logits:", output.logits) +``` +{/if} + +We print the input IDs as well as the resulting logits — here's the output: + +{#if fw === 'pt'} +```python out +Input IDs: [[ 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]] +Logits: [[-2.7276, 2.8789]] +``` +{:else} +```py out +Input IDs: tf.Tensor( +[[ 1045 1005 2310 2042 3403 2005 1037 17662 12172 2607 2026 2878 + 2166 1012]], shape=(1, 14), dtype=int32) +Logits: tf.Tensor([[-2.7276208 2.8789377]], shape=(1, 2), dtype=float32) +``` +{/if} + +*Batching* is the act of sending multiple sentences through the model, all at once. If you only have one sentence, you can just build a batch with a single sequence: + +``` +batched_ids = [ids, ids] +``` + +This is a batch of two identical sequences! + + + +✏️ **Try it out!** Convert this `batched_ids` list into a tensor and pass it through your model. Check that you obtain the same logits as before (but twice)! + + + +Batching allows the model to work when you feed it multiple sentences. Using multiple sequences is just as simple as building a batch with a single sequence. There's a second issue, though. When you're trying to batch together two (or more) sentences, they might be of different lengths. If you've ever worked with tensors before, you know that they need to be of rectangular shape, so you won't be able to convert the list of input IDs into a tensor directly. To work around this problem, we usually *pad* the inputs. + +## Padding the inputs[[padding-the-inputs]] + +The following list of lists cannot be converted to a tensor: + +```py no-format +batched_ids = [ + [200, 200, 200], + [200, 200] +] +``` + +In order to work around this, we'll use *padding* to make our tensors have a rectangular shape. Padding makes sure all our sentences have the same length by adding a special word called the *padding token* to the sentences with fewer values. For example, if you have 10 sentences with 10 words and 1 sentence with 20 words, padding will ensure all the sentences have 20 words. In our example, the resulting tensor looks like this: + +```py no-format +padding_id = 100 + +batched_ids = [ + [200, 200, 200], + [200, 200, padding_id], +] +``` + +The padding token ID can be found in `tokenizer.pad_token_id`. 
Let's use it and send our two sentences through the model individually and batched together:
+
+{#if fw === 'pt'}
+```py no-format
+model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
+
+sequence1_ids = [[200, 200, 200]]
+sequence2_ids = [[200, 200]]
+batched_ids = [
+    [200, 200, 200],
+    [200, 200, tokenizer.pad_token_id],
+]
+
+print(model(torch.tensor(sequence1_ids)).logits)
+print(model(torch.tensor(sequence2_ids)).logits)
+print(model(torch.tensor(batched_ids)).logits)
+```
+
+```python out
+tensor([[ 1.5694, -1.3895]], grad_fn=<AddmmBackward>)
+tensor([[ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)
+tensor([[ 1.5694, -1.3895],
+        [ 1.3373, -1.2163]], grad_fn=<AddmmBackward>)
+```
+{:else}
+```py no-format
+model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)
+
+sequence1_ids = [[200, 200, 200]]
+sequence2_ids = [[200, 200]]
+batched_ids = [
+    [200, 200, 200],
+    [200, 200, tokenizer.pad_token_id],
+]
+
+print(model(tf.constant(sequence1_ids)).logits)
+print(model(tf.constant(sequence2_ids)).logits)
+print(model(tf.constant(batched_ids)).logits)
+```
+
+```py out
+tf.Tensor([[ 1.5693678 -1.3894581]], shape=(1, 2), dtype=float32)
+tf.Tensor([[ 0.5803005 -0.41252428]], shape=(1, 2), dtype=float32)
+tf.Tensor(
+[[ 1.5693681 -1.3894582]
+ [ 1.3373486 -1.2163193]], shape=(2, 2), dtype=float32)
+```
+{/if}
+
+There's something wrong with the logits in our batched predictions: the second row should be the same as the logits for the second sentence, but we've got completely different values!
+
+This is because the key feature of Transformer models is attention layers that *contextualize* each token. These will take into account the padding tokens since they attend to all of the tokens of a sequence. To get the same result when passing individual sentences of different lengths through the model or when passing a batch with the same sentences and padding applied, we need to tell those attention layers to ignore the padding tokens. This is done by using an attention mask.
+
+## Attention masks[[attention-masks]]
+
+*Attention masks* are tensors with the exact same shape as the input IDs tensor, filled with 0s and 1s: 1s indicate the corresponding tokens should be attended to, and 0s indicate the corresponding tokens should not be attended to (i.e., they should be ignored by the attention layers of the model).
+
+Let's complete the previous example with an attention mask:
+
+{#if fw === 'pt'}
+```py no-format
+batched_ids = [
+    [200, 200, 200],
+    [200, 200, tokenizer.pad_token_id],
+]
+
+attention_mask = [
+    [1, 1, 1],
+    [1, 1, 0],
+]
+
+outputs = model(torch.tensor(batched_ids), attention_mask=torch.tensor(attention_mask))
+print(outputs.logits)
+```
+
+```python out
+tensor([[ 1.5694, -1.3895],
+        [ 0.5803, -0.4125]], grad_fn=<AddmmBackward>)
+```
+{:else}
+```py no-format
+batched_ids = [
+    [200, 200, 200],
+    [200, 200, tokenizer.pad_token_id],
+]
+
+attention_mask = [
+    [1, 1, 1],
+    [1, 1, 0],
+]
+
+outputs = model(tf.constant(batched_ids), attention_mask=tf.constant(attention_mask))
+print(outputs.logits)
+```
+
+```py out
+tf.Tensor(
+[[ 1.5693681 -1.3894582 ]
+ [ 0.5803021 -0.41252586]], shape=(2, 2), dtype=float32)
+```
+{/if}
+
+Now we get the same logits for the second sentence in the batch.
+
+Notice how the last value of the second sequence is a padding ID, which is a 0 value in the attention mask.
+
+
+
+✏️ **Try it out!** Apply the tokenization manually on the two sentences used in section 2 ("I've been waiting for a HuggingFace course my whole life." and "I hate this so much!").
Pass them through the model and check that you get the same logits as in section 2. Now batch them together using the padding token, then create the proper attention mask. Check that you obtain the same results when going through the model!
+
+
+
+## Longer sequences[[longer-sequences]]
+
+With Transformer models, there is a limit to the lengths of the sequences we can pass to the models. Most models handle sequences of up to 512 or 1024 tokens, and will crash when asked to process longer sequences. There are two solutions to this problem:
+
+- Use a model with a longer supported sequence length.
+- Truncate your sequences.
+
+Models have different supported sequence lengths, and some specialize in handling very long sequences. [Longformer](https://huggingface.co/docs/transformers/model_doc/longformer) is one example, and another is [LED](https://huggingface.co/docs/transformers/model_doc/led). If you're working on a task that requires very long sequences, we recommend you take a look at those models.
+
+Otherwise, we recommend you truncate your sequences by specifying the `max_sequence_length` parameter:
+
+```py
+sequence = sequence[:max_sequence_length]
+```
diff --git a/chapters/rum/chapter2/6.mdx b/chapters/rum/chapter2/6.mdx
new file mode 100644
index 000000000..d26118501
--- /dev/null
+++ b/chapters/rum/chapter2/6.mdx
@@ -0,0 +1,164 @@
+
+
+# Putting it all together[[putting-it-all-together]]
+
+{#if fw === 'pt'}
+
+
+
+{:else}
+
+
+
+{/if}
+
+In the last few sections, we've been trying our best to do most of the work by hand. We've explored how tokenizers work and looked at tokenization, conversion to input IDs, padding, truncation, and attention masks.
+
+However, as we saw in section 2, the 🤗 Transformers API can handle all of this for us with a high-level function that we'll dive into here. When you call your `tokenizer` directly on the sentence, you get back inputs that are ready to pass through your model:
+
+```py
+from transformers import AutoTokenizer
+
+checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
+tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+
+sequence = "I've been waiting for a HuggingFace course my whole life."
+
+model_inputs = tokenizer(sequence)
+```
+
+Here, the `model_inputs` variable contains everything that's necessary for a model to operate well. For DistilBERT, that includes the input IDs as well as the attention mask. Other models that accept additional inputs will also have those output by the `tokenizer` object.
+
+As we'll see in some examples below, this method is very powerful. First, it can tokenize a single sequence:
+
+```py
+sequence = "I've been waiting for a HuggingFace course my whole life."
+
+model_inputs = tokenizer(sequence)
+```
+
+It also handles multiple sequences at a time, with no change in the API:
+
+```py
+sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]
+
+model_inputs = tokenizer(sequences)
+```
+
+It can pad according to several strategies:
+
+```py
+# Will pad the sequences up to the length of the longest sequence in the batch
+model_inputs = tokenizer(sequences, padding="longest")
+
+# Will pad the sequences up to the model max length
+# (512 for BERT or DistilBERT)
+model_inputs = tokenizer(sequences, padding="max_length")
+
+# Will pad the sequences up to the specified max length
+model_inputs = tokenizer(sequences, padding="max_length", max_length=8)
+```
+
+It can also truncate sequences:
+
+```py
+sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]
+
+# Will truncate the sequences that are longer than the model max length
+# (512 for BERT or DistilBERT)
+model_inputs = tokenizer(sequences, truncation=True)
+
+# Will truncate the sequences that are longer than the specified max length
+model_inputs = tokenizer(sequences, max_length=8, truncation=True)
+```
+
+The `tokenizer` object can handle the conversion to specific framework tensors, which can then be directly sent to the model. For example, in the following code sample we are prompting the tokenizer to return tensors from the different frameworks — `"pt"` returns PyTorch tensors, `"tf"` returns TensorFlow tensors, and `"np"` returns NumPy arrays:
+
+```py
+sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"]
+
+# Returns PyTorch tensors
+model_inputs = tokenizer(sequences, padding=True, return_tensors="pt")
+
+# Returns TensorFlow tensors
+model_inputs = tokenizer(sequences, padding=True, return_tensors="tf")
+
+# Returns NumPy arrays
+model_inputs = tokenizer(sequences, padding=True, return_tensors="np")
+```
+
+## Special tokens[[special-tokens]]
+
+If we take a look at the input IDs returned by the tokenizer, we will see they are a tiny bit different from what we had earlier:
+
+```py
+sequence = "I've been waiting for a HuggingFace course my whole life."
+
+model_inputs = tokenizer(sequence)
+print(model_inputs["input_ids"])
+
+tokens = tokenizer.tokenize(sequence)
+ids = tokenizer.convert_tokens_to_ids(tokens)
+print(ids)
+```
+
+```python out
+[101, 1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012, 102]
+[1045, 1005, 2310, 2042, 3403, 2005, 1037, 17662, 12172, 2607, 2026, 2878, 2166, 1012]
+```
+
+One token ID was added at the beginning, and one at the end. Let's decode the two sequences of IDs above to see what this is about:
+
+```py
+print(tokenizer.decode(model_inputs["input_ids"]))
+print(tokenizer.decode(ids))
+```
+
+```python out
+"[CLS] i've been waiting for a huggingface course my whole life. [SEP]"
+"i've been waiting for a huggingface course my whole life."
+```
+
+The tokenizer added the special word `[CLS]` at the beginning and the special word `[SEP]` at the end. This is because the model was pretrained with those, so to get the same results for inference we need to add them as well. Note that some models don't add special words, or add different ones; models may also add these special words only at the beginning, or only at the end. In any case, the tokenizer knows which ones are expected and will deal with this for you.
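+
+If you want to see exactly which special tokens a given tokenizer relies on, or to leave them out, the tokenizer exposes a few options for that. The sketch below assumes the same `distilbert-base-uncased-finetuned-sst-2-english` checkpoint as above; `all_special_tokens`, `add_special_tokens`, and `skip_special_tokens` are standard options of 🤗 Transformers tokenizers, but it's worth double-checking how your particular model uses them:
+
+```py
+from transformers import AutoTokenizer
+
+checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
+tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+
+sequence = "I've been waiting for a HuggingFace course my whole life."
+
+# List the special tokens this tokenizer knows about
+print(tokenizer.all_special_tokens)
+
+# Encode without adding [CLS] and [SEP]
+ids_without_special = tokenizer(sequence, add_special_tokens=False)["input_ids"]
+
+# Or drop the special tokens when decoding
+ids_with_special = tokenizer(sequence)["input_ids"]
+print(tokenizer.decode(ids_with_special, skip_special_tokens=True))
+```
+
+In most cases, though, you'll simply let the tokenizer add and handle these tokens for you, as in the rest of this chapter.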
+ +## Wrapping up: From tokenizer to model[[wrapping-up-from-tokenizer-to-model]] + +Now that we've seen all the individual steps the `tokenizer` object uses when applied on texts, let's see one final time how it can handle multiple sequences (padding!), very long sequences (truncation!), and multiple types of tensors with its main API: + +{#if fw === 'pt'} +```py +import torch +from transformers import AutoTokenizer, AutoModelForSequenceClassification + +checkpoint = "distilbert-base-uncased-finetuned-sst-2-english" +tokenizer = AutoTokenizer.from_pretrained(checkpoint) +model = AutoModelForSequenceClassification.from_pretrained(checkpoint) +sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"] + +tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="pt") +output = model(**tokens) +``` +{:else} +```py +import tensorflow as tf +from transformers import AutoTokenizer, TFAutoModelForSequenceClassification + +checkpoint = "distilbert-base-uncased-finetuned-sst-2-english" +tokenizer = AutoTokenizer.from_pretrained(checkpoint) +model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint) +sequences = ["I've been waiting for a HuggingFace course my whole life.", "So have I!"] + +tokens = tokenizer(sequences, padding=True, truncation=True, return_tensors="tf") +output = model(**tokens) +``` +{/if} diff --git a/chapters/rum/chapter2/7.mdx b/chapters/rum/chapter2/7.mdx new file mode 100644 index 000000000..657aa28e9 --- /dev/null +++ b/chapters/rum/chapter2/7.mdx @@ -0,0 +1,18 @@ +# Basic usage completed![[basic-usage-completed]] + + + +Great job following the course up to here! To recap, in this chapter you: + +- Learned the basic building blocks of a Transformer model. +- Learned what makes up a tokenization pipeline. +- Saw how to use a Transformer model in practice. +- Learned how to leverage a tokenizer to convert text to tensors that are understandable by the model. +- Set up a tokenizer and a model together to get from text to predictions. +- Learned the limitations of input IDs, and learned about attention masks. +- Played around with versatile and configurable tokenizer methods. + +From now on, you should be able to freely navigate the 🤗 Transformers docs: the vocabulary will sound familiar, and you've already seen the methods that you'll use the majority of the time. diff --git a/chapters/rum/chapter2/8.mdx b/chapters/rum/chapter2/8.mdx new file mode 100644 index 000000000..c41f27936 --- /dev/null +++ b/chapters/rum/chapter2/8.mdx @@ -0,0 +1,310 @@ + + + + +# End-of-chapter quiz[[end-of-chapter-quiz]] + + + +### 1. What is the order of the language modeling pipeline? + + + +### 2. How many dimensions does the tensor output by the base Transformer model have, and what are they? + +