diff --git a/framework/docs/source/index.rst b/framework/docs/source/index.rst
index 2013fc806848..e91cd59bcd55 100644
--- a/framework/docs/source/index.rst
+++ b/framework/docs/source/index.rst
@@ -40,6 +40,7 @@ A learning-oriented series of federated learning tutorials, the best place to st
     tutorial-series-use-a-federated-learning-strategy-pytorch
     tutorial-series-build-a-strategy-from-scratch-pytorch
     tutorial-series-customize-the-client-pytorch
+    notebooks/index
 
 .. toctree::
     :maxdepth: 1
diff --git a/framework/docs/source/notebooks/index.rst b/framework/docs/source/notebooks/index.rst
new file mode 100644
index 000000000000..d796616b9b29
--- /dev/null
+++ b/framework/docs/source/notebooks/index.rst
@@ -0,0 +1,30 @@
+:og:description: Run the Flower Federated Learning Tutorial in Notebooks
+.. meta::
+    :description: Run the Flower Federated Learning Tutorial in Notebooks
+
+Tutorial Notebooks
+==================
+
+Instead of following the main tutorials in the documentation, you can also run them in
+interactive Jupyter notebooks. This allows you to execute code snippets and experiment
+with running Flower in a more interactive way.
+
+.. |flower_how_to_run_simulations_link| replace:: How-to Run Simulations
+
+.. _flower_how_to_run_simulations_link: how-to-run-simulations.html
+
+.. note::
+
+    The notebooks use the ``run_simulation`` approach, but the preferred way to run
+    simulations is using the ``flwr run`` approach. For a comprehensive guide on how to
+    set up and run Flower simulations, please read the
+    |flower_how_to_run_simulations_link|_ guide.
+
+.. toctree::
+    :maxdepth: 1
+
+    tutorial-series-what-is-federated-learning
+    tutorial-series-get-started-with-flower-pytorch
+    tutorial-series-use-a-federated-learning-strategy-pytorch
+    tutorial-series-build-a-strategy-from-scratch-pytorch
+    tutorial-series-customize-the-client-pytorch
diff --git a/framework/docs/source/tutorial-series-build-a-strategy-from-scratch-pytorch.ipynb b/framework/docs/source/notebooks/tutorial-series-build-a-strategy-from-scratch-pytorch.ipynb
similarity index 100%
rename from framework/docs/source/tutorial-series-build-a-strategy-from-scratch-pytorch.ipynb
rename to framework/docs/source/notebooks/tutorial-series-build-a-strategy-from-scratch-pytorch.ipynb
diff --git a/framework/docs/source/tutorial-series-customize-the-client-pytorch.ipynb b/framework/docs/source/notebooks/tutorial-series-customize-the-client-pytorch.ipynb
similarity index 100%
rename from framework/docs/source/tutorial-series-customize-the-client-pytorch.ipynb
rename to framework/docs/source/notebooks/tutorial-series-customize-the-client-pytorch.ipynb
diff --git a/framework/docs/source/tutorial-series-get-started-with-flower-pytorch.ipynb b/framework/docs/source/notebooks/tutorial-series-get-started-with-flower-pytorch.ipynb
similarity index 100%
rename from framework/docs/source/tutorial-series-get-started-with-flower-pytorch.ipynb
rename to framework/docs/source/notebooks/tutorial-series-get-started-with-flower-pytorch.ipynb
diff --git a/framework/docs/source/tutorial-series-use-a-federated-learning-strategy-pytorch.ipynb b/framework/docs/source/notebooks/tutorial-series-use-a-federated-learning-strategy-pytorch.ipynb
similarity index 100%
rename from framework/docs/source/tutorial-series-use-a-federated-learning-strategy-pytorch.ipynb
rename to framework/docs/source/notebooks/tutorial-series-use-a-federated-learning-strategy-pytorch.ipynb
diff --git a/framework/docs/source/tutorial-series-what-is-federated-learning.ipynb
b/framework/docs/source/notebooks/tutorial-series-what-is-federated-learning.ipynb similarity index 92% rename from framework/docs/source/tutorial-series-what-is-federated-learning.ipynb rename to framework/docs/source/notebooks/tutorial-series-what-is-federated-learning.ipynb index a94dc2910959..0bf883973128 100644 --- a/framework/docs/source/tutorial-series-what-is-federated-learning.ipynb +++ b/framework/docs/source/notebooks/tutorial-series-what-is-federated-learning.ipynb @@ -30,13 +30,13 @@ "In machine learning, we have a model, and we have data. The model could be a neural network (as depicted here), or something else, like classical linear regression.\n", "\n", "
\n", - " \"Model\n", + " \"Model\n", "
\n", "\n", "We train the model using the data to perform a useful task. A task could be to detect objects in images, transcribe an audio recording, or play a game like Go.\n", "\n", "
\n", - " \"Train\n", + " \"Train\n", "
\n", "\n", "In practice, the training data we work with doesn't originate on the machine we train the model on. \n", @@ -44,25 +44,25 @@ "This data gets created \"somewhere else\". For instance, the data can originate on a smartphone by the user interacting with an app, a car collecting sensor data, a laptop receiving input via the keyboard, or a smart speaker listening to someone trying to sing a song.\n", "\n", "
\n", - " \"Data\n", + " \"Data\n", "
\n", "\n", "What's also important to mention, this \"somewhere else\" is usually not just one place, it's many places. It could be several devices all running the same app. But it could also be several organizations, all generating data for the same task.\n", "\n", "
\n", - " \"Data\n", + " \"Data\n", "
\n", "\n", "So to use machine learning, or any kind of data analysis, the approach that has been used in the past was to collect all this data on a central server. This server can be located somewhere in a data center, or somewhere in the cloud.\n", "\n", "
\n", - " \"Central\n", + " \"Central\n", "
\n", "\n", "Once all the data is collected in one place, we can finally use machine learning algorithms to train our model on the data. This is the machine learning approach that we've basically always relied on.\n", "\n", "
\n", - " \"Central\n", + " \"Central\n", "
" ] }, @@ -75,13 +75,13 @@ "This classical machine learning approach we've just seen can be used in some cases. Great examples include categorizing holiday photos, or analyzing web traffic. Cases, where all the data is naturally available on a centralized server.\n", "\n", "
\n", - " \"Centralized\n", + " \"Centralized\n", "
\n", "\n", "But the approach can not be used in many other cases. Cases, where the data is not available on a centralized server, or cases where the data available on one server is not enough to train a good model.\n", "\n", "
\n", - " \"Centralized\n", + " \"Centralized\n", "
\n", "\n", "There are many reasons why the classical centralized machine learning approach does not work for a large number of highly important real-world use cases. Those reasons include:\n", @@ -123,7 +123,7 @@ "We start by initializing the model on the server. This is exactly the same in classic centralized learning: we initialize the model parameters, either randomly or from a previously saved checkpoint.\n", "\n", "
\n", - " \"Initialize\n", + " \"Initialize\n", "
\n", "\n", "#### Step 1: Send model to a number of connected organizations/devices (client nodes)\n", @@ -131,7 +131,7 @@ "Next, we send the parameters of the global model to the connected client nodes (think: edge devices like smartphones or servers belonging to organizations). This is to ensure that each participating node starts its local training using the same model parameters. We often use only a few of the connected nodes instead of all nodes. The reason for this is that selecting more and more client nodes has diminishing returns.\n", "\n", "
\n", - " \"Send\n", + " \"Send\n", "
\n", "\n", "#### Step 2: Train model locally on the data of each organization/device (client node)\n", @@ -139,7 +139,7 @@ "Now that all (selected) client nodes have the latest version of the global model parameters, they start the local training. They use their own local dataset to train their own local model. They don't train the model until full convergence, but they only train for a little while. This could be as little as one epoch on the local data, or even just a few steps (mini-batches).\n", "\n", "
\n", - " \"Train\n", + " \"Train\n", "
\n", "\n", "#### Step 3: Return model updates back to the server\n", @@ -147,7 +147,7 @@ "After local training, each client node has a slightly different version of the model parameters they originally received. The parameters are all different because each client node has different examples in its local dataset. The client nodes then send those model updates back to the server. The model updates they send can either be the full model parameters or just the gradients that were accumulated during local training.\n", "\n", "
\n", - " \"Send\n", + " \"Send\n", "
\n", "\n", "#### Step 4: Aggregate model updates into a new global model\n", @@ -157,7 +157,7 @@ "In order to get one single model, we have to combine all the model updates we received from the client nodes. This process is called *aggregation*, and there are many different ways to do it. The most basic way is called *Federated Averaging* ([McMahan et al., 2016](https://arxiv.org/abs/1602.05629)), often abbreviated as *FedAvg*. *FedAvg* takes the 100 model updates and, as the name suggests, averages them. To be more precise, it takes the *weighted average* of the model updates, weighted by the number of examples each client used for training. The weighting is important to make sure that each data example has the same \"influence\" on the resulting global model. If one client has 10 examples, and another client has 100 examples, then - without weighting - each of the 10 examples would influence the global model ten times as much as each of the 100 examples.\n", "\n", "
\n", - " \"Aggregate\n", + " \"Aggregate\n", "
\n", "\n", "#### Step 5: Repeat steps 1 to 4 until the model converges\n", @@ -193,7 +193,7 @@ "Federated learning, federated evaluation, and federated analytics require infrastructure to move machine learning models back and forth, train and evaluate them on local data, and then aggregate the updated models. Flower provides the infrastructure to do exactly that in an easy, scalable, and secure way. In short, Flower presents a unified approach to federated learning, analytics, and evaluation. It allows the user to federate any workload, any ML framework, and any programming language.\n", "\n", "
\n", - " \"Flower\n", + " \"Flower\n", "
" ] }, diff --git a/framework/docs/source/tutorial-series-build-a-strategy-from-scratch-pytorch.rst b/framework/docs/source/tutorial-series-build-a-strategy-from-scratch-pytorch.rst new file mode 100644 index 000000000000..c569109b6409 --- /dev/null +++ b/framework/docs/source/tutorial-series-build-a-strategy-from-scratch-pytorch.rst @@ -0,0 +1,286 @@ +Build a strategy from scratch +============================= + +Welcome to the third part of the Flower federated learning tutorial. In previous parts +of this tutorial, we introduced federated learning with PyTorch and the Flower framework +(:doc:`part 1 `) and we learned how +strategies can be used to customize the execution on both the server and the clients +(:doc:`part 2 `). + +In this tutorial, we’ll continue to customize the federated learning system we built +previously by creating a custom version of FedAvg using the Flower framework, Flower +Datasets, and PyTorch. + + `Star Flower on GitHub `__ ⭐️ and join the Flower + community on Flower Discuss and the Flower Slack to connect, ask questions, and get + help: - `Join Flower Discuss `__ We’d love to hear from + you in the ``Introduction`` topic! If anything is unclear, post in ``Flower Help - + Beginners``. - `Join Flower Slack `__ We’d love to + hear from you in the ``#introductions`` channel! If anything is unclear, head over + to the ``#questions`` channel. + +Let’s build a new ``Strategy`` from scratch! 🌼 + +Preparation +----------- + +Before we begin with the actual code, let’s make sure that we have everything we need. + +Installing dependencies +~~~~~~~~~~~~~~~~~~~~~~~ + +.. note:: + + If you've completed part 1 of the tutorial, you can skip this step. + +First, we install the Flower package ``flwr``: + +.. code-block:: shell + + # In a new Python environment + $ pip install -U flwr + +Then, we create a new Flower app called ``flower-tutorial`` using the PyTorch template. +We also specify a username (``flwrlabs``) for the project: + +.. code-block:: shell + + $ flwr new flower-tutorial --framework pytorch --username flwrlabs + +After running the command, a new directory called ``flower-tutorial`` will be created. +It should have the following structure: + +.. code-block:: shell + + flower-tutorial + ├── README.md + ├── flower_tutorial + │ ├── __init__.py + │ ├── client_app.py # Defines your ClientApp + │ ├── server_app.py # Defines your ServerApp + │ └── task.py # Defines your model, training and data loading + ├── pyproject.toml # Project metadata like dependencies and configs + └── README.md + +Next, we install the project and its dependencies, which are specified in the +``pyproject.toml`` file: + +.. code-block:: shell + + $ cd flower-tutorial + $ pip install -e . + +Build a Strategy from scratch +----------------------------- + +Let’s overwrite the ``configure_fit`` method such that it passes a higher learning rate +(potentially also other hyperparameters) to the optimizer of a fraction of the clients. +We will keep the sampling of the clients as it is in ``FedAvg`` and then change the +configuration dictionary (one of the ``FitIns`` attributes). Create a new module called +``strategy.py`` in the ``flower_tutorial`` directory. Next, we define a new class +``FedCustom`` that inherits from ``Strategy``. Copy and paste the following code into +``strategy.py``: + +.. 
code-block:: python + + from typing import Dict, List, Optional, Tuple, Union + + from flwr.common import ( + EvaluateIns, + EvaluateRes, + FitIns, + FitRes, + Parameters, + Scalar, + ndarrays_to_parameters, + parameters_to_ndarrays, + ) + from flwr.server.client_manager import ClientManager + from flwr.server.client_proxy import ClientProxy + from flwr.server.strategy import Strategy + from flwr.server.strategy.aggregate import aggregate, weighted_loss_avg + + from flower_tutorial.task import Net, get_weights + + + class FedCustom(Strategy): + def __init__( + self, + fraction_fit: float = 1.0, + fraction_evaluate: float = 1.0, + min_fit_clients: int = 2, + min_evaluate_clients: int = 2, + min_available_clients: int = 2, + ) -> None: + super().__init__() + self.fraction_fit = fraction_fit + self.fraction_evaluate = fraction_evaluate + self.min_fit_clients = min_fit_clients + self.min_evaluate_clients = min_evaluate_clients + self.min_available_clients = min_available_clients + + def __repr__(self) -> str: + return "FedCustom" + + def initialize_parameters( + self, client_manager: ClientManager + ) -> Optional[Parameters]: + """Initialize global model parameters.""" + net = Net() + ndarrays = get_weights(net) + return ndarrays_to_parameters(ndarrays) + + def configure_fit( + self, server_round: int, parameters: Parameters, client_manager: ClientManager + ) -> List[Tuple[ClientProxy, FitIns]]: + """Configure the next round of training.""" + + # Sample clients + sample_size, min_num_clients = self.num_fit_clients( + client_manager.num_available() + ) + clients = client_manager.sample( + num_clients=sample_size, min_num_clients=min_num_clients + ) + + # Create custom configs + n_clients = len(clients) + half_clients = n_clients // 2 + standard_config = {"lr": 0.001} + higher_lr_config = {"lr": 0.003} + fit_configurations = [] + for idx, client in enumerate(clients): + if idx < half_clients: + fit_configurations.append((client, FitIns(parameters, standard_config))) + else: + fit_configurations.append( + (client, FitIns(parameters, higher_lr_config)) + ) + return fit_configurations + + def aggregate_fit( + self, + server_round: int, + results: List[Tuple[ClientProxy, FitRes]], + failures: List[Union[Tuple[ClientProxy, FitRes], BaseException]], + ) -> Tuple[Optional[Parameters], Dict[str, Scalar]]: + """Aggregate fit results using weighted average.""" + + weights_results = [ + (parameters_to_ndarrays(fit_res.parameters), fit_res.num_examples) + for _, fit_res in results + ] + parameters_aggregated = ndarrays_to_parameters(aggregate(weights_results)) + metrics_aggregated = {} + return parameters_aggregated, metrics_aggregated + + def configure_evaluate( + self, server_round: int, parameters: Parameters, client_manager: ClientManager + ) -> List[Tuple[ClientProxy, EvaluateIns]]: + """Configure the next round of evaluation.""" + if self.fraction_evaluate == 0.0: + return [] + config = {} + evaluate_ins = EvaluateIns(parameters, config) + + # Sample clients + sample_size, min_num_clients = self.num_evaluation_clients( + client_manager.num_available() + ) + clients = client_manager.sample( + num_clients=sample_size, min_num_clients=min_num_clients + ) + + # Return client/config pairs + return [(client, evaluate_ins) for client in clients] + + def aggregate_evaluate( + self, + server_round: int, + results: List[Tuple[ClientProxy, EvaluateRes]], + failures: List[Union[Tuple[ClientProxy, EvaluateRes], BaseException]], + ) -> Tuple[Optional[float], Dict[str, Scalar]]: + """Aggregate evaluation losses 
using weighted average.""" + + if not results: + return None, {} + + loss_aggregated = weighted_loss_avg( + [ + (evaluate_res.num_examples, evaluate_res.loss) + for _, evaluate_res in results + ] + ) + metrics_aggregated = {} + return loss_aggregated, metrics_aggregated + + def evaluate( + self, server_round: int, parameters: Parameters + ) -> Optional[Tuple[float, Dict[str, Scalar]]]: + """Evaluate global model parameters using an evaluation function.""" + + # Let's assume we won't perform the global model evaluation on the server side. + return None + + def num_fit_clients(self, num_available_clients: int) -> Tuple[int, int]: + """Return sample size and required number of clients.""" + num_clients = int(num_available_clients * self.fraction_fit) + return max(num_clients, self.min_fit_clients), self.min_available_clients + + def num_evaluation_clients(self, num_available_clients: int) -> Tuple[int, int]: + """Use a fraction of available clients for evaluation.""" + num_clients = int(num_available_clients * self.fraction_evaluate) + return max(num_clients, self.min_evaluate_clients), self.min_available_clients + +The only thing left is to use the newly created custom Strategy ``FedCustom`` when +starting the experiment. In the ``server_app.py`` file, import the custom strategy and +use it in ``server_fn``: + +.. code-block:: python + + from flower_tutorial.strategy import FedCustom + + + def server_fn(context: Context): + # Read from config + num_rounds = context.run_config["num-server-rounds"] + + # Define strategy + strategy = FedCustom() + config = ServerConfig(num_rounds=num_rounds) + + return ServerAppComponents(strategy=strategy, config=config) + + + # Create ServerApp + app = ServerApp(server_fn=server_fn) + +Finally, we run the simulation. + +.. code-block:: shell + + $ flwr run . + +Recap +----- + +In this tutorial, we’ve seen how to implement a custom strategy. A custom strategy +enables granular control over client node configuration, result aggregation, and more. +To define a custom strategy, you only have to overwrite the abstract methods of the +(abstract) base class ``Strategy``. To make custom strategies even more powerful, you +can pass custom functions to the constructor of your new class (``__init__``) and then +call these functions whenever needed. + +Next steps +---------- + +Before you continue, make sure to join the Flower community on Flower Discuss (`Join +Flower Discuss `__) and on Slack (`Join Slack +`__). + +There’s a dedicated ``#questions`` channel if you need help, but we’d also love to hear +who you are in ``#introductions``! + +The :doc:`Flower Federated Learning Tutorial - Part 4 +` introduces ``Client``, the flexible API +underlying ``NumPyClient``. diff --git a/framework/docs/source/tutorial-series-customize-the-client-pytorch.rst b/framework/docs/source/tutorial-series-customize-the-client-pytorch.rst new file mode 100644 index 000000000000..f2a6d4397542 --- /dev/null +++ b/framework/docs/source/tutorial-series-customize-the-client-pytorch.rst @@ -0,0 +1,812 @@ +Customize the client +==================== + +Welcome to the fourth part of the Flower federated learning tutorial. In the previous +parts of this tutorial, we introduced federated learning with PyTorch and Flower +(:doc:`part 1 `), we learned how +strategies can be used to customize the execution on both the server and the clients +(:doc:`part 2 `) and we built +our own custom strategy from scratch (:doc:`part 3 +`). 
+
+In this final tutorial, we revisit ``NumPyClient`` and introduce a new base class for
+building clients, simply named ``Client``. In previous parts of this tutorial, we’ve
+based our client on ``NumPyClient``, a convenience class which makes it easy to work
+with machine learning libraries that have good NumPy interoperability. With ``Client``,
+we gain a lot of flexibility that we didn’t have before, but we’ll also have to do a few
+things that we didn’t have to do before.
+
+    `Star Flower on GitHub <https://github.com/adap/flower>`__ ⭐️ and join the Flower
+    community on Flower Discuss and the Flower Slack to connect, ask questions, and get
+    help: - `Join Flower Discuss <https://discuss.flower.ai/>`__ We’d love to hear from
+    you in the ``Introduction`` topic! If anything is unclear, post in ``Flower Help -
+    Beginners``. - `Join Flower Slack <https://flower.ai/join-slack>`__ We’d love to
+    hear from you in the ``#introductions`` channel! If anything is unclear, head over
+    to the ``#questions`` channel.
+
+Let’s go deeper and see what it takes to move from ``NumPyClient`` to ``Client``! 🌼
+
+Step 0: Preparation
+-------------------
+
+Before we begin with the actual code, let’s make sure that we have everything we need.
+
+Installing dependencies
+~~~~~~~~~~~~~~~~~~~~~~~
+
+.. note::
+
+    If you've completed part 1 of the tutorial, you can skip this step.
+
+First, we install the Flower package ``flwr``:
+
+.. code-block:: shell
+
+    # In a new Python environment
+    $ pip install -U flwr
+
+Then, we create a new Flower app called ``flower-tutorial`` using the PyTorch template.
+We also specify a username (``flwrlabs``) for the project:
+
+.. code-block:: shell
+
+    $ flwr new flower-tutorial --framework pytorch --username flwrlabs
+
+After running the command, a new directory called ``flower-tutorial`` will be created.
+It should have the following structure:
+
+.. code-block:: shell
+
+    flower-tutorial
+    ├── README.md
+    ├── flower_tutorial
+    │   ├── __init__.py
+    │   ├── client_app.py   # Defines your ClientApp
+    │   ├── server_app.py   # Defines your ServerApp
+    │   └── task.py         # Defines your model, training and data loading
+    └── pyproject.toml      # Project metadata like dependencies and configs
+
+Next, we install the project and its dependencies, which are specified in the
+``pyproject.toml`` file:
+
+.. code-block:: shell
+
+    $ cd flower-tutorial
+    $ pip install -e .
+
+Step 1: Revisiting NumPyClient
+------------------------------
+
+So far, we’ve implemented our client by subclassing ``flwr.client.NumPyClient``. The two
+methods that were implemented in ``client_app.py`` are ``fit`` and ``evaluate``.
+
+.. code-block:: python
+
+    class FlowerClient(NumPyClient):
+        def __init__(self, net, trainloader, valloader, local_epochs):
+            self.net = net
+            self.trainloader = trainloader
+            self.valloader = valloader
+            self.local_epochs = local_epochs
+            self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
+            self.net.to(self.device)
+
+        def fit(self, parameters, config):
+            set_weights(self.net, parameters)
+            train_loss = train(
+                self.net,
+                self.trainloader,
+                self.local_epochs,
+                self.device,
+            )
+            return (
+                get_weights(self.net),
+                len(self.trainloader.dataset),
+                {"train_loss": train_loss},
+            )
+
+        def evaluate(self, parameters, config):
+            set_weights(self.net, parameters)
+            loss, accuracy = test(self.net, self.valloader, self.device)
+            return loss, len(self.valloader.dataset), {"accuracy": accuracy}
+
+Then, we have the function ``client_fn`` that is used by Flower to create the
+``FlowerClient`` instances on demand.
+Finally, we create the ``ClientApp`` and pass the ``client_fn`` to it.
+
+.. code-block:: python
+
+    def client_fn(context: Context):
+        # Load model and data
+        net = Net()
+        partition_id = context.node_config["partition-id"]
+        num_partitions = context.node_config["num-partitions"]
+        trainloader, valloader = load_data(partition_id, num_partitions)
+        local_epochs = context.run_config["local-epochs"]
+
+        # Return Client instance
+        return FlowerClient(net, trainloader, valloader, local_epochs).to_client()
+
+
+    # Flower ClientApp
+    app = ClientApp(
+        client_fn,
+    )
+
+We’ve seen this before, there’s nothing new so far. Next, in ``server_app.py``, the
+number of federated learning rounds is preconfigured in the ``ServerConfig``, and in the
+same module, the ``ServerApp`` is created with this config:
+
+.. code-block:: python
+
+    def server_fn(context: Context):
+        # Read from config
+        num_rounds = context.run_config["num-server-rounds"]
+        fraction_fit = context.run_config["fraction-fit"]
+
+        # Initialize model parameters
+        ndarrays = get_weights(Net())
+        parameters = ndarrays_to_parameters(ndarrays)
+
+        # Define strategy
+        strategy = FedAvg(
+            fraction_fit=fraction_fit,
+            fraction_evaluate=1.0,
+            min_available_clients=2,
+            initial_parameters=parameters,
+        )
+        config = ServerConfig(num_rounds=num_rounds)
+
+        return ServerAppComponents(strategy=strategy, config=config)
+
+
+    # Create ServerApp
+    app = ServerApp(server_fn=server_fn)
+
+Finally, we run the simulation to see the output we get:
+
+.. code-block:: shell
+
+    $ flwr run .
+
+This works as expected: ten clients train for three rounds of federated learning.
+
+Let’s dive a little bit deeper and discuss how Flower executes this simulation. Whenever
+a client is selected to do some work, under the hood, Flower launches the ``ClientApp``
+object, which in turn calls the function ``client_fn`` to create an instance of our
+``FlowerClient`` (along with loading the model and the data).
+
+But here’s the perhaps surprising part: Flower doesn’t actually use the ``FlowerClient``
+object directly. Instead, it wraps the object to make it look like a subclass of
+``flwr.client.Client``, not ``flwr.client.NumPyClient``. In fact, the Flower core
+framework doesn’t know how to handle ``NumPyClient``\ ’s, it only knows how to handle
+``Client``\ ’s. ``NumPyClient`` is just a convenience abstraction built on top of
+``Client``.
+
+Instead of building on top of ``NumPyClient``, we can directly build on top of
+``Client``.
+
+Step 2: Moving from ``NumPyClient`` to ``Client``
+-------------------------------------------------
+
+Let’s try to do the same thing using ``Client`` instead of ``NumPyClient``. Create a new
+file called ``custom_client_app.py`` and copy the following code into it:
+
+.. 
code-block:: python + + from typing import List + + import numpy as np + import torch + from flwr.client import Client, ClientApp + from flwr.common import ( + Code, + Context, + EvaluateIns, + EvaluateRes, + FitIns, + FitRes, + GetParametersIns, + GetParametersRes, + Status, + ndarrays_to_parameters, + parameters_to_ndarrays, + ) + + from flower_tutorial.task import Net, get_weights, load_data, set_weights, test, train + + + class FlowerClient(Client): + def __init__(self, partition_id, net, trainloader, valloader, local_epochs): + self.partition_id = partition_id + self.net = net + self.trainloader = trainloader + self.valloader = valloader + self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") + self.local_epochs = local_epochs + + def get_parameters(self, ins: GetParametersIns) -> GetParametersRes: + print(f"[Client {self.partition_id}] get_parameters") + + # Get parameters as a list of NumPy ndarray's + ndarrays: List[np.ndarray] = get_weights(self.net) + + # Serialize ndarray's into a Parameters object + parameters = ndarrays_to_parameters(ndarrays) + + # Build and return response + status = Status(code=Code.OK, message="Success") + return GetParametersRes( + status=status, + parameters=parameters, + ) + + def fit(self, ins: FitIns) -> FitRes: + print(f"[Client {self.partition_id}] fit, config: {ins.config}") + + # Deserialize parameters to NumPy ndarray's + parameters_original = ins.parameters + ndarrays_original = parameters_to_ndarrays(parameters_original) + + # Update local model, train, get updated parameters + set_weights(self.net, ndarrays_original) + train(self.net, self.trainloader, self.local_epochs, self.device) + ndarrays_updated = get_weights(self.net) + + # Serialize ndarray's into a Parameters object + parameters_updated = ndarrays_to_parameters(ndarrays_updated) + + # Build and return response + status = Status(code=Code.OK, message="Success") + return FitRes( + status=status, + parameters=parameters_updated, + num_examples=len(self.trainloader), + metrics={}, + ) + + def evaluate(self, ins: EvaluateIns) -> EvaluateRes: + print(f"[Client {self.partition_id}] evaluate, config: {ins.config}") + + # Deserialize parameters to NumPy ndarray's + parameters_original = ins.parameters + ndarrays_original = parameters_to_ndarrays(parameters_original) + + set_weights(self.net, ndarrays_original) + loss, accuracy = test(self.net, self.valloader, self.device) + + # Build and return response + status = Status(code=Code.OK, message="Success") + return EvaluateRes( + status=status, + loss=float(loss), + num_examples=len(self.valloader), + metrics={"accuracy": float(accuracy)}, + ) + + + def client_fn(context: Context) -> Client: + net = Net() + partition_id = context.node_config["partition-id"] + num_partitions = context.node_config["num-partitions"] + local_epochs = context.run_config["local-epochs"] + trainloader, valloader = load_data(partition_id, num_partitions) + return FlowerClient( + partition_id, net, trainloader, valloader, local_epochs + ).to_client() + + + # Create the ClientApp + app = ClientApp(client_fn=client_fn) + +Next, we update the ``pyproject.toml`` so that Flower uses the new module: + +.. code-block:: toml + + [tool.flwr.app.components] + serverapp = "flower_tutorial.server_app:app" + clientapp = "flower_tutorial.custom_client_app:app" + +Before we discuss the code in more detail, let’s try to run it! Gotta make sure our new +``Client``-based client works, right? We run the simulation as follows: + +.. code-block:: shell + + $ flwr run . 
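+
+As a quick sanity check (a hypothetical snippet, not one of the generated project
+files), you can also exercise the new class directly in a Python shell, assuming the
+``flower-tutorial`` project is installed with ``pip install -e .``. Note how the
+``Client`` method takes a single ``*Ins`` object, returns a single ``*Res`` object, and
+that ``res.parameters`` already holds serialized bytes:
+
+.. code-block:: python
+
+    from flwr.common import GetParametersIns
+
+    from flower_tutorial.custom_client_app import FlowerClient
+    from flower_tutorial.task import Net, load_data
+
+    # Manually build one client for partition 0 (Flower normally does this on demand)
+    trainloader, valloader = load_data(partition_id=0, num_partitions=10)
+    client = FlowerClient(0, Net(), trainloader, valloader, local_epochs=1)
+
+    res = client.get_parameters(GetParametersIns(config={}))
+    print(res.status.code)  # Code.OK
+    print(len(res.parameters.tensors))  # one bytes object per model tensor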
+
+That’s it, we’re now using ``Client``. It probably looks similar to what we’ve done with
+``NumPyClient``. So what’s the difference?
+
+First of all, it’s more code. But why? The difference comes from the fact that
+``Client`` expects us to take care of parameter serialization and deserialization. For
+Flower to be able to send parameters over the network, it eventually needs to turn these
+parameters into ``bytes``. Turning parameters (e.g., NumPy ``ndarray``\ ’s) into raw
+bytes is called serialization. Turning raw bytes into something more useful (like NumPy
+``ndarray``\ ’s) is called deserialization. Flower needs to do both: it needs to
+serialize parameters on the server-side and send them to the client, the client needs to
+deserialize them to use them for local training, and then serialize the updated
+parameters again to send them back to the server, which (finally!) deserializes them
+again in order to aggregate them with the updates received from other clients.
+
+The only *real* difference between ``Client`` and ``NumPyClient`` is that ``NumPyClient``
+takes care of serialization and deserialization for you. It can do so because it expects
+you to return parameters as NumPy ndarray’s, and it knows how to handle these. This
+makes working with machine learning libraries that have good NumPy support (most of
+them) a breeze.
+
+In terms of API, there’s one major difference: all methods in ``Client`` take exactly one
+argument (e.g., ``FitIns`` in ``Client.fit``) and return exactly one value (e.g.,
+``FitRes`` in ``Client.fit``). The methods in ``NumPyClient`` on the other hand have
+multiple arguments (e.g., ``parameters`` and ``config`` in ``NumPyClient.fit``) and
+multiple return values (e.g., ``parameters``, ``num_examples``, and ``metrics`` in
+``NumPyClient.fit``) if there are multiple things to handle. These ``*Ins`` and ``*Res``
+objects in ``Client`` wrap all the individual values you’re used to from
+``NumPyClient``.
+
+Step 3: Custom serialization
+----------------------------
+
+Here we will explore how to implement custom serialization with a simple example.
+
+But first, what is serialization? Serialization is just the process of converting an
+object into raw bytes, and, equally as important, deserialization is the process of
+converting raw bytes back into an object. This is very useful for network communication.
+Indeed, without serialization, you could not just send a Python object over the
+internet.
+
+Federated Learning relies heavily on internet communication for training by sending
+Python objects back and forth between the clients and the server. This means that
+serialization is an essential part of Federated Learning.
+
+In the following section, we will write a basic example where, instead of sending a
+serialized version of our ``ndarray``\ s containing our parameters, we will first
+convert the ``ndarray`` into sparse matrices before sending them. This technique can be
+used to save bandwidth: in certain cases where the weights of a model are sparse
+(containing many 0 entries), converting them to a sparse matrix can greatly reduce
+their size in bytes.
+
+Our custom serialization/deserialization functions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This is where the real serialization/deserialization will happen, especially in
+``ndarray_to_sparse_bytes`` for serialization and ``sparse_bytes_to_ndarray`` for
+deserialization. First, we add the following code to ``task.py``:
+
+.. code-block:: python
+
+    from io import BytesIO
+    from typing import cast
+
+    import numpy as np
+    import torch  # needed below for the sparse CSR conversions
+
+    from flwr.common.typing import NDArray, NDArrays, Parameters
+
+
+    def ndarrays_to_sparse_parameters(ndarrays: NDArrays) -> Parameters:
+        """Convert NumPy ndarrays to parameters object."""
+        tensors = [ndarray_to_sparse_bytes(ndarray) for ndarray in ndarrays]
+        return Parameters(tensors=tensors, tensor_type="numpy.ndarray")
+
+
+    def sparse_parameters_to_ndarrays(parameters: Parameters) -> NDArrays:
+        """Convert parameters object to NumPy ndarrays."""
+        return [sparse_bytes_to_ndarray(tensor) for tensor in parameters.tensors]
+
+
+    def ndarray_to_sparse_bytes(ndarray: NDArray) -> bytes:
+        """Serialize NumPy ndarray to bytes."""
+        bytes_io = BytesIO()
+
+        if len(ndarray.shape) > 1:
+            # We convert our ndarray into a sparse matrix
+            ndarray = torch.tensor(ndarray).to_sparse_csr()
+
+            # And send it by utilizing the sparse matrix attributes
+            # WARNING: NEVER set allow_pickle to true.
+            # Reason: loading pickled data can execute arbitrary code
+            # Source: https://numpy.org/doc/stable/reference/generated/numpy.save.html
+            np.savez(
+                bytes_io,  # type: ignore
+                crow_indices=ndarray.crow_indices(),
+                col_indices=ndarray.col_indices(),
+                values=ndarray.values(),
+                allow_pickle=False,
+            )
+        else:
+            # WARNING: NEVER set allow_pickle to true.
+            # Reason: loading pickled data can execute arbitrary code
+            # Source: https://numpy.org/doc/stable/reference/generated/numpy.save.html
+            np.save(bytes_io, ndarray, allow_pickle=False)
+        return bytes_io.getvalue()
+
+
+    def sparse_bytes_to_ndarray(tensor: bytes) -> NDArray:
+        """Deserialize NumPy ndarray from bytes."""
+        bytes_io = BytesIO(tensor)
+        # WARNING: NEVER set allow_pickle to true.
+        # Reason: loading pickled data can execute arbitrary code
+        # Source: https://numpy.org/doc/stable/reference/generated/numpy.load.html
+        loader = np.load(bytes_io, allow_pickle=False)  # type: ignore
+
+        if "crow_indices" in loader:
+            # We convert our sparse matrix back to a ndarray, using the attributes we sent
+            ndarray_deserialized = (
+                torch.sparse_csr_tensor(
+                    crow_indices=loader["crow_indices"],
+                    col_indices=loader["col_indices"],
+                    values=loader["values"],
+                )
+                .to_dense()
+                .numpy()
+            )
+        else:
+            ndarray_deserialized = loader
+        return cast(NDArray, ndarray_deserialized)
+
+Client-side
+~~~~~~~~~~~
+
+To be able to serialize our ``ndarray``\ s into sparse parameters, we will just have to
+call our custom functions in our ``flwr.client.Client``.
+
+Indeed, in ``get_parameters`` we need to serialize the parameters we got from our
+network using our custom ``ndarrays_to_sparse_parameters`` defined above.
+
+In ``fit``, we first need to deserialize the parameters coming from the server using our
+custom ``sparse_parameters_to_ndarrays`` and then we need to serialize our local results
+with ``ndarrays_to_sparse_parameters``.
+
+In ``evaluate``, we will only need to deserialize the global parameters with our custom
+function. In a new file called ``serde_client_app.py``, copy the following code into it:
+
+.. 
code-block:: python + + from typing import List + + import numpy as np + import torch + from flwr.client import Client, ClientApp + from flwr.common import ( + Code, + Context, + EvaluateIns, + EvaluateRes, + FitIns, + FitRes, + GetParametersIns, + GetParametersRes, + Status, + ) + + from flower_tutorial.task import ( + Net, + get_weights, + load_data, + ndarrays_to_sparse_parameters, + set_weights, + sparse_parameters_to_ndarrays, + test, + train, + ) + + + class FlowerClient(Client): + def __init__(self, partition_id, net, trainloader, valloader, local_epochs): + self.partition_id = partition_id + self.net = net + self.trainloader = trainloader + self.valloader = valloader + self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") + self.local_epochs = local_epochs + + def get_parameters(self, ins: GetParametersIns) -> GetParametersRes: + print(f"[Client {self.partition_id}] get_parameters") + + # Get parameters as a list of NumPy ndarray's + ndarrays: List[np.ndarray] = get_weights(self.net) + + # Serialize ndarray's into a Parameters object using our custom function + parameters = ndarrays_to_sparse_parameters(ndarrays) + + # Build and return response + status = Status(code=Code.OK, message="Success") + return GetParametersRes( + status=status, + parameters=parameters, + ) + + def fit(self, ins: FitIns) -> FitRes: + print(f"[Client {self.partition_id}] fit, config: {ins.config}") + + # Deserialize parameters to NumPy ndarray's using our custom function + parameters_original = ins.parameters + ndarrays_original = sparse_parameters_to_ndarrays(parameters_original) + + # Update local model, train, get updated parameters + set_weights(self.net, ndarrays_original) + train(self.net, self.trainloader, self.local_epochs, self.device) + ndarrays_updated = get_weights(self.net) + + # Serialize ndarray's into a Parameters object using our custom function + parameters_updated = ndarrays_to_sparse_parameters(ndarrays_updated) + + # Build and return response + status = Status(code=Code.OK, message="Success") + return FitRes( + status=status, + parameters=parameters_updated, + num_examples=len(self.trainloader), + metrics={}, + ) + + def evaluate(self, ins: EvaluateIns) -> EvaluateRes: + print(f"[Client {self.partition_id}] evaluate, config: {ins.config}") + + # Deserialize parameters to NumPy ndarray's using our custom function + parameters_original = ins.parameters + ndarrays_original = sparse_parameters_to_ndarrays(parameters_original) + + set_weights(self.net, ndarrays_original) + loss, accuracy = test(self.net, self.valloader, self.device) + + # Build and return response + status = Status(code=Code.OK, message="Success") + return EvaluateRes( + status=status, + loss=float(loss), + num_examples=len(self.valloader), + metrics={"accuracy": float(accuracy)}, + ) + + + def client_fn(context: Context) -> Client: + net = Net() + partition_id = context.node_config["partition-id"] + num_partitions = context.node_config["num-partitions"] + local_epochs = context.run_config["local-epochs"] + trainloader, valloader = load_data(partition_id, num_partitions) + return FlowerClient( + partition_id, net, trainloader, valloader, local_epochs + ).to_client() + + + # Create the ClientApp + app = ClientApp(client_fn=client_fn) + +Server-side +~~~~~~~~~~~ + +For this example, we will just use ``FedAvg`` as a strategy. To change the serialization +and deserialization here, we only need to reimplement the ``evaluate`` and +``aggregate_fit`` functions of ``FedAvg``. 
The other functions of the strategy will be
+inherited from the superclass ``FedAvg``.
+
+As you can see, only one line has changed in ``evaluate``:
+
+.. code-block:: python
+
+    parameters_ndarrays = sparse_parameters_to_ndarrays(parameters)
+
+And for ``aggregate_fit``, we will first deserialize every result we received:
+
+.. code-block:: python
+
+    weights_results = [
+        (sparse_parameters_to_ndarrays(fit_res.parameters), fit_res.num_examples)
+        for _, fit_res in results
+    ]
+
+And then serialize the aggregated result:
+
+.. code-block:: python
+
+    parameters_aggregated = ndarrays_to_sparse_parameters(aggregate(weights_results))
+
+In a new file called ``strategy.py``, copy the following code into it:
+
+.. code-block:: python
+
+    from logging import WARNING
+    from typing import Callable, Dict, List, Optional, Tuple, Union
+
+    from flwr.common import FitRes, MetricsAggregationFn, NDArrays, Parameters, Scalar
+    from flwr.common.logger import log
+    from flwr.server.client_proxy import ClientProxy
+    from flwr.server.strategy import FedAvg
+    from flwr.server.strategy.aggregate import aggregate
+
+    from flower_tutorial.task import (
+        ndarrays_to_sparse_parameters,
+        sparse_parameters_to_ndarrays,
+    )
+
+    WARNING_MIN_AVAILABLE_CLIENTS_TOO_LOW = """
+    Setting `min_available_clients` lower than `min_fit_clients` or
+    `min_evaluate_clients` can cause the server to fail when there are too few clients
+    connected to the server. `min_available_clients` must be set to a value larger
+    than or equal to the values of `min_fit_clients` and `min_evaluate_clients`.
+    """
+
+
+    class FedSparse(FedAvg):
+        def __init__(
+            self,
+            *,
+            fraction_fit: float = 1.0,
+            fraction_evaluate: float = 1.0,
+            min_fit_clients: int = 2,
+            min_evaluate_clients: int = 2,
+            min_available_clients: int = 2,
+            evaluate_fn: Optional[
+                Callable[
+                    [int, NDArrays, Dict[str, Scalar]],
+                    Optional[Tuple[float, Dict[str, Scalar]]],
+                ]
+            ] = None,
+            on_fit_config_fn: Optional[Callable[[int], Dict[str, Scalar]]] = None,
+            on_evaluate_config_fn: Optional[Callable[[int], Dict[str, Scalar]]] = None,
+            accept_failures: bool = True,
+            initial_parameters: Optional[Parameters] = None,
+            fit_metrics_aggregation_fn: Optional[MetricsAggregationFn] = None,
+            evaluate_metrics_aggregation_fn: Optional[MetricsAggregationFn] = None,
+        ) -> None:
+            """Custom FedAvg strategy with sparse matrices.
+
+            Parameters
+            ----------
+            fraction_fit : float, optional
+                Fraction of clients used during training. Defaults to 1.0.
+            fraction_evaluate : float, optional
+                Fraction of clients used during validation. Defaults to 1.0.
+            min_fit_clients : int, optional
+                Minimum number of clients used during training. Defaults to 2.
+            min_evaluate_clients : int, optional
+                Minimum number of clients used during validation. Defaults to 2.
+            min_available_clients : int, optional
+                Minimum number of total clients in the system. Defaults to 2.
+            evaluate_fn : Optional[Callable[[int, NDArrays, Dict[str, Scalar]], Optional[Tuple[float, Dict[str, Scalar]]]]]
+                Optional function used for validation. Defaults to None.
+            on_fit_config_fn : Callable[[int], Dict[str, Scalar]], optional
+                Function used to configure training. Defaults to None.
+            on_evaluate_config_fn : Callable[[int], Dict[str, Scalar]], optional
+                Function used to configure validation. Defaults to None.
+            accept_failures : bool, optional
+                Whether or not to accept rounds containing failures. Defaults to True.
+            initial_parameters : Parameters, optional
+                Initial global model parameters.
+ """ + + if ( + min_fit_clients > min_available_clients + or min_evaluate_clients > min_available_clients + ): + log(WARNING, WARNING_MIN_AVAILABLE_CLIENTS_TOO_LOW) + + super().__init__( + fraction_fit=fraction_fit, + fraction_evaluate=fraction_evaluate, + min_fit_clients=min_fit_clients, + min_evaluate_clients=min_evaluate_clients, + min_available_clients=min_available_clients, + evaluate_fn=evaluate_fn, + on_fit_config_fn=on_fit_config_fn, + on_evaluate_config_fn=on_evaluate_config_fn, + accept_failures=accept_failures, + initial_parameters=initial_parameters, + fit_metrics_aggregation_fn=fit_metrics_aggregation_fn, + evaluate_metrics_aggregation_fn=evaluate_metrics_aggregation_fn, + ) + + def evaluate( + self, server_round: int, parameters: Parameters + ) -> Optional[Tuple[float, Dict[str, Scalar]]]: + """Evaluate model parameters using an evaluation function.""" + if self.evaluate_fn is None: + # No evaluation function provided + return None + + # We deserialize using our custom method + parameters_ndarrays = sparse_parameters_to_ndarrays(parameters) + + eval_res = self.evaluate_fn(server_round, parameters_ndarrays, {}) + if eval_res is None: + return None + loss, metrics = eval_res + return loss, metrics + + def aggregate_fit( + self, + server_round: int, + results: List[Tuple[ClientProxy, FitRes]], + failures: List[Union[Tuple[ClientProxy, FitRes], BaseException]], + ) -> Tuple[Optional[Parameters], Dict[str, Scalar]]: + """Aggregate fit results using weighted average.""" + if not results: + return None, {} + # Do not aggregate if there are failures and failures are not accepted + if not self.accept_failures and failures: + return None, {} + + # We deserialize each of the results with our custom method + weights_results = [ + (sparse_parameters_to_ndarrays(fit_res.parameters), fit_res.num_examples) + for _, fit_res in results + ] + + # We serialize the aggregated result using our custom method + parameters_aggregated = ndarrays_to_sparse_parameters( + aggregate(weights_results) + ) + + # Aggregate custom metrics if aggregation fn was provided + metrics_aggregated = {} + if self.fit_metrics_aggregation_fn: + fit_metrics = [(res.num_examples, res.metrics) for _, res in results] + metrics_aggregated = self.fit_metrics_aggregation_fn(fit_metrics) + elif server_round == 1: # Only log this warning once + log(WARNING, "No fit_metrics_aggregation_fn provided") + + return parameters_aggregated, metrics_aggregated + +We can now import our new ``FedSparse`` strategy into ``server_app.py`` and update our +``server_fn`` to use it: + +.. code-block:: python + + from flower_tutorial.strategy import FedSparse + + + def server_fn(context: Context): + # Read from config + num_rounds = context.run_config["num-server-rounds"] + config = ServerConfig(num_rounds=num_rounds) + + return ServerAppComponents( + strategy=FedSparse(), config=config # <-- pass the new strategy here + ) + + + # Create ServerApp + app = ServerApp(server_fn=server_fn) + +Finally, we run the simulation. + +.. code-block:: shell + + $ flwr run . + +Recap +----- + +In this part of the tutorial, we’ve seen how we can build clients by subclassing either +``NumPyClient`` or ``Client``. ``NumPyClient`` is a convenience abstraction that makes +it easier to work with machine learning libraries that have good NumPy interoperability. +``Client`` is a more flexible abstraction that allows us to do things that are not +possible in ``NumPyClient``. 
In order to do so, it requires us to handle parameter
+serialization and deserialization ourselves.
+
+.. note::
+
+    If you'd like to follow along with tutorial notebooks, check out the :doc:`Tutorial
+    notebooks <notebooks/index>`. Note that the notebooks use the ``run_simulation``
+    approach, whereas the recommended way to run simulations in Flower is using the
+    ``flwr run`` approach as shown in this tutorial.
+
+Next steps
+----------
+
+Before you continue, make sure to join the Flower community on Flower Discuss (`Join
+Flower Discuss <https://discuss.flower.ai/>`__) and on Slack (`Join Slack
+<https://flower.ai/join-slack>`__).
+
+There’s a dedicated ``#questions`` channel if you need help, but we’d also love to hear
+who you are in ``#introductions``!
+
+This is the final part of the Flower tutorial (for now!), congratulations! You’re now
+well equipped to understand the rest of the documentation. There are many topics we
+didn’t cover in the tutorial; we recommend the following resources:
+
+- `Read Flower Docs <https://flower.ai/docs/>`__
+- `Check out Flower Code Examples <https://flower.ai/docs/examples/>`__
+- `Use Flower Baselines for your research <https://flower.ai/docs/baselines/>`__
+- `Watch Flower AI Summit 2024 videos `__
diff --git a/framework/docs/source/tutorial-series-get-started-with-flower-pytorch.rst b/framework/docs/source/tutorial-series-get-started-with-flower-pytorch.rst
new file mode 100644
index 000000000000..d4e6b18abdb9
--- /dev/null
+++ b/framework/docs/source/tutorial-series-get-started-with-flower-pytorch.rst
@@ -0,0 +1,637 @@
+Get started with Flower
+=======================
+
+Welcome to the Flower federated learning tutorial!
+
+In this tutorial, we’ll build a federated learning system using the Flower framework,
+Flower Datasets, and PyTorch. In part 1, we use PyTorch for the model training pipeline
+and data loading. In part 2, we federate the PyTorch project using Flower.
+
+    `Star Flower on GitHub <https://github.com/adap/flower>`__ ⭐️ and join the Flower
+    community on Flower Discuss and the Flower Slack to connect, ask questions, and get
+    help: - `Join Flower Discuss <https://discuss.flower.ai/>`__ We’d love to hear from
+    you in the ``Introduction`` topic! If anything is unclear, post in ``Flower Help -
+    Beginners``. - `Join Flower Slack <https://flower.ai/join-slack>`__ We’d love to
+    hear from you in the ``#introductions`` channel! If anything is unclear, head over
+    to the ``#questions`` channel.
+
+Let’s get started! 🌼
+
+Step 0: Preparation
+-------------------
+
+Before we begin with any actual code, let’s make sure that we have everything we need.
+
+Install dependencies
+~~~~~~~~~~~~~~~~~~~~
+
+First, we install the Flower package ``flwr``:
+
+.. code-block:: shell
+
+    # In a new Python environment
+    $ pip install -U flwr
+
+Then, we create a new Flower app called ``flower-tutorial`` using the PyTorch template.
+We also specify a username (``flwrlabs``) for the project:
+
+.. code-block:: shell
+
+    $ flwr new flower-tutorial --framework pytorch --username flwrlabs
+
+After running the command, a new directory called ``flower-tutorial`` will be created.
+It should have the following structure:
+
+.. code-block:: shell
+
+    flower-tutorial
+    ├── README.md
+    ├── flower_tutorial
+    │   ├── __init__.py
+    │   ├── client_app.py   # Defines your ClientApp
+    │   ├── server_app.py   # Defines your ServerApp
+    │   └── task.py         # Defines your model, training and data loading
+    └── pyproject.toml      # Project metadata like dependencies and configs
+
+Next, we install the project and its dependencies, which are specified in the
+``pyproject.toml`` file. For this tutorial, we'll also need ``matplotlib``, so we'll
+also install it:
+
+.. code-block:: shell
+
+    $ cd flower-tutorial
+    $ pip install -e . matplotlib
+
+Before we dive into federated learning, we'll take a look at the dataset that we'll be
+using for this tutorial, which is the `CIFAR-10
+<https://www.cs.toronto.edu/~kriz/cifar.html>`_ dataset, and run a simple centralized
+training pipeline using PyTorch.
+
+The ``CIFAR-10`` dataset
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Federated learning can be applied to many different types of tasks across different
+domains. In this tutorial, we introduce federated learning by training a simple
+convolutional neural network (CNN) on the popular CIFAR-10 dataset. CIFAR-10 can be used
+to train image classifiers that distinguish between images from ten different classes:
+‘airplane’, ‘automobile’, ‘bird’, ‘cat’, ‘deer’, ‘dog’, ‘frog’, ‘horse’, ‘ship’, and
+‘truck’.
+
+We simulate having multiple datasets from multiple organizations (also called the
+“cross-silo” setting in federated learning) by splitting the original CIFAR-10 dataset
+into multiple partitions. Each partition will represent the data from a single
+organization. We’re doing this purely for experimentation purposes; in the real world,
+there’s no need for data splitting because each organization already has their own data
+(the data is naturally partitioned).
+
+Each organization will act as a client in the federated learning system. Having ten
+organizations participate in a federation means having ten clients connected to the
+federated learning server.
+
+We use the `Flower Datasets <https://flower.ai/docs/datasets/>`_ library
+(``flwr-datasets``) to partition CIFAR-10 into ten partitions using
+``FederatedDataset``. Using the ``load_data()`` function defined in ``task.py``, we will
+create a small training and test set for each of the ten organizations and wrap each of
+these into a PyTorch ``DataLoader``:
+
+.. code-block:: python
+
+    def load_data(partition_id: int, num_partitions: int):
+        """Load partition CIFAR10 data."""
+        # Only initialize `FederatedDataset` once
+        global fds
+        if fds is None:
+            partitioner = IidPartitioner(num_partitions=num_partitions)
+            fds = FederatedDataset(
+                dataset="uoft-cs/cifar10",
+                partitioners={"train": partitioner},
+            )
+        partition = fds.load_partition(partition_id)
+        # Divide data on each node: 80% train, 20% test
+        partition_train_test = partition.train_test_split(test_size=0.2, seed=42)
+        pytorch_transforms = Compose(
+            [ToTensor(), Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]
+        )
+
+        def apply_transforms(batch):
+            """Apply transforms to the partition from FederatedDataset."""
+            batch["img"] = [pytorch_transforms(img) for img in batch["img"]]
+            return batch
+
+        partition_train_test = partition_train_test.with_transform(apply_transforms)
+        trainloader = DataLoader(partition_train_test["train"], batch_size=32, shuffle=True)
+        testloader = DataLoader(partition_train_test["test"], batch_size=32)
+        return trainloader, testloader
+
+We now have a function that can return a training set and validation set
+(``trainloader`` and ``valloader``) representing one dataset from one of ten different
+organizations. Each ``trainloader``/``valloader`` pair contains 4000 training examples
+and 1000 validation examples. There’s also a single ``testloader`` (we did not split the
+test set). Again, this is only necessary for building research or educational systems;
+actual federated learning systems have their data naturally distributed across multiple
+partitions.
+
+Let’s take a look at the first batch of images and labels in the first training set
+(i.e., ``trainloader`` from ``partition_id=0``) before we move on.
Copy this code block
+into a new Python script ``plot.py`` and execute it with ``python plot.py``:
+
+.. code-block:: python
+
+    from matplotlib import pyplot as plt
+
+    from flower_tutorial.task import load_data
+
+    trainloader, _ = load_data(partition_id=0, num_partitions=10)
+    batch = next(iter(trainloader))
+    images, labels = batch["img"], batch["label"]
+
+    # Reshape and convert images to a NumPy array
+    # matplotlib requires images with the shape (height, width, 3)
+    images = images.permute(0, 2, 3, 1).numpy()
+
+    # Denormalize
+    images = images / 2 + 0.5
+
+    # Create a figure and a grid of subplots
+    fig, axs = plt.subplots(4, 8, figsize=(12, 6))
+
+    # Loop over the images and plot them
+    for i, ax in enumerate(axs.flat):
+        ax.imshow(images[i])
+        ax.set_title(trainloader.dataset.features["label"].int2str([labels[i]])[0])
+        ax.axis("off")
+
+    # Show the plot
+    fig.tight_layout()
+    plt.show()
+
+The output from running the script above shows a random batch of images from the
+``trainloader`` from the first of ten partitions. It also prints the labels associated
+with each image (i.e., one of the ten possible labels we’ve seen above). If you run the
+script again, you should see another batch of images.
+
+Step 1: Centralized Training with PyTorch
+-----------------------------------------
+
+Next, we’re going to use PyTorch to define a simple convolutional neural network. This
+introduction assumes basic familiarity with PyTorch, so it doesn’t cover the
+PyTorch-related aspects in full detail. If you want to dive deeper into PyTorch, we
+recommend `Deep Learning with PyTorch: A 60 Minute Blitz
+<https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html>`_.
+
+The model
+~~~~~~~~~
+
+We will use the simple CNN described in the `PyTorch tutorial
+<https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html>`__
+(the following code is already defined in ``task.py``):
+
+.. code-block:: python
+
+    class Net(nn.Module):
+        """Model (simple CNN adapted from 'PyTorch: A 60 Minute Blitz')"""
+
+        def __init__(self):
+            super(Net, self).__init__()
+            self.conv1 = nn.Conv2d(3, 6, 5)
+            self.pool = nn.MaxPool2d(2, 2)
+            self.conv2 = nn.Conv2d(6, 16, 5)
+            self.fc1 = nn.Linear(16 * 5 * 5, 120)
+            self.fc2 = nn.Linear(120, 84)
+            self.fc3 = nn.Linear(84, 10)
+
+        def forward(self, x):
+            x = self.pool(F.relu(self.conv1(x)))
+            x = self.pool(F.relu(self.conv2(x)))
+            x = x.view(-1, 16 * 5 * 5)
+            x = F.relu(self.fc1(x))
+            x = F.relu(self.fc2(x))
+            return self.fc3(x)
+
+The PyTorch template has also provided us with the usual training and test functions:
+
+.. 
code-block:: python
+
+    def train(net, trainloader, epochs, device):
+        """Train the model on the training set."""
+        net.to(device)  # move model to GPU if available
+        criterion = torch.nn.CrossEntropyLoss().to(device)
+        optimizer = torch.optim.Adam(net.parameters(), lr=0.01)
+        net.train()
+        running_loss = 0.0
+        for _ in range(epochs):
+            for batch in trainloader:
+                images = batch["img"]
+                labels = batch["label"]
+                optimizer.zero_grad()
+                loss = criterion(net(images.to(device)), labels.to(device))
+                loss.backward()
+                optimizer.step()
+                running_loss += loss.item()
+
+        # Average the accumulated loss over all batches in all epochs
+        avg_trainloss = running_loss / (epochs * len(trainloader))
+        return avg_trainloss
+
+
+    def test(net, testloader, device):
+        """Validate the model on the test set."""
+        net.to(device)
+        criterion = torch.nn.CrossEntropyLoss()
+        correct, loss = 0, 0.0
+        with torch.no_grad():
+            for batch in testloader:
+                images = batch["img"].to(device)
+                labels = batch["label"].to(device)
+                outputs = net(images)
+                loss += criterion(outputs, labels).item()
+                correct += (torch.max(outputs.data, 1)[1] == labels).sum().item()
+        accuracy = correct / len(testloader.dataset)
+        loss = loss / len(testloader)
+        return loss, accuracy
+
+Train the model
+~~~~~~~~~~~~~~~
+
+We now have all the basic building blocks we need: a dataset, a model, a training
+function, and a test function. Let’s put them together to train the model on the dataset
+of one of our organizations (``partition_id=0``). This simulates the reality of most
+machine learning projects today: each organization has its own data and trains models
+only on this internal data.
+
+First, we'll create a new script called ``centralized.py`` and copy the following code
+into it:
+
+.. code-block:: python
+
+    import torch
+
+    from flower_tutorial.task import Net, load_data, test, train
+
+    DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
+
+    trainloader, testloader = load_data(partition_id=0, num_partitions=10)
+    net = Net().to(DEVICE)
+
+    for epoch in range(5):
+        train(net, trainloader, 1, DEVICE)
+        loss, accuracy = test(net, testloader, DEVICE)
+        print(f"Epoch {epoch+1}: validation loss {loss}, accuracy {accuracy}")
+
+Training the simple CNN on our CIFAR-10 split for 5 epochs should result in a validation
+set accuracy of about 41%, which is not good, but at the same time, it doesn’t really
+matter for the purposes of this tutorial. The intent was just to show a simple
+centralized training pipeline that sets the stage for what comes next - federated
+learning!
+
+Step 2: Federated Learning with Flower
+--------------------------------------
+
+Step 1 demonstrated a simple centralized training pipeline. All data was in one place
+(i.e., a single ``trainloader`` and a single ``testloader``). Next, we’ll simulate a
+situation where we have multiple datasets in multiple organizations and where we train a
+model over these organizations using federated learning.
+
+Update model parameters
+~~~~~~~~~~~~~~~~~~~~~~~
+
+In federated learning, the server sends global model parameters to the client, and the
+client updates the local model with parameters received from the server. It then trains
+the model on the local data (which changes the model parameters locally) and sends the
+updated/changed model parameters back to the server (or, alternatively, it sends just
+the gradients back to the server, not the full model parameters).
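+
+Conceptually, one round of federated learning looks like this from the client’s
+perspective. The sketch below is purely illustrative (it is not Flower API); it uses
+the ``train`` function from above and the two helper functions introduced next:
+
+.. code-block:: python
+
+    # Illustrative sketch of one federated round on a client (not Flower API);
+    # `global_parameters` arrive from the server as a list of NumPy ndarrays
+    set_weights(net, global_parameters)  # 1. adopt the global model
+    train(net, trainloader, epochs=1, device=device)  # 2. train on local data
+    updated_parameters = get_weights(net)  # 3. extract the updated weights
+    # 4. `updated_parameters` are then sent back to the server for aggregation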
+
+We need two helper functions to get the updated model parameters from the local model
+and to update the local model with parameters received from the server: ``get_weights``
+and ``set_weights``. The following two functions do just that for the PyTorch model
+above and are predefined in ``task.py``.
+
+The details of how this works are not really important here (feel free to consult the
+PyTorch documentation if you want to learn more). In essence, we use ``state_dict`` to
+access PyTorch model parameter tensors. The parameter tensors are then converted to/from
+a list of NumPy ``ndarray``\s (which the Flower ``NumPyClient`` knows how to
+serialize/deserialize):
+
+.. code-block:: python
+
+    def get_weights(net):
+        return [val.cpu().numpy() for _, val in net.state_dict().items()]
+
+
+    def set_weights(net, parameters):
+        params_dict = zip(net.state_dict().keys(), parameters)
+        state_dict = OrderedDict({k: torch.tensor(v) for k, v in params_dict})
+        net.load_state_dict(state_dict, strict=True)
+
+Define the Flower ClientApp
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+With that out of the way, let’s move on to the interesting part. Federated learning
+systems consist of a server and multiple clients. In Flower, we create a ``ServerApp``
+and a ``ClientApp`` to run the server-side and client-side code, respectively.
+
+The first step toward creating a ``ClientApp`` is to implement a subclass of
+``flwr.client.Client`` or ``flwr.client.NumPyClient``. We use ``NumPyClient`` in this
+tutorial because it is easier to implement and requires us to write less boilerplate. To
+implement ``NumPyClient``, we create a subclass that implements the three methods
+``get_weights``, ``fit``, and ``evaluate``:
+
+- ``get_weights``: Return the current local model parameters
+- ``fit``: Receive model parameters from the server, train the model on the local data,
+  and return the updated model parameters to the server
+- ``evaluate``: Receive model parameters from the server, evaluate the model on the
+  local data, and return the evaluation result to the server
+
+(In this project, our ``FlowerClient`` implements only ``fit`` and ``evaluate``; the
+initial model parameters are provided on the server side instead, as we’ll see below.)
+
+We mentioned that our clients will use the previously defined PyTorch components for
+model training and evaluation. Let’s see a simple Flower client implementation that
+brings everything together. Note that all of this boilerplate implementation has already
+been done for us in our Flower project:
+
+.. code-block:: python
+
+    class FlowerClient(NumPyClient):
+        def __init__(self, net, trainloader, valloader, local_epochs):
+            self.net = net
+            self.trainloader = trainloader
+            self.valloader = valloader
+            self.local_epochs = local_epochs
+            self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
+            self.net.to(self.device)
+
+        def fit(self, parameters, config):
+            set_weights(self.net, parameters)
+            train_loss = train(
+                self.net,
+                self.trainloader,
+                self.local_epochs,
+                self.device,
+            )
+            return (
+                get_weights(self.net),
+                len(self.trainloader.dataset),
+                {"train_loss": train_loss},
+            )
+
+        def evaluate(self, parameters, config):
+            set_weights(self.net, parameters)
+            loss, accuracy = test(self.net, self.valloader, self.device)
+            return loss, len(self.valloader.dataset), {"accuracy": accuracy}
+
+Our class ``FlowerClient`` defines how local training/evaluation will be performed and
+allows Flower to call the local training/evaluation through ``fit`` and ``evaluate``.
+Each instance of ``FlowerClient`` represents a *single client* in our federated learning
+system.
Federated learning systems have multiple clients (otherwise, there’s not much to +federate), so each client will be represented by its own instance of ``FlowerClient``. +If we have, for example, three clients in our workload, then we’d have three instances +of ``FlowerClient`` (one on each of the machines we’d start the client on). Flower calls +``FlowerClient.fit`` on the respective instance when the server selects a particular +client for training (and ``FlowerClient.evaluate`` for evaluation). + +In this project, we want to simulate a federated learning system with 10 clients *on a +single machine*. This means that the server and all 10 clients will live on a single +machine and share resources such as CPU, GPU, and memory. Having 10 clients would mean +having 10 instances of ``FlowerClient`` in memory. Doing this on a single machine can +quickly exhaust the available memory resources, even if only a subset of these clients +participates in a single round of federated learning. + +In addition to the regular capabilities where server and clients run on multiple +machines, Flower, therefore, provides special simulation capabilities that create +``FlowerClient`` instances only when they are actually necessary for training or +evaluation. To enable the Flower framework to create clients when necessary, we need to +implement a function that creates a ``FlowerClient`` instance on demand. We typically +call this function ``client_fn``. Flower calls ``client_fn`` whenever it needs an +instance of one particular client to call ``fit`` or ``evaluate`` (those instances are +usually discarded after use, so they should not keep any local state). In federated +learning experiments using Flower, clients are identified by a partition ID, or +``partition_id``. This ``partition_id`` is used to load different local data partitions +for different clients, as can be seen below. The value of ``partition_id`` is retrieved +from the ``node_config`` dictionary in the ``Context`` object, which holds the +information that persists throughout each training round. + +With this, we have the class ``FlowerClient`` which defines client-side +training/evaluation and ``client_fn`` which allows Flower to create ``FlowerClient`` +instances whenever it needs to call ``fit`` or ``evaluate`` on one particular client. +Last, but definitely not least, we create an instance of ``ClientApp`` and pass it the +``client_fn``. ``ClientApp`` is the entrypoint that a running Flower client uses to call +your code (as defined in, for example, ``FlowerClient.fit``). The following code is +reproduced from ``client_app.py`` with additional comments: + +.. 
code-block:: python
+
+    def client_fn(context: Context):
+        # Load model and data
+        net = Net()
+        partition_id = context.node_config["partition-id"]
+        num_partitions = context.node_config["num-partitions"]
+        # Load data (CIFAR-10)
+        # Note: each client gets a different trainloader/valloader, so each client
+        # will train and evaluate on their own unique data partition
+        # Read the node_config to fetch the data partition associated with this node
+        trainloader, valloader = load_data(partition_id, num_partitions)
+        local_epochs = context.run_config["local-epochs"]
+
+        # Create a single Flower client representing a single organization
+        # FlowerClient is a subclass of NumPyClient, so we need to call .to_client()
+        # to convert it to an instance of `flwr.client.Client`
+        return FlowerClient(net, trainloader, valloader, local_epochs).to_client()
+
+
+    # Create the Flower ClientApp
+    app = ClientApp(client_fn=client_fn)
+
+Define the Flower ServerApp
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+On the server side, we need to configure a strategy which encapsulates the federated
+learning approach/algorithm, for example, *Federated Averaging* (FedAvg). Flower has a
+number of built-in strategies, but we can also use our own strategy implementations to
+customize nearly all aspects of the federated learning approach. For this example, we
+use the built-in ``FedAvg`` implementation and customize it using a few basic
+parameters:
+
+.. code-block:: python
+
+    # Create FedAvg strategy
+    strategy = FedAvg(
+        fraction_fit=fraction_fit,  # Sample this fraction of available clients for training
+        fraction_evaluate=1.0,  # Sample 100% of available clients for evaluation
+        min_available_clients=2,  # Wait until 2 clients are available
+        initial_parameters=parameters,  # Use these initial model parameters
+    )
+
+Similar to ``ClientApp``, we create a ``ServerApp`` using a utility function
+``server_fn``. This function is predefined for us in ``server_app.py``. In
+``server_fn``, we pass an instance of ``ServerConfig`` for defining the number of
+federated learning rounds (``num_rounds``) and we also pass the previously created
+``strategy``. The ``server_fn`` returns a ``ServerAppComponents`` object containing the
+settings that define the ``ServerApp`` behaviour. ``ServerApp`` is the entrypoint that
+Flower uses to call all your server-side code (for example, the strategy).
+
+.. code-block:: python
+
+    def server_fn(context: Context):
+        """Construct components that set the ServerApp behaviour.
+
+        You can use the settings in `context.run_config` to parameterize the
+        construction of all elements (e.g., the strategy or the number of rounds)
+        wrapped in the returned ServerAppComponents object.
+        """
+        # Read from config
+        num_rounds = context.run_config["num-server-rounds"]
+        fraction_fit = context.run_config["fraction-fit"]
+
+        # Initialize model parameters
+        ndarrays = get_weights(Net())
+        parameters = ndarrays_to_parameters(ndarrays)
+
+        # Define strategy
+        strategy = FedAvg(
+            fraction_fit=fraction_fit,
+            fraction_evaluate=1.0,
+            min_available_clients=2,
+            initial_parameters=parameters,
+        )
+        config = ServerConfig(num_rounds=num_rounds)
+
+        return ServerAppComponents(strategy=strategy, config=config)
+
+Run the training
+~~~~~~~~~~~~~~~~
+
+With all of these components in place, we can now run the federated learning simulation
+with Flower! The last step is to run our simulation from the command line, as follows:
+
+.. code-block:: shell
+
+    $ flwr run .
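+
+Running ``flwr run .`` picks up its configuration from the project’s ``pyproject.toml``.
+For orientation, a freshly generated project contains, among other things, sections like
+the following (a sketch; the exact values in your generated project may differ):
+
+.. code-block:: toml
+
+    [tool.flwr.app.config]
+    num-server-rounds = 3
+    fraction-fit = 0.5
+    local-epochs = 1
+
+    [tool.flwr.federations.local-simulation]
+    options.num-supernodes = 10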
+
+This will execute the federated learning simulation with 10 clients, or SuperNodes,
+defined in the ``[tool.flwr.federations.local-simulation]`` section in the
+``pyproject.toml``. You can also override the parameters defined in the
+``[tool.flwr.app.config]`` section in ``pyproject.toml`` like this:
+
+.. code-block:: shell
+
+    # Run the simulation with 5 server rounds and 3 local epochs
+    $ flwr run . --run-config "num-server-rounds=5 local-epochs=3"
+
+Behind the scenes
+~~~~~~~~~~~~~~~~~
+
+So how does this work? How does Flower execute this simulation?
+
+When we execute ``flwr run``, we tell Flower that there are 10 clients
+(``options.num-supernodes = 10``, where 1 ``SuperNode`` launches 1 ``ClientApp``).
+Flower then goes ahead and asks the ``ServerApp`` to issue instructions to those nodes
+using the ``FedAvg`` strategy. ``FedAvg`` knows that it should select 50% of the
+available clients (``fraction-fit=0.5``), so it goes ahead and selects 5 random clients
+(i.e., 50% of 10).
+
+Flower then asks the selected 5 clients to train the model. Each of the 5 ``ClientApp``
+instances receives a message, which causes it to call ``client_fn`` to create an
+instance of ``FlowerClient``. It then calls ``.fit()`` on its ``FlowerClient``
+instance and returns the resulting model parameter updates to the ``ServerApp``. When
+the ``ServerApp`` receives the model parameter updates from the clients, it hands those
+updates over to the strategy (*FedAvg*) for aggregation. The strategy aggregates those
+updates and returns the new global model, which then gets used in the next round of
+federated learning.
+
+Where’s the accuracy?
+~~~~~~~~~~~~~~~~~~~~~
+
+You may have noticed that all metrics except for ``losses_distributed`` are empty. Where
+did the ``{"accuracy": float(accuracy)}`` go?
+
+Flower can automatically aggregate losses returned by individual clients, but it cannot
+do the same for metrics in the generic metrics dictionary (the one with the ``accuracy``
+key). Metrics dictionaries can contain very different kinds of metrics and even
+key/value pairs that are not metrics at all, so the framework does not (and cannot)
+know how to handle these automatically.
+
+As users, we need to tell the framework how to handle/aggregate these custom metrics,
+and we do so by passing metric aggregation functions to the strategy. The strategy will
+then call these functions whenever it receives fit or evaluate metrics from clients. The
+two possible functions are ``fit_metrics_aggregation_fn`` and
+``evaluate_metrics_aggregation_fn``.
+
+Let’s create a simple weighted averaging function to aggregate the ``accuracy`` metric
+we return from ``evaluate``. Copy the following ``weighted_average()`` function to
+``task.py``, together with the imports it needs:
+
+.. code-block:: python
+
+    from typing import List, Tuple
+
+    from flwr.common import Metrics
+
+
+    def weighted_average(metrics: List[Tuple[int, Metrics]]) -> Metrics:
+        # Multiply accuracy of each client by number of examples used
+        accuracies = [num_examples * m["accuracy"] for num_examples, m in metrics]
+        examples = [num_examples for num_examples, _ in metrics]
+
+        # Aggregate and return custom metric (weighted average)
+        return {"accuracy": sum(accuracies) / sum(examples)}
+
+Now, in ``server_app.py``, we import the function and pass it to the ``FedAvg``
+strategy:
+
+.. code-block:: python
+
+    from flower_tutorial.task import weighted_average
+
+
+    def server_fn(context: Context):
+        # Read from config
+        num_rounds = context.run_config["num-server-rounds"]
+        fraction_fit = context.run_config["fraction-fit"]
+
+        # Initialize model parameters
+        ndarrays = get_weights(Net())
+        parameters = ndarrays_to_parameters(ndarrays)
+
+        # Define strategy
+        strategy = FedAvg(
+            fraction_fit=fraction_fit,
+            fraction_evaluate=1.0,
+            min_available_clients=2,
+            initial_parameters=parameters,
+            evaluate_metrics_aggregation_fn=weighted_average,
+        )
+        config = ServerConfig(num_rounds=num_rounds)
+
+        return ServerAppComponents(strategy=strategy, config=config)
+
+
+    # Create ServerApp
+    app = ServerApp(server_fn=server_fn)
+
+We now have a full system that performs federated training and federated evaluation. It
+uses the ``weighted_average`` function to aggregate custom evaluation metrics and
+calculates a single ``accuracy`` metric across all clients on the server side.
+
+The other two categories of metrics (``losses_centralized`` and ``metrics_centralized``)
+are still empty because they only apply when centralized evaluation is being used. Part
+two of the Flower tutorial will cover centralized evaluation.
+
+Final remarks
+-------------
+
+Congratulations, you just trained a convolutional neural network, federated over 10
+clients! With that, you understand the basics of federated learning with Flower. The
+same approach you’ve seen can be used with other machine learning frameworks (not just
+PyTorch) and tasks (not just CIFAR-10 image classification), for example, NLP with
+Hugging Face Transformers or speech with SpeechBrain.
+
+In the next tutorial, we’re going to cover some more advanced concepts. Want to
+customize your strategy? Initialize parameters on the server side? Or evaluate the
+aggregated model on the server side? We’ll cover all this and more in the next tutorial.
+
+Next steps
+----------
+
+Before you continue, make sure to join the Flower community on Flower Discuss (`Join
+Flower Discuss `__) and on Slack (`Join Slack
+`__).
+
+There’s a dedicated ``#questions`` channel if you need help, but we’d also love to hear
+who you are in ``#introductions``!
+
+The :doc:`Flower Federated Learning Tutorial - Part 2
+` goes into more depth about
+strategies and all the advanced things you can build with them.
diff --git a/framework/docs/source/tutorial-series-use-a-federated-learning-strategy-pytorch.rst b/framework/docs/source/tutorial-series-use-a-federated-learning-strategy-pytorch.rst
new file mode 100644
index 000000000000..2ccfd0ab646e
--- /dev/null
+++ b/framework/docs/source/tutorial-series-use-a-federated-learning-strategy-pytorch.rst
@@ -0,0 +1,421 @@
+Use a federated learning strategy
+=================================
+
+Welcome to the next part of the federated learning tutorial. In previous parts of this
+tutorial, we introduced federated learning with PyTorch and Flower (:doc:`part 1
+`).
+
+In part 2, we’ll begin to customize the federated learning system we built in part 1,
+again using the Flower framework, Flower Datasets, and PyTorch.
+
+    `Star Flower on GitHub `_ ⭐️ and join the Flower
+    community on Flower Discuss and the Flower Slack to connect, ask questions, and get
+    help:
+
+    - `Join Flower Discuss `_ We’d love to hear from
+      you in the ``Introduction`` topic! If anything is unclear, post in ``Flower Help -
+      Beginners``.
+    - `Join Flower Slack `_ We’d love to hear
+      from you in the ``#introductions`` channel!
If anything is unclear, head over to the ``#questions`` channel.
+
+Let’s move beyond FedAvg with Flower strategies! 🌼
+
+Preparation
+-----------
+
+Before we begin with the actual code, let’s make sure that we have everything we need.
+
+Installing dependencies
+~~~~~~~~~~~~~~~~~~~~~~~
+
+.. note::
+
+    If you've completed part 1 of the tutorial, you can skip this step.
+
+First, we install the Flower package ``flwr``:
+
+.. code-block:: shell
+
+    # In a new Python environment
+    $ pip install -U flwr
+
+Then, we create a new Flower app called ``flower-tutorial`` using the PyTorch template.
+We also specify a username (``flwrlabs``) for the project:
+
+.. code-block:: shell
+
+    $ flwr new flower-tutorial --framework pytorch --username flwrlabs
+
+After running the command, a new directory called ``flower-tutorial`` will be created.
+It should have the following structure:
+
+.. code-block:: shell
+
+    flower-tutorial
+    ├── flower_tutorial
+    │   ├── __init__.py
+    │   ├── client_app.py    # Defines your ClientApp
+    │   ├── server_app.py    # Defines your ServerApp
+    │   └── task.py          # Defines your model, training and data loading
+    ├── pyproject.toml       # Project metadata like dependencies and configs
+    └── README.md
+
+Next, we install the project and its dependencies, which are specified in the
+``pyproject.toml`` file:
+
+.. code-block:: shell
+
+    $ cd flower-tutorial
+    $ pip install -e .
+
+Strategy customization
+----------------------
+
+So far, everything should look familiar if you’ve worked through the introductory
+tutorial. With that, we’re ready to introduce a number of new features.
+
+Starting with a customized strategy
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In part 1, we created a ``ServerApp`` (in ``server_app.py``) using the ``server_fn``. In
+it, we defined the strategy and number of training rounds.
+
+The strategy encapsulates the federated learning approach/algorithm, for example,
+``FedAvg`` or ``FedAdagrad``. Let’s try to use a different strategy this time. Add this
+line to the top of your ``server_app.py``: ``from flwr.server.strategy import
+FedAdagrad`` and replace the ``server_fn()`` with the following code:
+
+.. code-block:: python
+
+    def server_fn(context: Context):
+        # Read from config
+        num_rounds = context.run_config["num-server-rounds"]
+        fraction_fit = context.run_config["fraction-fit"]
+
+        # Initialize model parameters
+        ndarrays = get_weights(Net())
+        parameters = ndarrays_to_parameters(ndarrays)
+
+        # Define strategy
+        strategy = FedAdagrad(
+            fraction_fit=fraction_fit,
+            fraction_evaluate=1.0,
+            min_available_clients=2,
+            initial_parameters=parameters,
+        )
+        config = ServerConfig(num_rounds=num_rounds)
+
+        return ServerAppComponents(strategy=strategy, config=config)
+
+Next, run the training with the following command:
+
+.. code-block:: shell
+
+    $ flwr run .
+
+Server-side parameter **evaluation**
+------------------------------------
+
+Flower can evaluate the aggregated model on the server side or on the client side.
+Client-side and server-side evaluation are similar in some ways, but different in
+others.
+
+**Centralized Evaluation** (or *server-side evaluation*) is conceptually simple: it
+works the same way that evaluation in centralized machine learning does. If there is a
+server-side dataset that can be used for evaluation purposes, then that’s great. We can
+evaluate the newly aggregated model after each round of training without having to send
+the model to clients.
We’re also fortunate in the sense that our entire evaluation
+dataset is available at all times.
+
+**Federated Evaluation** (or *client-side evaluation*) is more complex, but also more
+powerful: it doesn’t require a centralized dataset and allows us to evaluate models over
+a larger set of data, which often yields more realistic evaluation results. In fact,
+many scenarios require us to use **Federated Evaluation** if we want to get
+representative evaluation results at all. But this power comes at a cost: once we start
+to evaluate on the client side, we should be aware that our evaluation dataset can
+change over consecutive rounds of learning if those clients are not always available.
+Moreover, the dataset held by each client can also change over consecutive rounds. This
+can lead to evaluation results that are not stable, so even if we did not change the
+model, we’d see our evaluation results fluctuate over consecutive rounds.
+
+We’ve seen how federated evaluation works on the client side (i.e., by implementing the
+``evaluate`` method in ``FlowerClient``). Now let’s see how we can evaluate aggregated
+model parameters on the server side. First, we define a new function ``evaluate`` in
+``task.py``:
+
+.. code-block:: python
+
+    # The `evaluate` function will be called by Flower after every round
+    def evaluate(
+        server_round: int,
+        parameters,
+        config,
+    ):
+        device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
+        net = Net().to(device)
+        _, testloader = load_data(0, 10)
+        set_weights(net, parameters)  # Update model with the latest parameters
+        loss, accuracy = test(net, testloader, device)
+        print(f"Server-side evaluation loss {loss} / accuracy {accuracy}")
+        return loss, {"accuracy": accuracy}
+
+Next, in ``server_app.py``, we pass the ``evaluate`` function to the ``evaluate_fn``
+parameter of the ``FedAvg`` strategy:
+
+.. code-block:: python
+
+    def server_fn(context: Context) -> ServerAppComponents:
+        # Read from config
+        num_rounds = context.run_config["num-server-rounds"]
+        fraction_fit = context.run_config["fraction-fit"]
+
+        # Initialize model parameters
+        ndarrays = get_weights(Net())
+        parameters = ndarrays_to_parameters(ndarrays)
+
+        strategy = FedAvg(
+            fraction_fit=fraction_fit,
+            fraction_evaluate=1.0,
+            min_available_clients=2,
+            initial_parameters=parameters,
+            evaluate_fn=evaluate,
+        )
+        config = ServerConfig(num_rounds=num_rounds)
+
+        return ServerAppComponents(strategy=strategy, config=config)
+
+
+    # Create ServerApp
+    app = ServerApp(server_fn=server_fn)
+
+Finally, we run the simulation.
+
+.. code-block:: shell
+
+    $ flwr run .
+
+Sending/receiving arbitrary values to/from clients
+--------------------------------------------------
+
+In some situations, we want to configure client-side execution (training, evaluation)
+from the server side. One example of this is the server asking the clients to train for
+a certain number of local epochs. Flower provides a way to send configuration values
+from the server to the clients using a dictionary. Let’s look at an example where the
+clients receive values from the server through the ``config`` parameter in ``fit``
+(``config`` is also available in ``evaluate``). The ``fit`` method receives the
+configuration dictionary through the ``config`` parameter and can then read values from
+this dictionary. In this example, it reads ``server_round`` and ``local_epochs`` and
+uses those values to improve the logging and configure the number of local training
+epochs.
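+
+When reading values from ``config``, it can be useful to supply a default in case the
+server did not send a particular key. A small illustrative sketch (the key names match
+the ``fit_config`` function we define further below):
+
+.. code-block:: python
+
+    # Inside `fit`: read config values defensively (sketch; keys are our own choice)
+    server_round = config.get("server_round", 0)
+    local_epochs = int(config.get("local_epochs", 1))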
+
+In our ``client_app.py``, replace the ``FlowerClient()`` class and
+``client_fn()`` with the following code:
+
+.. code-block:: python
+
+    class FlowerClient(NumPyClient):
+        def __init__(self, pid, net, trainloader, valloader):
+            self.pid = pid  # partition ID of a client
+            self.net = net
+            self.trainloader = trainloader
+            self.valloader = valloader
+            self.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
+            self.net.to(self.device)
+
+        def get_weights(self, config):
+            print(f"[Client {self.pid}] get_weights")
+            return get_weights(self.net)
+
+        def fit(self, parameters, config):
+            # Read values from config
+            server_round = config["server_round"]
+            local_epochs = config["local_epochs"]
+
+            # Use values provided by the config
+            print(f"[Client {self.pid}, round {server_round}] fit, config: {config}")
+            set_weights(self.net, parameters)
+            train(self.net, self.trainloader, local_epochs, self.device)
+            return get_weights(self.net), len(self.trainloader.dataset), {}
+
+        def evaluate(self, parameters, config):
+            print(f"[Client {self.pid}] evaluate, config: {config}")
+            set_weights(self.net, parameters)
+            loss, accuracy = test(self.net, self.valloader, self.device)
+            return float(loss), len(self.valloader.dataset), {"accuracy": float(accuracy)}
+
+
+    def client_fn(context: Context):
+        net = Net()
+        partition_id = context.node_config["partition-id"]
+        num_partitions = context.node_config["num-partitions"]
+        trainloader, valloader = load_data(partition_id, num_partitions)
+
+        return FlowerClient(partition_id, net, trainloader, valloader).to_client()
+
+
+    # Create the ClientApp
+    app = ClientApp(client_fn=client_fn)
+
+So how can we send this config dictionary from server to clients? The built-in Flower
+strategies provide a way to do this, and it works similarly to the way server-side
+evaluation works. We provide a function to the strategy, and the strategy calls this
+function for every round of federated learning. Add the following function to your
+``server_app.py``:
+
+.. code-block:: python
+
+    def fit_config(server_round: int):
+        """Return training configuration dict for each round.
+
+        Perform two rounds of training with one local epoch, increase to two local
+        epochs afterwards.
+        """
+        config = {
+            "server_round": server_round,  # The current round of federated learning
+            "local_epochs": 1 if server_round <= 2 else 2,
+        }
+        return config
+
+Next, we’ll pass this function to the FedAvg strategy before starting the simulation.
+Change the ``server_fn()`` function in ``server_app.py`` to the following:
+
+.. code-block:: python
+
+    def server_fn(context: Context):
+        # Read from config
+        num_rounds = context.run_config["num-server-rounds"]
+        fraction_fit = context.run_config["fraction-fit"]
+
+        # Initialize model parameters
+        ndarrays = get_weights(Net())
+        parameters = ndarrays_to_parameters(ndarrays)
+
+        strategy = FedAvg(
+            fraction_fit=fraction_fit,
+            fraction_evaluate=1.0,
+            min_available_clients=2,
+            initial_parameters=parameters,
+            evaluate_fn=evaluate,
+            on_fit_config_fn=fit_config,
+        )
+        config = ServerConfig(num_rounds=num_rounds)
+
+        return ServerAppComponents(strategy=strategy, config=config)
+
+Finally, run the training with the following command:
+
+.. code-block:: shell
+
+    $ flwr run .
+
+As we can see, the client logs now include the current round of federated learning
+(which they read from the ``config`` dictionary). We can also configure local training
+to run for one epoch during the first and second round of federated learning, and then
+for two epochs during the third round.
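+
+The same mechanism is also available for evaluation: Flower strategies accept an
+``on_evaluate_config_fn``, which is called before every round of federated evaluation.
+As a minimal sketch (the key name is our own choice), you could add the following to
+``server_app.py`` and pass it to the strategy via ``on_evaluate_config_fn=evaluate_config``:
+
+.. code-block:: python
+
+    def evaluate_config(server_round: int):
+        """Return an evaluation configuration dict for each round (sketch)."""
+        # Clients receive this dict as `config` in their `evaluate` method
+        return {"server_round": server_round}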
+
+Clients can also return arbitrary values to the server. To do so, they return a
+dictionary from ``fit`` and/or ``evaluate``. We have seen and used this concept
+throughout this tutorial without mentioning it explicitly: our ``FlowerClient`` returns
+a dictionary containing a custom key/value pair as the third return value in
+``evaluate``.
+
+Scaling federated learning
+--------------------------
+
+As a last step in this tutorial, let’s see how we can use Flower to experiment with a
+large number of clients. In the ``pyproject.toml``, increase the number of SuperNodes to
+1000:
+
+.. code-block:: toml
+
+    [tool.flwr.federations.local-simulation]
+    options.num-supernodes = 1000
+
+Note that we can reuse the same ``ClientApp`` for different values of
+``num-supernodes``: ``client_fn()`` reads ``num-partitions`` from the ``Context``, and
+for simulations with Flower, the number of partitions is equal to the number of
+SuperNodes.
+
+We now have 1000 partitions, each holding 40 training and 10 validation examples. Given
+that the number of training examples on each client is quite small, we should probably
+train the model a bit longer, so we configure the clients to perform 3 local training
+epochs. We should also adjust the fraction of clients selected for training during each
+round (we don’t want all 1000 clients participating in every round), so we adjust
+``fraction_fit`` to ``0.025``, which means that only 2.5% of available clients (so 25
+clients) will be selected for training each round. We update the ``fraction-fit`` value
+in the ``pyproject.toml``:
+
+.. code-block:: toml
+
+    [tool.flwr.app.config]
+    fraction-fit = 0.025
+
+Then, we update the ``fit_config`` and ``server_fn`` functions in ``server_app.py`` to
+the following:
+
+.. code-block:: python
+
+    def fit_config(server_round: int):
+        config = {
+            "server_round": server_round,
+            "local_epochs": 3,
+        }
+        return config
+
+
+    def server_fn(context: Context):
+        # Read from config
+        num_rounds = context.run_config["num-server-rounds"]
+        fraction_fit = context.run_config["fraction-fit"]
+
+        # Initialize model parameters
+        ndarrays = get_weights(Net())
+        parameters = ndarrays_to_parameters(ndarrays)
+
+        # Create FedAvg strategy
+        strategy = FedAvg(
+            fraction_fit=fraction_fit,  # Train on 25 clients (each round)
+            fraction_evaluate=0.05,  # Evaluate on 50 clients (each round)
+            min_fit_clients=20,
+            min_evaluate_clients=40,
+            min_available_clients=1000,
+            initial_parameters=parameters,
+            on_fit_config_fn=fit_config,
+        )
+        config = ServerConfig(num_rounds=num_rounds)
+
+        return ServerAppComponents(strategy=strategy, config=config)
+
+
+    # Create the ServerApp
+    app = ServerApp(server_fn=server_fn)
+
+Finally, run the simulation with the following command:
+
+.. code-block:: shell
+
+    $ flwr run .
+
+Recap
+-----
+
+In this tutorial, we’ve seen how we can gradually enhance our system by choosing a
+different strategy, initializing parameters on the server side, and evaluating models
+on the server side. That’s quite a bit of flexibility with so little code, right?
+
+In the later sections, we’ve seen how we can communicate arbitrary values between server
+and clients to fully customize client-side execution. With that capability, we built a
+large-scale Federated Learning simulation using the Flower Virtual Client Engine and ran
+an experiment involving 1000 clients in the same workload - all in the same Flower
+project!
+
+Next steps
+----------
+
+Before you continue, make sure to join the Flower community on Flower Discuss (`Join
+Flower Discuss `__) and on Slack (`Join Slack
+`__).
+
+There’s a dedicated ``#questions`` channel if you need help, but we’d also love to hear
+who you are in ``#introductions``!
+
+The :doc:`Flower Federated Learning Tutorial - Part 3
+` shows how to build a fully
+custom ``Strategy`` from scratch.
diff --git a/framework/docs/source/tutorial-series-what-is-federated-learning.rst b/framework/docs/source/tutorial-series-what-is-federated-learning.rst
new file mode 100644
index 000000000000..dbb284b43a3e
--- /dev/null
+++ b/framework/docs/source/tutorial-series-what-is-federated-learning.rst
@@ -0,0 +1,363 @@
+What is Federated Learning?
+===========================
+
+Welcome to the Flower federated learning tutorial!
+
+In this tutorial, you will learn what federated learning is, build your first system in
+Flower, and gradually extend it. If you work through all parts of the tutorial, you will
+be able to build advanced federated learning systems that approach the current state of
+the art in the field.
+
+🧑‍🏫 This tutorial starts from zero and expects no familiarity with federated learning.
+Only a basic understanding of data science and Python programming is assumed.
+
+    `Star Flower on GitHub `__ ⭐️ and join the
+    open-source Flower community on Slack to connect, ask questions, and get help: `Join
+    Slack `__ 🌼 We’d love to hear from you in the
+    ``#introductions`` channel! And if anything is unclear, head over to the
+    ``#questions`` channel.
+
+Let’s get started!
+
+Classical Machine Learning
+--------------------------
+
+Before we begin discussing federated learning, let us quickly recap how most machine
+learning works today.
+
+In machine learning, we have a model, and we have data. The model could be a neural
+network (as depicted here), or something else, like classical linear regression.
+
+.. raw:: html
+
+    <!-- image: Model and data -->
+
+ +We train the model using the data to perform a useful task. A task could be to detect +objects in images, transcribe an audio recording, or play a game like Go. + +.. raw:: html + +
+    <!-- image: Train model using data -->
+
+ +In practice, the training data we work with doesn’t originate on the machine we train +the model on. + +This data gets created “somewhere else”. For instance, the data can originate on a +smartphone by the user interacting with an app, a car collecting sensor data, a laptop +receiving input via the keyboard, or a smart speaker listening to someone trying to sing +a song. + +.. raw:: html + +
+    <!-- image: Data on a phone -->
+
+
+What’s also important to mention is that this “somewhere else” is usually not just one
+place; it’s many places. It could be several devices all running the same app. But it
+could also be several organizations, all generating data for the same task.
+
+.. raw:: html
+
+    <!-- image: Data is on many devices -->
+
+ +So to use machine learning, or any kind of data analysis, the approach that has been +used in the past was to collect all this data on a central server. This server can be +located somewhere in a data center, or somewhere in the cloud. + +.. raw:: html + +
+    <!-- image: Central data collection -->
+
+ +Once all the data is collected in one place, we can finally use machine learning +algorithms to train our model on the data. This is the machine learning approach that +we’ve basically always relied on. + +.. raw:: html + +
+    <!-- image: Central model training -->
+
+
+Challenges of classical machine learning
+----------------------------------------
+
+This classical machine learning approach we’ve just seen can be used in some cases.
+Great examples include categorizing holiday photos or analyzing web traffic: cases
+where all the data is naturally available on a centralized server.
+
+.. raw:: html
+
+    <!-- image: Centralized possible -->
+
+
+But the approach cannot be used in many other cases: cases where the data is not
+available on a centralized server, or where the data available on one server is not
+enough to train a good model.
+
+.. raw:: html
+
+    <!-- image: Centralized impossible -->
+
+ +There are many reasons why the classical centralized machine learning approach does not +work for a large number of highly important real-world use cases. Those reasons include: + +- **Regulations**: GDPR (Europe), CCPA (California), PIPEDA (Canada), LGPD (Brazil), + PDPL (Argentina), KVKK (Turkey), POPI (South Africa), FSS (Russia), CDPR (China), PDPB + (India), PIPA (Korea), APPI (Japan), PDP (Indonesia), PDPA (Singapore), APP + (Australia), and other regulations protect sensitive data from being moved. In fact, + those regulations sometimes even prevent single organizations from combining their own + users’ data for machine learning training because those users live in different parts + of the world, and their data is governed by different data protection regulations. +- **User preference**: In addition to regulation, there are use cases where users just + expect that no data leaves their device, ever. If you type your passwords and credit + card info into the digital keyboard of your phone, you don’t expect those passwords to + end up on the server of the company that developed that keyboard, do you? In fact, + that use case was the reason federated learning was invented in the first place. +- **Data volume**: Some sensors, like cameras, produce such a high data volume that it + is neither feasible nor economic to collect all the data (due to, for example, + bandwidth or communication efficiency). Think about a national rail service with + hundreds of train stations across the country. If each of these train stations is + outfitted with a number of security cameras, the volume of raw on-device data they + produce requires incredibly powerful and exceedingly expensive infrastructure to + process and store. And most of the data isn’t even useful. + +Examples where centralized machine learning does not work include: + +- Sensitive healthcare records from multiple hospitals to train cancer detection models. +- Financial information from different organizations to detect financial fraud. +- Location data from your electric car to make better range prediction. +- End-to-end encrypted messages to train better auto-complete models. + +The popularity of privacy-enhancing systems like the `Brave `__ +browser or the `Signal `__ messenger shows that users care about +privacy. In fact, they choose the privacy-enhancing version over other alternatives, if +such an alternative exists. But what can we do to apply machine learning and data +science to these cases to utilize private data? After all, these are all areas that +would benefit significantly from recent advances in AI. + +Federated Learning +------------------ + +Federated Learning simply reverses this approach. It enables machine learning on +distributed data by moving the training to the data, instead of moving the data to the +training. Here’s a one-liner explanation: + +- Centralized machine learning: move the data to the computation +- Federated (machine) Learning: move the computation to the data + +By doing so, Federated Learning enables us to use machine learning (and other data +science approaches) in areas where it wasn’t possible before. We can now train excellent +medical AI models by enabling different hospitals to work together. We can solve +financial fraud by training AI models on the data of different financial institutions. +We can build novel privacy-enhancing applications (such as secure messaging) that have +better built-in AI than their non-privacy-enhancing alternatives. 
And those are just a +few of the examples that come to mind. As we deploy Federated Learning, we discover more +and more areas that can suddenly be reinvented because they now have access to vast +amounts of previously inaccessible data. + +So how does Federated Learning work, exactly? Let’s start with an intuitive explanation. + +Federated learning in five steps +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Step 0: Initialize global model ++++++++++++++++++++++++++++++++ + +We start by initializing the model on the server. This is exactly the same in classic +centralized learning: we initialize the model parameters, either randomly or from a +previously saved checkpoint. + +.. raw:: html + +
+    <!-- image: Initialize global model -->
+
+ +Step 1: Send model to a number of connected organizations/devices (client nodes) +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ + +Next, we send the parameters of the global model to the connected client nodes (think: +edge devices like smartphones or servers belonging to organizations). This is to ensure +that each participating node starts its local training using the same model parameters. +We often use only a few of the connected nodes instead of all nodes. The reason for this +is that selecting more and more client nodes has diminishing returns. + +.. raw:: html + +
+    <!-- image: Send global model -->
+
+ +Step 2: Train model locally on the data of each organization/device (client node) ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ + +Now that all (selected) client nodes have the latest version of the global model +parameters, they start the local training. They use their own local dataset to train +their own local model. They don’t train the model until full convergence, but they only +train for a little while. This could be as little as one epoch on the local data, or +even just a few steps (mini-batches). + +.. raw:: html + +
+    <!-- image: Train on local data -->
+
+ +Step 3: Return model updates back to the server ++++++++++++++++++++++++++++++++++++++++++++++++ + +After local training, each client node has a slightly different version of the model +parameters they originally received. The parameters are all different because each +client node has different examples in its local dataset. The client nodes then send +those model updates back to the server. The model updates they send can either be the +full model parameters or just the gradients that were accumulated during local training. + +.. raw:: html + +
+    <!-- image: Send model updates -->
+
+ +Step 4: Aggregate model updates into a new global model ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ + +The server receives model updates from the selected client nodes. If it selected 100 +client nodes, it now has 100 slightly different versions of the original global model, +each trained on the local data of one client. But didn’t we want to have one model that +contains the learnings from the data of all 100 client nodes? + +In order to get one single model, we have to combine all the model updates we received +from the client nodes. This process is called *aggregation*, and there are many +different ways to do it. The most basic way is called *Federated Averaging* (`McMahan et +al., 2016 `__), often abbreviated as *FedAvg*. +*FedAvg* takes the 100 model updates and, as the name suggests, averages them. To be +more precise, it takes the *weighted average* of the model updates, weighted by the +number of examples each client used for training. The weighting is important to make +sure that each data example has the same “influence” on the resulting global model. If +one client has 10 examples, and another client has 100 examples, then - without +weighting - each of the 10 examples would influence the global model ten times as much +as each of the 100 examples. + +.. raw:: html + +
+    <!-- image: Aggregate model updates -->
+
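+
+To make the weighting concrete, here is a minimal NumPy sketch of *FedAvg*-style
+aggregation for two clients with 10 and 100 examples (the variable names are
+illustrative; this is not Flower code):
+
+.. code-block:: python
+
+    import numpy as np
+
+    # One model update per client (a single parameter tensor each, for brevity)
+    update_a, num_examples_a = np.array([1.0, 1.0]), 10
+    update_b, num_examples_b = np.array([3.0, 3.0]), 100
+
+    # Weight each update by the number of local training examples
+    total_examples = num_examples_a + num_examples_b
+    new_global = (
+        num_examples_a * update_a + num_examples_b * update_b
+    ) / total_examples
+    print(new_global)  # [2.818... 2.818...], much closer to the 100-example client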
+ +Step 5: Repeat steps 1 to 4 until the model converges ++++++++++++++++++++++++++++++++++++++++++++++++++++++ + +Steps 1 to 4 are what we call a single round of federated learning. The global model +parameters get sent to the participating client nodes (step 1), the client nodes train +on their local data (step 2), they send their updated models to the server (step 3), and +the server then aggregates the model updates to get a new version of the global model +(step 4). + +During a single round, each client node that participates in that iteration only trains +for a little while. This means that after the aggregation step (step 4), we have a model +that has been trained on all the data of all participating client nodes, but only for a +little while. We then have to repeat this training process over and over again to +eventually arrive at a fully trained model that performs well across the data of all +client nodes. + +Conclusion +~~~~~~~~~~ + +Congratulations, you now understand the basics of federated learning. There’s a lot more +to discuss, of course, but that was federated learning in a nutshell. In later parts of +this tutorial, we will go into more detail. Interesting questions include: How can we +select the best client nodes that should participate in the next round? What’s the best +way to aggregate model updates? How can we handle failing client nodes (stragglers)? + +Federated Evaluation +~~~~~~~~~~~~~~~~~~~~ + +Just like we can train a model on the decentralized data of different client nodes, we +can also evaluate the model on that data to receive valuable metrics. This is called +federated evaluation, sometimes abbreviated as FE. In fact, federated evaluation is an +integral part of most federated learning systems. + +Federated Analytics +~~~~~~~~~~~~~~~~~~~ + +In many cases, machine learning isn’t necessary to derive value from data. Data analysis +can yield valuable insights, but again, there’s often not enough data to get a clear +answer. What’s the average age at which people develop a certain type of health +condition? Federated analytics enables such queries over multiple client nodes. It is +usually used in conjunction with other privacy-enhancing technologies like secure +aggregation to prevent the server from seeing the results submitted by individual client +nodes. + +Differential Privacy +~~~~~~~~~~~~~~~~~~~~ + +Differential privacy (DP) is often mentioned in the context of Federated Learning. It is +a privacy-preserving method used when analyzing and sharing statistical data, ensuring +the privacy of individual participants. DP achieves this by adding statistical noise to +the model updates, ensuring any individual participants’ information cannot be +distinguished or re-identified. This technique can be considered an optimization that +provides a quantifiable privacy protection measure. + +Flower +------ + +Federated learning, federated evaluation, and federated analytics require infrastructure +to move machine learning models back and forth, train and evaluate them on local data, +and then aggregate the updated models. Flower provides the infrastructure to do exactly +that in an easy, scalable, and secure way. In short, Flower presents a unified approach +to federated learning, analytics, and evaluation. It allows the user to federate any +workload, any ML framework, and any programming language. + +.. raw:: html + +
+    <!-- image: Flower federated learning server and client nodes (car, scooter, personal computer, roomba, and phone) -->
+
+
+Final Remarks
+-------------
+
+Congratulations, you just learned the basics of federated learning and how it relates
+to classic (centralized) machine learning!
+
+In the next part of this tutorial, we are going to build our first federated learning
+system with Flower.
+
+Next steps
+----------
+
+Before you continue, make sure to join the Flower community on Slack: `Join Slack
+`__
+
+There’s a dedicated ``#questions`` channel if you need help, but we’d also love to hear
+who you are in ``#introductions``!
+
+The `Flower Federated Learning Tutorial - Part 1
+`__
+shows how to build a simple federated learning system with PyTorch and Flower.