Experimental Repository
Samples demonstrating how to scale Flower.ai simulations on a Ray Cluster.
- Scaling Federated Learning Simulations on Multi-Node Ray Cluster
- Presentation deck
- Videos: Presentation and interview during the event live stream.
This repository provides experimental examples showcasing Flower.ai simulations running distributed across multiple nodes and GPUs on a Ray Cluster.
- quickstart-pytorch: based on the Flower quickstart PyTorch example
The Ray version used by the application needs to match that of the Ray Cluster. Note that Flower 0.16.0 simulation requires Ray 2.31.0.
- When using IOG Ray clusters: use the customized Flower fork at IOG-Foundation/[email protected] so that it matches the IOG Ray Cluster, which is based on Ray 2.30.0. Besides the custom pinned Ray version, the fork includes minor changes useful for a production Ray cluster, such as allowing Ray nodes to have zero resources and a flag to suppress deprecation warnings; these might eventually be merged into the main repo if they make sense for the official Flower release.
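To verify that the versions match, a quick check on both sides (a trivial sketch; it assumes the Ray Jobs CLI is already pointed at your cluster):

```shell
# Ray version in the local environment used to submit jobs.
ray --version

# Ray version on the cluster: run a trivial job that prints it.
ray job submit -- ray --version
```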
The samples replicate the original Flower examples with minor adjustments and pre-configured environment dependencies for convenience.
For the quickstart-pytorch example:

```shell
git clone https://github.com/IOG-Foundation/flower-samples.git
cd flower-samples/quickstart-pytorch
```

The following assumes you are familiar with the common execution and setup of a Flower simulation, as described in the base README.
For reference, this is a standard execution:

```shell
flwr run . --run-config "num-server-rounds=5"
```

When submitting the script as a Ray Job, there is no need to install the dependencies locally (except for a matching Ray version, 2.31.0). If you want to submit to your local Ray Cluster, include the Ray dashboard in your local environment (`pip install "ray[default]==2.31.0"`) and start a head node (`ray start --head`).
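For a quick local test, that setup is:

```shell
# Local environment able to submit Ray Jobs (includes the Ray dashboard).
pip install "ray[default]==2.31.0"

# Start a local head node; the dashboard listens on port 8265 by default.
ray start --head
```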
Then submit the job:

```shell
ray job submit \
  --runtime-env-json '{"working_dir": ".", "pip": "requirements.txt"}' \
  -- flwr run . --run-config "num-server-rounds=5"
```

Notes:
- The `working_dir` is propagated to all Ray nodes.
- The runtime environment is configured on the worker nodes at submission. Subsequent submissions may leverage the cached environment.
- Specify the cluster address with `--address` if needed (see the Ray Jobs CLI Reference and the sketch after these notes).
- The job command to be executed is defined after `--`.
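For instance, a sketch of a submission to a remote cluster; the address below is a placeholder for your head node's dashboard endpoint:

```shell
# Hypothetical address: replace with your cluster's dashboard endpoint.
ray job submit \
  --address http://head-node:8265 \
  --runtime-env-json '{"working_dir": ".", "pip": "requirements.txt"}' \
  -- flwr run . --run-config "num-server-rounds=5"
```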
When using the intermediate script flwr_run.py to call the Flower CLI, make sure it is available in `working_dir`. Then:
```shell
ray job submit \
  --runtime-env-json '{"working_dir": ".", "pip": "requirements.txt"}' \
  -- python flwr_run.py . --run-config "num-server-rounds=5"
```

As noted, `python flwr_run.py .` is equivalent to `flwr run .`; all Flower simulation arguments are forwarded directly.
🔧 Purpose of flwr_run.py:

Use this script to execute custom code or configuration before invoking `flwr run`. Modify it if your simulation needs specific environment setup, dynamic parameters, or pre-processing tasks not directly supported by the Flower CLI. Possible applications:

- Trigger cluster autoscaling in advance based on the request, since the Flower simulation will define the distribution based on the resources that are already available.
- Evaluate the request to tune/adapt the resources passed to the Flower CLI: some combinations of client resources and number of supernodes may allocate resources that block on wait, creating an undesired sequential pipeline, especially when the available resources far exceed the configured num-supernodes.
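For illustration, a minimal sketch of what such a wrapper can look like; the `pre_run_setup` hook is hypothetical, and the actual flwr_run.py in this repo may differ:

```python
# flwr_run.py -- illustrative sketch: run custom setup, then forward all
# arguments to the Flower CLI unchanged.
import subprocess
import sys


def pre_run_setup(args):
    # Hypothetical hook: e.g. trigger cluster autoscaling or adjust
    # resource-related arguments before the simulation starts.
    print(f"Preparing environment for: flwr run {' '.join(args)}")


if __name__ == "__main__":
    cli_args = sys.argv[1:]  # e.g. [".", "--run-config", "num-server-rounds=5"]
    pre_run_setup(cli_args)
    # Equivalent to invoking `flwr run <args>` directly.
    sys.exit(subprocess.call(["flwr", "run", *cli_args]))
```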
To use conda as the base environment:
```shell
ray job submit \
  --runtime-env-json '{"working_dir": ".", "conda": "conda-env.yml"}' \
  -- flwr run . --run-config "num-server-rounds=5"
```

The dependencies below use the customized Flower fork at IOG-Foundation/flower/iog-ray-cluster-2.30.0.
From VSCode on your IOG Ray Cluster, set up the authentication header once:

```shell
export RAY_JOB_HEADERS="{\"Authorization\": \"Basic $(echo -n ':'$PASSWORD_ENV | base64)\"}"
```

Submit your Flower simulation job:
```shell
ray job submit \
  --runtime-env-json '{"working_dir": ".", "pip": "requirements-iog.txt"}' \
  -- flwr run . simulation-gpu --run-config "num-server-rounds=5"
```

For base conda env:
```shell
ray job submit \
  --runtime-env-json '{"working_dir": ".", "conda": "conda-env-iog.yml"}' \
  -- flwr run . simulation-gpu --run-config "num-server-rounds=5"
```

To turn on the `FLWR_SUPRESS_DEPRECATION_WARNINGS` flag:
```shell
ray job submit \
  --runtime-env-json '{"working_dir": ".", "conda": "conda-env-iog.yml"}' \
  -- FLWR_SUPRESS_DEPRECATION_WARNINGS=true flwr run . simulation-gpu --run-config "num-server-rounds=5"
```

Troubleshooting:
- Version mismatch errors: ensure Ray versions match exactly between your local and cluster environments.
- Dependency errors (IOG cluster): verify you're using the correct customized Flower version (`iog-ray-cluster-2.30.0`).
This repository welcomes contributions. Open an issue or PR if you have suggestions or improvements.