diff --git a/examples/tutorials/self-paced-training/part-2_federated_learning_system/chapter-3_federated_computing_platform/03.1_federated_computing_architecture/system_architecture.ipynb b/examples/tutorials/self-paced-training/part-2_federated_learning_system/chapter-3_federated_computing_platform/03.1_federated_computing_architecture/system_architecture.ipynb index fedff218d2..e6ebe72a7f 100644 --- a/examples/tutorials/self-paced-training/part-2_federated_learning_system/chapter-3_federated_computing_platform/03.1_federated_computing_architecture/system_architecture.ipynb +++ b/examples/tutorials/self-paced-training/part-2_federated_learning_system/chapter-3_federated_computing_platform/03.1_federated_computing_architecture/system_architecture.ipynb @@ -85,12 +85,16 @@ "From top to bottom, FCI has the following layers:\n", "\n", "* **API Layer**: This is the API exposed to application developers, like Communicator and Cellnet.\n", + "\n", "* **Streamable Framed Message (SFM)**: This is the core of FCI and it provides abstraction on top of different communication protocols. It manages endpoints and connections.\n", + "\n", "* **Transport Drivers**: This layer is responsible for sending frames to other endpoints. It treats the frame as opaque bytes.\n", + "One can use one of the out-of-the-box drivers, such as gRPC, TCP, or HTTP/WebSocket, or develop a custom driver for alternative protocols. Switching drivers does not affect the application layers.\n", "\n", "\"FLARE\n", "\n", - "## Federated Computing Architecture\n", + "\n", + "## Federated Job Processing Architecture\n", "\n", "There are two parent control processes with corresponding job processes on each site. 
This enables support of concurrent, multi-job processes.\n", "\n", diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.0_introduction/introduction.ipynb b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.0_introduction/introduction.ipynb index 76eb2b5cd7..b5c8683127 100644 --- a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.0_introduction/introduction.ipynb +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.0_introduction/introduction.ipynb @@ -1,11 +1,56 @@ { "cells": [ { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "id": "ceca45d8-437c-44ae-8ed9-7a784983731f", "metadata": {}, - "outputs": [], + "source": [ + "\n", + "## **Security in NVIDIA FLARE Federated Computing Systems** \n", + "\n", + "A federated computing system requires robust security mechanisms to ensure that only legitimate and trusted participants contribute, while also protecting communication channels and enforcing authorization policies. Below are the critical security components of an FL system. \n", + "\n", + "This area is concerned with these two trust issues:\n", + "\n", + "* **Authentication**\n", + "\n", + "ensures communicating parties have enough confidence about each other’s identities: everyone is who they claim to be.\n", + "\n", + "* **Authorization** \n", + "\n", + "ensures that the user can only do what he/she is authorized to do.\n", + "\n", + "Due to the distributed nature of a federated computing system, additional authentication and authorization are needed for each participating organization. 
\n", + "\n", + "You can find out how NVIDIA FLARE implements these via event-based federated authentication and authorization.\n", + "\n", + "* **Privacy Protection**: \n", + "\n", + "Another aspect of security is privacy protection. We have introduced different privacy enhancement technologies (PETs) in [Chapter 5](../../chapter-5_Privacy_In_Federated_Learning/05.0_introduction/introduction.ipynb); here we are going to explore the privacy protection mechanisms available at the organization level. \n", + "\n", + "* **Trust-based security** \n", + "\n", + "A trust-based mechanism adds another layer of protection to the security mechanism.\n", + "By leveraging confidential computing's VM-based trusted execution environment (TEE), NVIDIA FLARE will enable end-to-end confidential federated AI. We will briefly touch on it in this chapter. The details will be added in the future. \n", + "\n", + "\n", + "* **Communication Security**\n", + "\n", + "Use of secure protocols: TLS for secure transmission. FLARE supports both mutual TLS (mTLS) and normal TLS with signed messages. \n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "c7a65a24", + "metadata": {}, "source": [] } ], diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.1_security_architecture/Seurity_architecture.ipynb b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.1_security_architecture/Seurity_architecture.ipynb index cc2eeea833..cf950a9a64 100644 --- a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.1_security_architecture/Seurity_architecture.ipynb +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.1_security_architecture/Seurity_architecture.ipynb @@ -1,11 +1,102 @@ { 
"cells": [ { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "id": "c563b1cd-2176-4198-96a7-52b096b0b656", "metadata": {}, - "outputs": [], + "source": [ + "# NVIDIA FLARE Security Architecture\n", + "\n", + "NVFLARE is an application running in the IT environment of each participating site. The total security of this application is the combination of the security measures implemented in this application and the security measures of the site’s IT infrastructure.\n", + "\n", + "NVFLARE implements security measures in the following areas:\n", + "\n", + "* **Identity Security**: the authentication and authorization of communicating parties\n", + "\n", + "* **Site Policy Management**: the policies for resource management, authorization, and privacy protection defined by each site\n", + "\n", + "* **Communication Security**: the confidentiality of data communication messages\n", + "\n", + "* **Message Serialization**: techniques for ensuring safe serialization/deserialization process between communicating parties\n", + "\n", + "* **Data Privacy Protection**: techniques for preventing local data from being leaked and/or reverse-engineered\n", + "\n", + "* **Auditing**: techniques for keeping audit trails to record events (e.g. commands issued by users, learning/training related events that can be analyzed to understand the final results)\n", + "\n", + "All other security concerns must be handled by the site’s IT security infrastructure. These include, but are not limited to:\n", + "\n", + "Physical security\n", + "\n", + "Firewall policies\n", + "\n", + "Data management policies: storage, retention, cleaning, distribution, access, etc.\n", + "\n", + "Security Trust Boundary and Balance of Risk and Usability\n", + "\n", + "The security framework does not operate in vacuum; we assume that physical security is already in place for all participating server and client machines. 
TLS provides the authentication mechanism within the trusted environments.\n", + "\n", + "\n", + "--- \n", + "\n", + "## Terminologies and Roles\n", + "\n", + "### Terminologies\n", + "NVIDIA FLARE uses the following terminologies, let's define them here: \n", + "\n", + "* Project -- A federated learning study with identified participants.\n", + "* Org -- An organization that participates in the study.\n", + "* Site -- The computing system that runs NVFLARE application as part of the study. There are two kinds of sites: Server and Clients. Each site belongs to an organization.\n", + "* FL Server -- An application running on a Server site responsible for client coordination based on federation workflows. \n", + "* FL Client -- An application running on a client site that responds to Server’s task assignments and performs learning actions based on its local data.\n", + "* User -- A human that participates in the FL project.\n", + "\n", + "### Roles\n", + "\n", + "A role defines a type of users that have certain privileges of system operations. Each user is assigned a role in the project. There are four defined roles: Project Admin, Org Admin, Lead Researcher, and Member Researcher.\n", + "\n", + "* Project Admin Role -- The Project Admin is responsible for provisioning the participants and coordinating personnel from all sites for the project. 
There is only one Project Admin for each project.\n", + "\n", + "* Org Admin Role -- This role is responsible for the management of the sites of his/her organization.\n", + "\n", + "* Lead Researcher Role -- This role can be configured for increased privileges for an organization for a scientist who works with other researchers to ensure the success of the project.\n", + "\n", + "* Member Researcher Role -- This role can be configured for another level of privileges for a scientist who works with the Lead Researcher to make sure his/her site is properly prepared for the project.\n", + "\n", + "* FLARE Console -- A console application running on a user’s machine that allows the user to perform NVFLARE system operations with a command line interface.\n", + "\n", + "Now let's dive into identity security, authentication, and authorization [here](../06.2_identity_security/identity_security.ipynb)\n", + "\n", + "\n", + "\n", + "## Identity Security\n", + "\n", + " see [here](../06.2_identity_security/identity_security.ipynb) for NVFLARE’s authentication model\n", + "\n", + "## Federated Policy\n", + "\n", + " see [here](../06.3_site_security_privacy_policy/site_policy.ipynb) for site-specific security and privacy policies provided by NVIDIA FLARE\n", + " \n", + "## Customized Security Plugins\n", + "\n", + " see [here](../06.4_customized_site_security/customized_site_security.ipynb) for site-specific customized security integration\n", + "\n", + "## Communication Security\n", + "\n", + " see [here](../06.5_communition_security/communication_security.ipynb) for communication security & configuration\n", + "\n", + "## Message Serialization\n", + " todo \n", + "\n", + "## Auditing\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "622581aa", + "metadata": {}, "source": [] } ], diff --git 
a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.2_authentication_and_authorization/site_specific_authentication_and_authorization.ipynb b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.2_authentication_and_authorization/site_specific_authentication_and_authorization.ipynb deleted file mode 100644 index 4f5d8c430a..0000000000 --- a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.2_authentication_and_authorization/site_specific_authentication_and_authorization.ipynb +++ /dev/null @@ -1,33 +0,0 @@ -{ - "cells": [ - { - "cell_type": "code", - "execution_count": null, - "id": "0e725a2d-af59-43ae-8ef8-aeabe5581d4b", - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "nvflare_example", - "language": "python", - "name": "nvflare_example" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.2" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.2_identity_security/identity_security.ipynb b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.2_identity_security/identity_security.ipynb new file mode 100644 index 0000000000..ba357a6274 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.2_identity_security/identity_security.ipynb @@ -0,0 +1,210 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# **Identity 
Security** \n", + ">note\n", + " This section is the same as \n", + "[FLARE's Security documentation](https://nvflare.readthedocs.io/en/2.4/user_guide/security/identity_security.html)\n", + "\n", + "\n", + "## Authentication\n", + "\n", + "NVFLARE’s authentication model is based on Public Key Infrastructure (PKI) technology.\n", + "\n", + "For the FL project, the Project Admin uses the Provisioning Tool to create a Root CA with a self-signed root certificate. This Root CA will be used to issue all other certs needed by communicating parties.\n", + "\n", + "Identities involved in the project (Server(s), Clients, the Overseer, Users) are provisioned with the Provisioning Tool. Each identity is defined with a unique common name. For each identity, the Provisioning Tool generates a separate password-protected Startup Kit, which includes security credentials for mutual TLS authentication:\n", + "\n", + "* The certificate of the Root CA\n", + "* The cert of the identity\n", + "* The private key of the identity\n", + "\n", + "Startup Kits are distributed to the intended identities:\n", + "* The FL Server’s kit is sent to the Project Admin\n", + "* The kit for each FL Client is sent to the Org Admin responsible for the site\n", + "* FLARE Console (previously called Admin Client) kits are sent to the user(s)\n", + "\n", + "To ensure the integrity of the Startup Kit, each file in the kit is signed by the Root CA.\n", + "\n", + "Each Startup Kit also contains a “start.sh” file, which can be used to properly start the NVFLARE application.\n", + "\n", + "Once started, the Client tries to establish a mutually-authenticated TLS connection with the Server, using the PKI credentials in its Startup Kit. 
This is possible only if the client and the server both have the correct Startup Kits.\n", + "\n", + "Similarly, when a user tries to operate the NVFLARE system with the Admin Client app, the admin client tries to establish a mutually-authenticated TLS connection with the Server, using the PKI credentials in its Startup Kit. This is possible only if the admin client and the server both have the correct Startup Kits. The admin user also must enter his/her assigned user name correctly.\n", + "\n", + "As of Release 2.6.0, NVFLARE also supports normal TLS (in addition to mutual TLS). \n", + "\n", + "\n", + "The security of the system comes from the PKI credentials in the Startup Kits. As you can see, this mechanism involves manual processing and human interactions for Startup Kit distribution, and hence the identity security of the system depends on the trust of the involved people. To minimize security risk, we recommend that people involved follow these best practice guidelines:\n", + "\n", + "The Project Admin, who is responsible for the provisioning process of the study, should protect the study’s configuration files and store created Startup Kits securely.\n", + "\n", + "When distributing Startup Kits, the Project Admin should use trusted communication methods, and never send passwords of the Startup Kits in the same communication. It is preferred to send the Kits and passwords with different communication methods.\n", + "\n", + "Org Admins and users must protect their Startup Kits and only use them for intended purposes.\n", + "\n", + "> Note\n", + "The provisioning tool tries to use the strongest cryptography suites possible when generating the PKI credentials. All of the certificates are compliant with the X.509 standard. All private keys are generated with a size of 2048 bits. The backend is openssl 1.1.1f, released on March 31, 2020, with no known CVE. 
All certificates expire within 360 days.\n", + "\n", + "> NVFlare Dashboard is a website that supports user and site registration. Users will be able to download their Startup Kits (and other artifacts) from the website.\n", + "\n", + "Let's see the above in action.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "First, let's generate the default startup kits using POC mode. POC mode internally calls provision to generate the default startup kits." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "! echo y | nvflare poc prepare\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "! tree /tmp/nvflare/poc/example_project/prod_00" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Look at each site's startup kit. We see:\n", + "\n", + "```\n", + "startup\n", + "│ │ ├── client.crt <-- certificate\n", + "│ │ ├── client.key <-- private key\n", + "│ │ ├── fed_client.json\n", + "│ │ ├── rootCA.pem <-- root certificate\n", + "│ │ ├── signature.json\n", + "│ │ ├── start.sh\n", + "│ │ ├── stop_fl.sh\n", + "│ │ └── sub_start.sh\n", + "\n", + "```\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Authorization\n", + "\n", + "Federated learning is conducted over computing resources owned by different organizations. Naturally these organizations have concerns about their computing resources being misused or abused. Even if an NVFLARE docker is trusted by participating orgs, researchers can still bring their own custom code to be part of a study (BYOC), which could be a big concern to many organizations. 
In addition, organizations may also have IP (intellectual property) requirements on the studies performed by their own researchers.\n", + "\n", + "Some frameworks disable the BYOC feature and only allow pre-installed packages in production, but during research and development, researchers need to modify the code constantly during experiments. \n", + "\n", + "\n", + "NVFLARE comes with an authorization system that can help address these security concerns and IP requirements. With this system, an organization can define strict policy to control access to their computing resources and/or FL jobs.\n", + "\n", + "Here are some examples of what an org can do:\n", + "\n", + "* Restrict BYOC to only the org’s own researchers;\n", + "\n", + "* Allow jobs only from its own researchers, or from specified other orgs, or even from specified trusted other researchers;\n", + "\n", + "* Totally disable remote shell commands on its sites\n", + "\n", + "* Allow the “ls” shell command but disable all other remote shell commands\n", + "\n", + "\n", + "### Centralized vs. Federated Authorization\n", + "\n", + "Instead of relying on a centralized party (the FL Server) to authenticate and authorize the user, NVFLARE now uses federated authorization where each organization defines and enforces its own authorization policy:\n", + "\n", + "Each organization defines its policy in its own authorization.json (in the local folder of the workspace). 
This locally defined policy is loaded by FL Clients owned by the organization. The policy is also enforced by these FL Clients.\n", + "\n", + "This decentralized authorization has an added benefit: since each organization takes care of its own authorization, there will be no need to update the policy of any other participants (FL Server or Clients) when new orgs or clients are added.\n", + "\n", + "### Simplified Authorization Policy Configuration\n", + "Since each organization defines its own policy, there will be no need to centrally define all orgs and users. The policy configuration for an org is simply a matrix of role/right permissions. Each role/right combination in the permission matrix answers this question: what kind of users of this role can have this right?\n", + "\n", + "To answer this question, the role/right combination defines one or more conditions, and the user must meet one of these conditions to have the right. The set of conditions is called a control.\n", + "\n", + "### Roles\n", + "Users are classified into roles. NVFLARE defines four roles:\n", + "\n", + "* **Project Admin** - this role is responsible for the whole FL project;\n", + "\n", + "* **Org Admin** - this role is responsible for the administration of all sites in its org. Each org must have one Org Admin;\n", + "\n", + "* **Lead (researcher)** - this role conducts FL studies\n", + "\n", + "* **Member (researcher)** - this role observes the FL study but cannot submit jobs\n", + "\n", + "### Rights\n", + "\n", + "* Admin commands are grouped into categories. For example, commands like abort_job, delete_job, start_app are in the manage_job category; all shell commands are put into the shell_commands category. 
Each category is also a right.\n", + "\n", + "* BYOC is defined as a right so that some users are allowed to submit jobs with BYOC whereas some are not.\n", + "\n", + "This right system makes it easy to write simple policies that only use command categories. 
It also makes it possible to write policies to control individual commands. When both categories and commands are used, command-based control takes precedence over category-based control.\n", + "\n", + "\n", + "## Controls and Conditions\n", + "\n", + "\n", + "| Notation | Condition | Examples |\n", + "|----------------|-----------------------------------------------------|--------------------|\n", + "| o:site | The user belongs to the site’s organization | |\n", + "| n:submitter | The user is the job submitter | |\n", + "| o:submitter | The user and the job submitter belong to the same org| |\n", + "| n:| The user is a specified person | n:john@nvidia.com |\n", + "| o: | The user is in a specified org | o:nvidia |\n", + "\n", + "The words “site” and “submitter” are reserved.\n", + "\n", + "For more details please refer [documentation](https://nvflare.readthedocs.io/en/main/user_guide/security/identity_security.html)\n", + "\n", + " \n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "nvflare_env", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_local_security_policy/local_security_policy.ipynb b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_local_security_policy/local_security_policy.ipynb deleted file mode 100644 index dc76018c81..0000000000 --- 
a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_local_security_policy/local_security_policy.ipynb +++ /dev/null @@ -1,33 +0,0 @@ -{ - "cells": [ - { - "cell_type": "code", - "execution_count": null, - "id": "ef224801-e14b-4b8e-92bb-1643364fbef9", - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "nvflare_example", - "language": "python", - "name": "nvflare_example" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.2" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/README.rst b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/README.rst new file mode 100644 index 0000000000..46a2ddf795 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/README.rst @@ -0,0 +1,177 @@ +Example for Federated Policies +============================== + + +Overview +-------- + +The purpose of this example is to demonstrate following features of NVFlare, + +1. Run NVFlare in secure mode +2. Show secure admin client and authentication +3. Demonstrate local authorization policy +4. Demonstrate local privacy policy + +System Requirements +------------------- + +1. Install Python and Virtual Environment, +:: + python3 -m venv nvflare-env + source nvflare-env/bin/activate + +2. 
Install NVFlare +:: + pip install nvflare + +3. The example is part of the NVFlare source code. The source code can be obtained like this, +:: + git clone https://github.com/NVIDIA/NVFlare.git + +4. TLS requires domain names. Please add the following line to the :code:`/etc/hosts` file, +:: + 127.0.0.1 server1 + + +Setup +_____ + +The :code:`project.yml` file defines all the sites and users (called admin in NVFlare) +used in the examples. The startup kits will be created by :code:`setup.sh` +:: + cd NVFlare/examples/advanced/federated-policies + ./setup.sh +All the startup kits will be generated in this folder, +:: + workspace/fed_policy/prod_00 + +.. note:: + The :code:`workspace` folder is removed every time :code:`setup.sh` is run. Please do not save customized + files in this folder. + +Starting NVFlare +________________ + +This script will start up the server and 2 clients, +:: + ./start.sh + +Logging with Admin Console +__________________________ + +In secure mode, NVFlare creates one startup kit for each user. There are 5 users in +this example so there are 5 folders for admin login under the :code:`workspace/fed_policy/prod_00` folder. + +To log in as a user, the appropriate folder must be selected. + +For example, this is how to log in as the :code:`admin@a.org` user, +:: + cd workspace/fed_policy/prod_00/admin@a.org + ./startup/fl_admin.sh +At the prompt, enter the user email :code:`admin@a.org` + +Multiple users can log in at the same time by using multiple terminals. + +The setup.sh script has copied the jobs folder to the workspace folder. +So jobs can be submitted by typing the following command in the admin console: + +:: + submit_job ../../job1 + +Participants +------------ +Site +____ +* :code:`server1`: NVFlare server +* :code:`site_a`: Client owned by a.org with a customized authorization policy, which only allows +users from the same org to submit jobs. +* :code:`site_b`: Client owned by b.org with a customized privacy policy. 
The policy defines +two scopes :code:`public` and :code:`private`. A custom filter is applied to :code:`private`. + +Users +_____ +* :code:`super@a.org`: Super user with role :code:`project_admin` who can do everything +* :code:`admin@a.org`: Admin for a.org with role :code:`org_admin` +* :code:`trainer@a.org`: Lead trainer for a.org with role :code:`lead` +* :code:`trainer@b.org`: Lead trainer for b.org with role :code:`lead` +* :code:`user@b.org`: Regular user for b.org with role :code:`member` + +Jobs +____ +All the jobs run the same app (numpy-sag) but have different scopes defined in :code:`meta.json`. + +* job1: Scope is :code:`public`. No filters. +* job2: Scope is :code:`test`. Test filters are applied to data and result. +* job3: Scope is :code:`private`. PercentilePrivacy filter is applied to result. +* job4: It has no scope defined. +* job5: It defines a non-existent scope :code:`foo` + + +Test Cases +---------- + +Authorization +_____________ +We will demo some authorization behaviors. + +Since the authorization decision is determined using each site's authorization.json and each admin user's role, +we just use :code:`job1` in all the following tests. + +.. 
list-table:: Authorization Use Cases + :widths: 14 20 50 + :header-rows: 1 + + * - User + - Command + - Expected behavior + * - trainer@a.org + - submit_job ../../job1 + - Job deployed and started on all sites + * - trainer@a.org + - clone_job [the job ID that we previously submitted] + - Job deployed and started on all sites + * - trainer@b.org + - clone_job [the job ID that we previously submitted] + - Rejected because submitter is in a different org + * - admin@a.org + - submit_job ../../job1 + - Rejected because role "org_admin" is not allowed to submit jobs + * - trainer@b.org + - submit_job ../../job1 + - site_a rejected the job because the submitter is in a different org, while site_b accepted the job + so the job will still run since in meta.json we specify min_clients as 1 + +Privacy +_______ +site_a has no privacy policy defined. +So we will test the following cases on site_b. + +Each job's meta.json specifies its "scope", and each site's privacy.json file defines the +privacy filters to apply for that scope. + +Note that jobs without a scope are treated as being in the "public" scope. + +Let's just use user trainer@b.org for the following tests. + +.. 
list-table:: Privacy Policy Use Cases + :widths: 10 50 + :header-rows: 1 + + * - Job + - Expected behavior + * - job1 + - Job deployed with no filters + * - job2 + - Job deployed with TestFilter applied + * - job3 + - Job deployed with PercentilePrivacy filter applied to the result + * - job4 + - Job deployed using default scope :code:`public` + * - job5 + - Job rejected by site_b because :code:`foo` doesn't exist + +Shutting down NVFlare +_____________________ +All NVFlare server and clients can be stopped by using this script, +:: + ./stop.sh diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/jobs/data/download.py b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/jobs/data/download.py new file mode 100644 index 0000000000..ebd8cfdc41 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/jobs/data/download.py @@ -0,0 +1,60 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# This Dirichlet sampling strategy for creating a heterogeneous partition is adopted +# from FedMA (https://github.com/IBM/FedMA). 
+ +# MIT License + +# Copyright (c) 2020 International Business Machines + +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: + +# The above copyright notice and this permission notice shall be included in all +# copies or substantial portions of the Software. + +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. 
+import argparse + +import torchvision.datasets as datasets + +# default dataset path +CIFAR10_ROOT = "/tmp/nvflare/data/cifar10" + + +def define_parser(): + parser = argparse.ArgumentParser() + parser.add_argument("--dataset_path", type=str, default=CIFAR10_ROOT, nargs="?") + args = parser.parse_args() + return args + + +def main(args): + datasets.CIFAR10(root=args.dataset_path, train=True, download=True) + datasets.CIFAR10(root=args.dataset_path, train=False, download=True) + + +if __name__ == "__main__": + main(define_parser()) diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/jobs/fl_job.py b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/jobs/fl_job.py new file mode 100644 index 0000000000..4e813132a0 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/jobs/fl_job.py @@ -0,0 +1,54 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ + +import os +import shutil + +from src.fedavg import FedAvg +from src.network import SimpleNetwork + +from nvflare.job_config.api import FedJob +from nvflare.job_config.script_runner import ScriptRunner + +if __name__ == "__main__": + num_clients = 2 + num_rounds = 2 + train_script = "src/client.py" + job_config_dir = "/tmp/nvflare/jobs/workdir" + + for i in range(5): + job_name = f"job_{i + 1}" + job = FedJob(name=job_name, min_clients=num_clients) + + controller = FedAvg( + stop_cond="accuracy > 25", + save_filename="global_model.pt", + initial_model=SimpleNetwork(), + num_clients=num_clients, + num_rounds=num_rounds, + ) + + job.to_server(controller) + + # Add clients + for site_name in ["site_a", "site_b"]: + executor = ScriptRunner(script=train_script) + job.to(executor, site_name) + + print("job-config is at ", os.path.join(job_config_dir, job_name)) + job.export_job(job_config_dir) + source_meta_file = os.path.join(f"job{i + 1}", "meta.json") + dest_meta_file = os.path.join(job_config_dir, job_name, "meta.json") + shutil.copy2(source_meta_file, dest_meta_file) diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/jobs/job1/meta.json b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/jobs/job1/meta.json new file mode 100644 index 0000000000..eb9f014a84 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/jobs/job1/meta.json @@ -0,0 +1,11 @@ +{ + "name": "job_1", + "resource_spec": {}, + "min_clients" : 1, + "deploy_map": { + "app": [ + "@ALL" + ] + }, + "scope": "public" +} diff --git 
a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/jobs/job2/meta.json b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/jobs/job2/meta.json new file mode 100644 index 0000000000..0fa0b5c4d6 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/jobs/job2/meta.json @@ -0,0 +1,11 @@ +{ + "name": "job_2", + "resource_spec": {}, + "min_clients" : 1, + "deploy_map": { + "app": [ + "@ALL" + ] + }, + "scope": "test" +} diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/jobs/job3/meta.json b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/jobs/job3/meta.json new file mode 100644 index 0000000000..35404ba28d --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/jobs/job3/meta.json @@ -0,0 +1,11 @@ +{ + "name": "job_3", + "resource_spec": {}, + "min_clients" : 1, + "deploy_map": { + "app": [ + "@ALL" + ] + }, + "scope": "private" +} diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/jobs/job4/meta.json 
b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/jobs/job4/meta.json new file mode 100644 index 0000000000..6ca06b879b --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/jobs/job4/meta.json @@ -0,0 +1,10 @@ +{ + "name": "job_4", + "resource_spec": {}, + "min_clients" : 1, + "deploy_map": { + "app": [ + "@ALL" + ] + } +} diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/jobs/job5/meta.json b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/jobs/job5/meta.json new file mode 100644 index 0000000000..2aba09beda --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/jobs/job5/meta.json @@ -0,0 +1,11 @@ +{ + "name": "job_5", + "resource_spec": {}, + "min_clients" : 1, + "deploy_map": { + "app": [ + "@ALL" + ] + }, + "scope": "foo" +} diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/jobs/requirements.txt b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/jobs/requirements.txt new file mode 100644 index 0000000000..57b4df2ed4 --- /dev/null +++ 
b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/jobs/requirements.txt @@ -0,0 +1,3 @@ +torch +torchvision +tensorboard \ No newline at end of file diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/jobs/src/client.py b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/jobs/src/client.py new file mode 100644 index 0000000000..220559b3cf --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/jobs/src/client.py @@ -0,0 +1,193 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import argparse + +import torch +import torch.nn as nn +import torch.optim as optim +import torchvision +import torchvision.transforms as transforms +from network import SimpleNetwork + +# (1) import nvflare client API +import nvflare.client as flare +from nvflare.app_common.app_constant import ModelName + +# (optional) set a fix place so we don't need to download everytime +CIFAR10_ROOT = "/tmp/nvflare/data/cifar10" + +# (optional) We change to use GPU to speed things up. +# if you want to use CPU, change DEVICE="cpu" +DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") + + +def define_parser(): + parser = argparse.ArgumentParser() + parser.add_argument("--dataset_path", type=str, default=CIFAR10_ROOT, nargs="?") + parser.add_argument("--batch_size", type=int, default=4, nargs="?") + parser.add_argument("--learning_rate", type=float, default=0.001, nargs="?") + parser.add_argument("--num_workers", type=int, default=1, nargs="?") + parser.add_argument("--local_epochs", type=int, default=2, nargs="?") + parser.add_argument("--model_path", type=str, default=f"{CIFAR10_ROOT}/cifar_net.pth", nargs="?") + return parser.parse_args() + + +def main(): + # define local parameters + args = define_parser() + + dataset_path = args.dataset_path + batch_size = args.batch_size + num_workers = args.num_workers + local_epochs = args.local_epochs + model_path = args.model_path + lr = args.learning_rate + + transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]) + trainset = torchvision.datasets.CIFAR10(root=dataset_path, train=True, download=True, transform=transform) + trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size, shuffle=True, num_workers=num_workers) + testset = torchvision.datasets.CIFAR10(root=dataset_path, train=False, download=True, transform=transform) + testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size, shuffle=False, num_workers=num_workers) + + 
net = SimpleNetwork() + best_accuracy = 0.0 + + # wraps evaluation logic into a method to re-use for + # evaluation on both trained and received model + def evaluate(input_weights): + net = SimpleNetwork() + net.load_state_dict(input_weights) + # (optional) use GPU to speed things up + net.to(DEVICE) + + correct = 0 + total = 0 + # since we're not training, we don't need to calculate the gradients for our outputs + with torch.no_grad(): + for data in testloader: + # (optional) use GPU to speed things up + images, labels = data[0].to(DEVICE), data[1].to(DEVICE) + # calculate outputs by running images through the network + outputs = net(images) + # the class with the highest energy is what we choose as prediction + _, predicted = torch.max(outputs.data, 1) + total += labels.size(0) + correct += (predicted == labels).sum().item() + + return 100 * correct // total + + # (2) initialize NVFlare client API + flare.init() + + # (3) run continously when launch_once=true + while flare.is_running(): + + # (4) receive FLModel from NVFlare + input_model = flare.receive() + client_id = flare.get_site_name() + + # Based on different "task" we will do different things + # for "train" task (flare.is_train()) we use the received model to do training and/or evaluation + # and send back updated model and/or evaluation metrics, if the "train_with_evaluation" is specified as True + # in the config_fed_client we will need to do evaluation and include the evaluation metrics + # for "evaluate" task (flare.is_evaluate()) we use the received model to do evaluation + # and send back the evaluation metrics + # for "submit_model" task (flare.is_submit_model()) we just need to send back the local model + # (5) performing train task on received model + if flare.is_train(): + print(f"({client_id}) current_round={input_model.current_round}, total_rounds={input_model.total_rounds}") + + # (5.1) loads model from NVFlare + net.load_state_dict(input_model.params) + + criterion = nn.CrossEntropyLoss() + 
optimizer = optim.SGD(net.parameters(), lr=lr, momentum=0.9) + + # (optional) use GPU to speed things up + net.to(DEVICE) + # (optional) calculate total steps + steps = local_epochs * len(trainloader) + for epoch in range(local_epochs): # loop over the dataset multiple times + + running_loss = 0.0 + for i, data in enumerate(trainloader, 0): + # get the inputs; data is a list of [inputs, labels] + # (optional) use GPU to speed things up + inputs, labels = data[0].to(DEVICE), data[1].to(DEVICE) + + # zero the parameter gradients + optimizer.zero_grad() + + # forward + backward + optimize + outputs = net(inputs) + loss = criterion(outputs, labels) + loss.backward() + optimizer.step() + + # print statistics + running_loss += loss.item() + if i % 2000 == 1999: # print every 2000 mini-batches + print(f"({client_id}) [{epoch + 1}, {i + 1:5d}] loss: {running_loss / 2000:.3f}") + running_loss = 0.0 + break + + print(f"({client_id}) Finished Training") + + # (5.2) evaluation on local trained model to save best model + local_accuracy = evaluate(net.state_dict()) + print(f"({client_id}) Evaluating local trained model. Accuracy on the 10000 test images: {local_accuracy}") + if local_accuracy > best_accuracy: + best_accuracy = local_accuracy + torch.save(net.state_dict(), model_path) + + # (5.3) evaluate on received model for model selection + accuracy = evaluate(input_model.params) + print( + f"({client_id}) Evaluating received model for model selection. 
Accuracy on the 10000 test images: {accuracy}" + ) + + # (5.4) construct trained FL model + output_model = flare.FLModel( + params=net.cpu().state_dict(), + metrics={"accuracy": accuracy}, + meta={"NUM_STEPS_CURRENT_ROUND": steps}, + ) + + # (5.5) send model back to NVFlare + flare.send(output_model) + + # (6) performing evaluate task on received model + elif flare.is_evaluate(): + accuracy = evaluate(input_model.params) + print(f"({client_id}) accuracy: {accuracy}") + flare.send(flare.FLModel(metrics={"accuracy": accuracy})) + + # (7) performing submit_model task to obtain best local model + elif flare.is_submit_model(): + model_name = input_model.meta["submit_model_name"] + if model_name == ModelName.BEST_MODEL: + try: + weights = torch.load(model_path) + net = SimpleNetwork() + net.load_state_dict(weights) + flare.send(flare.FLModel(params=net.cpu().state_dict())) + except Exception as e: + raise ValueError("Unable to load best model") from e + else: + raise ValueError(f"Unknown model_type: {model_name}") + + +if __name__ == "__main__": + main() diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/jobs/src/fedavg.py b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/jobs/src/fedavg.py new file mode 100644 index 0000000000..a63f0005bf --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/jobs/src/fedavg.py @@ -0,0 +1,158 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +from typing import Callable, Dict, List, Optional + +import torch + +from nvflare.app_common.abstract.fl_model import FLModel +from nvflare.app_common.utils.math_utils import parse_compare_criteria +from nvflare.app_common.workflows.base_fedavg import BaseFedAvg +from nvflare.app_opt.pt.decomposers import TensorDecomposer +from nvflare.fuel.utils import fobs + + +class FedAvg(BaseFedAvg): + """Controller for FedAvg Workflow with Early Stopping and Model Selection. + + Args: + num_clients (int, optional): The number of clients. Defaults to 3. + num_rounds (int, optional): The total number of training rounds. Defaults to 5. + stop_cond (str, optional): early stopping condition based on metric. + string literal in the format of "<key> <op> <value>" (e.g. 
"accuracy >= 80") + save_filename (str, optional): filename for saving model + initial_model (nn.Module, optional): initial PyTorch model + """ + + def __init__( + self, + *args, + stop_cond: str, + num_rounds: int, + save_filename: str = "FL_global_model.pt", + initial_model=None, + **kwargs, + ): + super().__init__(*args, **kwargs) + + self.stop_cond = stop_cond + self.num_rounds = num_rounds + + if stop_cond: + self.stop_condition = parse_compare_criteria(stop_cond) + else: + self.stop_condition = None + self.save_filename = save_filename + self.initial_model = initial_model + self.best_model: Optional[FLModel] = None + + def run(self) -> None: + self.info("Start FedAvg.") + + if self.initial_model: + # Use FOBS for serializing/deserializing PyTorch tensors (self.initial_model) + fobs.register(TensorDecomposer) + # PyTorch weights + initial_weights = self.initial_model.state_dict() + else: + initial_weights = {} + + model = FLModel(params=initial_weights) + + model.start_round = self.start_round + model.total_rounds = self.num_rounds + + for self.current_round in range(self.start_round, self.start_round + self.num_rounds): + self.info(f"Round {self.current_round} started.") + model.current_round = self.current_round + + clients = self.sample_clients(self.num_clients) + + results: List[FLModel] = self.send_model_and_wait(targets=clients, data=model) + aggregate_results = self.aggregate( + results, aggregate_fn=self.aggregate_fn + ) # using default aggregate_fn with `WeightedAggregationHelper`. 
Can overwrite self.aggregate_fn with signature Callable[List[FLModel], FLModel] + + model = self.update_model(model, aggregate_results) + + self.info(f"Round {self.current_round} global metrics: {model.metrics}") + + self.select_best_model(model) + + self.save_model(self.best_model, os.path.join(os.getcwd(), self.save_filename)) + + if self.should_stop(model.metrics, self.stop_condition): + self.info( + f"Stopping at round={self.current_round} out of total_rounds={self.num_rounds}. Early stop condition satisfied: {self.stop_condition}" + ) + break + + self.info("Finished FedAvg.") + + def should_stop(self, metrics: Optional[Dict] = None, stop_condition: Optional[str] = None): + if stop_condition is None or metrics is None: + return False + + key, target, op_fn = stop_condition + value = metrics.get(key, None) + + if value is None: + raise RuntimeError(f"stop criteria key '{key}' doesn't exists in metrics") + + return op_fn(value, target) + + def select_best_model(self, curr_model: FLModel): + if self.best_model is None: + self.best_model = curr_model + return + + if self.stop_condition: + metric, _, op_fn = self.stop_condition + if self.is_curr_model_better(self.best_model, curr_model, metric, op_fn): + self.info("Current model is new best model.") + self.best_model = curr_model + else: + self.best_model = curr_model + + def is_curr_model_better( + self, best_model: FLModel, curr_model: FLModel, target_metric: str, op_fn: Callable + ) -> bool: + curr_metrics = curr_model.metrics + if curr_metrics is None: + return False + if target_metric not in curr_metrics: + return False + + best_metrics = best_model.metrics + return op_fn(curr_metrics.get(target_metric), best_metrics.get(target_metric)) + + def save_model(self, model, filepath=""): + params = model.params + # PyTorch save + torch.save(params, filepath) + + # save FLModel metadata + model.params = {} + fobs.dumpf(model, filepath + ".metadata") + model.params = params + + def load_model(self, filepath=""): + # 
PyTorch load + params = torch.load(filepath) + + # load FLModel metadata + model = fobs.loadf(filepath + ".metadata") + model.params = params + return model diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/jobs/src/fl_job.py b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/jobs/src/fl_job.py new file mode 100644 index 0000000000..8fc6f73846 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/jobs/src/fl_job.py @@ -0,0 +1,64 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ + +import os + +from src.fedavg import FedAvg +from src.network import SimpleNetwork + +from nvflare.job_config.api import FedJob +from nvflare.job_config.script_runner import ScriptRunner + +if __name__ == "__main__": + num_clients = 5 + num_rounds = 5 + job_name = "fedavg" + train_script = "src/client.py" + + job = FedJob(name=job_name, min_clients=num_clients) + + controller = FedAvg( + stop_cond="accuracy > 25", + save_filename="global_model.pt", + initial_model=SimpleNetwork(), + num_clients=num_clients, + num_rounds=num_rounds, + ) + + job.to_server(controller) + + # Add clients + + executor_1 = ScriptRunner(script=train_script, script_args="--learning_rate 0.01 --batch_size 12") + job.to(executor_1, "site-1") + + executor_2 = ScriptRunner(script=train_script, script_args="--learning_rate 0.01 --batch_size 10") + job.to(executor_2, "site-2") + + executor_3 = ScriptRunner(script=train_script, script_args="--learning_rate 0.001 --batch_size 8") + job.to(executor_3, "site-3") + + executor_4 = ScriptRunner(script=train_script, script_args="--learning_rate 0.001 --batch_size 6") + job.to(executor_4, "site-4") + + executor_5 = ScriptRunner(script=train_script, script_args="--learning_rate 0.0001 --batch_size 4") + job.to(executor_5, "site-5") + + job_config_dir = "/tmp/nvflare/jobs/workdir" + + print("job-config is at ", os.path.join(job_config_dir, job_name)) + + # job.export_job(job_config_dir) + job.simulator_run(job_config_dir) diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/jobs/src/network.py b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/jobs/src/network.py new file mode 100644 index 0000000000..609b0b1581 --- /dev/null +++ 
b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/jobs/src/network.py @@ -0,0 +1,37 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import torch +import torch.nn as nn +import torch.nn.functional as F + + +class SimpleNetwork(nn.Module): + def __init__(self): + super(SimpleNetwork, self).__init__() + self.conv1 = nn.Conv2d(3, 6, 5) + self.pool = nn.MaxPool2d(2, 2) + self.conv2 = nn.Conv2d(6, 16, 5) + self.fc1 = nn.Linear(16 * 5 * 5, 120) + self.fc2 = nn.Linear(120, 84) + self.fc3 = nn.Linear(84, 10) + + def forward(self, x): + x = self.pool(F.relu(self.conv1(x))) + x = self.pool(F.relu(self.conv2(x))) + x = torch.flatten(x, 1) # flatten all dimensions except batch + x = F.relu(self.fc1(x)) + x = F.relu(self.fc2(x)) + x = self.fc3(x) + return x diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/policies/site_a/authorization.json b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/policies/site_a/authorization.json new file mode 100644 index 0000000000..b1c9e1909f --- /dev/null +++ 
b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/policies/site_a/authorization.json @@ -0,0 +1,29 @@ +{ + "format_version": "1.0", + "permissions": { + "project_admin": "any", + "org_admin": { + "submit_job": "none", + "clone_job": "none", + "manage_job": "o:submitter", + "download_job": "o:submitter", + "view": "any", + "operate": "o:site", + "shell_commands": "o:site", + "byoc": "none" + }, + "lead": { + "submit_job": "o:site", + "clone_job": "n:submitter", + "manage_job": "n:submitter", + "download_job": "n:submitter", + "view": "any", + "operate": "o:site", + "shell_commands": "o:site", + "byoc": "any" + }, + "member": { + "view": "any" + } + } +} diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/policies/site_a/resources.json b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/policies/site_a/resources.json new file mode 100644 index 0000000000..930bb28786 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/policies/site_a/resources.json @@ -0,0 +1,24 @@ +{ + "format_version": 2, + "client": { + "retry_timeout": 30, + "compression": "Gzip" + }, + "components": [ + { + "id": "resource_manager", + "path": "nvflare.app_common.resource_managers.list_resource_manager.ListResourceManager", + "args": { + "resources": { + "gpu": [0, 1] + } + } + }, + { + "id": "resource_consumer", + "path": "nvflare.app_common.resource_consumers.list_resource_consumer.ListResourceConsumer", + "args": { + } + } + ] +} diff --git 
a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/policies/site_b/custom/test_filter.py b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/policies/site_b/custom/test_filter.py new file mode 100644 index 0000000000..839a689154 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/policies/site_b/custom/test_filter.py @@ -0,0 +1,34 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import logging + +from nvflare.apis.filter import ContentBlockedException, Filter +from nvflare.apis.fl_context import FLContext +from nvflare.apis.shareable import Shareable + +log = logging.getLogger(__name__) + + +class TestFilter(Filter): + def __init__(self, name, block=False): + self.name = name + self.block = block + + def process(self, shareable: Shareable, fl_ctx: FLContext) -> Shareable: + if self.block: + log.info(f"Filter {self.name} blocked the content") + raise ContentBlockedException("Content blocked by filter " + self.name) + + log.info(f"Filter {self.name} is invoked") + return shareable diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/policies/site_b/privacy.json b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/policies/site_b/privacy.json new file mode 100644 index 0000000000..2e320a0f0c --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/policies/site_b/privacy.json @@ -0,0 +1,54 @@ +{ + "scopes": [ + { + "name": "public", + "task_result_filters": [ + ], + "task_data_filters": [ + ] + + }, + { + "name": "test", + "task_result_filters": [ + { + "path": "test_filter.TestFilter", + "args": { + "name": "============== Result filter for test scope ============" + } + } + ], + "task_data_filters": [ + { + "path": "test_filter.TestFilter", + "args": { + "name": "============== Data filter for test scope ============" + } + } + ] + }, + { + "name": "private", + "task_result_filters": [ + { + "path": "nvflare.app_common.filters.percentile_privacy.PercentilePrivacy", + "args": { + "percentile": 10, + "gamma": 7.5 + } + } + ], + "task_data_filters": [ + { + 
"path": "test_filter.TestFilter", + "args": { + "name": "============== Data filter for test scope ============" + } + } + ] + } + + + ], + "default_scope": "public" +} diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/project.yml b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/project.yml new file mode 100644 index 0000000000..13cdef1845 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/project.yml @@ -0,0 +1,55 @@ +api_version: 3 +name: fed_policy +description: NVFlare example for federated policies + +participants: + - name: rtx + type: server + org: org_a + fed_learn_port: 8002 + admin_port: 8003 + - name: site_a + type: client + org: org_a + - name: site_b + type: client + org: org_b + + - name: super@a.org + type: admin + org: org_a + role: project_admin + - name: admin@a.org + type: admin + org: org_a + role: org_admin + - name: trainer@a.org + type: admin + org: org_a + role: lead + - name: trainer@b.org + type: admin + org: org_b + role: lead + - name: user@b.org + type: admin + org: org_b + role: member + +# The same methods in all builders are called in their order defined in builders section +builders: + - path: nvflare.lighter.impl.workspace.WorkspaceBuilder + args: + template_file: master_template.yml + - path: nvflare.lighter.impl.static_file.StaticFileBuilder + args: + # config_folder can be set to inform NVIDIA FLARE where to get configuration + config_folder: config + overseer_agent: + path: nvflare.ha.dummy_overseer_agent.DummyOverseerAgent + overseer_exists: false + args: + sp_end_point: server1:8002:8003 + + - path: 
nvflare.lighter.impl.cert.CertBuilder + - path: nvflare.lighter.impl.signature.SignatureBuilder diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/site_policy.ipynb b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/site_policy.ipynb new file mode 100644 index 0000000000..ab2b42fb10 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/site_policy.ipynb @@ -0,0 +1,475 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "ef224801-e14b-4b8e-92bb-1643364fbef9", + "metadata": {}, + "source": [ + "# Site-specific security and privacy policies\n", + "\n", + "NVIDIA FLARE offers a set of security mechanisms to control user access to different operations based on user roles. These controls are not limited to the centralized server; they also apply at each site. This site-level security is enforced through security policies, which FLARE calls federated policies; a more descriptive name would be site-specific security and privacy policies.\n", + "\n", + "[FLARE's Security documentation](https://nvflare.readthedocs.io/en/2.4/user_guide/security/identity_security.html) has details regarding:\n", + "* Centralized vs. Federated Authorization \n", + "* Policy Configuration \n", + "* Roles and Rights\n", + "* Controls and Conditions\n", + "* Command Categories\n", + "* Policy Evaluation\n", + "* Command Authorization Process\n", + "* Job Submission authorization process\n", + "* Job Management Commands authorization \n", + "\n", + "In this section we dive into the usage of these policies via a few examples, but first a refresher on the terminology used in the examples. 
\n", + "\n", + "\n", + "_____________\n", + "\n", + "\n", + "Refresher: Command Category and Control Notations\n", + "----------\n", + "Before we go to the test cases, let's refresh some concepts and notations.\n", + "\n", + "**Command Category**\n", + "\n", + "| permission | command |\n", + "|------------------|-----------------|\n", + "| MANAGE_JOB | ABORT |\n", + "| MANAGE_JOB | ABORT_JOB |\n", + "| MANAGE_JOB | START_APP |\n", + "| MANAGE_JOB | DELETE_JOB |\n", + "| MANAGE_JOB | DELETE_WORKSPACE|\n", + "| VIEW | CHECK_STATUS |\n", + "| VIEW | SHOW_STATS |\n", + "| VIEW | RESET_ERRORS |\n", + "| VIEW | SHOW_ERRORS |\n", + "| VIEW | LIST_JOBS |\n", + "| OPERATE | SYS_INFO |\n", + "| OPERATE | RESTART |\n", + "| OPERATE | SHUTDOWN |\n", + "| OPERATE | REMOVE_CLIENT |\n", + "| OPERATE | SET_TIMEOUT |\n", + "| OPERATE | CALL |\n", + "| SHELL_COMMANDS | SHELL_CAT |\n", + "| SHELL_COMMANDS | SHELL_GREP |\n", + "| SHELL_COMMANDS | SHELL_HEAD |\n", + "| SHELL_COMMANDS | SHELL_LS |\n", + "| SHELL_COMMANDS | SHELL_PWD |\n", + "| SHELL_COMMANDS | SHELL_TAIL |\n", + "\n", + "\n", + "\n", + "**Notation and Condition**\n", + "\n", + "| Notation | Condition | Examples |\n", + "|----------------|------------------------------------------------|---------------------|\n", + "| o:site | The user belongs to the site’s organization | |\n", + "| n:submitter | The user is the job submitter | |\n", + "| o:submitter | The user and the job submitter belong to the same org | |\n", + "| n:| The user is a specified person | n:john@nvidia.com |\n", + "| o: | The user is in a specified org | o:nvidia |\n", + "\n", + "The words “site” and “submitter” are reserved.\n", + "\n", + "In addition, two words are used for extreme conditions:\n", + "\n", + "* Any user is allowed: any\n", + "* No user is allowed: none\n", + "\n", + "A control is a set of one or more conditions that is specified in the permission matrix. 
Conditions specify relationships among the subject user, the site, and the job submitter. The following relationships are supported:\n", + "\n", + "* The user belongs to the site’s organization (user org = site org)\n", + "* The user is the job submitter (user name = submitter name)\n", + "* The user and the job submitter are in the same org (user org = submitter org)\n", + "* The user is a specified person (user name = specified name)\n", + "* The user is in a specified org (user org = specified org)\n", + "\n", + "Keep in mind that the relationship is always relative to the subject user: we check whether the user’s name or org has the right relationship with the site or job submitter.\n", + "\n", + "\n", + "\n", + "---- \n", + "## Overview\n", + "\n", + "Now we are ready to discuss the examples for federated site policy. The purpose of this example is to demonstrate the following features of NVFlare:\n", + "\n", + "1. Show secure admin client and authentication\n", + "2. Demonstrate local authorization policy\n", + "3. Demonstrate local privacy policy\n", + " \n", + "## Participants\n", + "\n", + "### Sites\n", + "\n", + "* `server`: NVFlare server\n", + "* `site_a`: Client owned by a.org with a customized authorization policy, which only allows\n", + "users from the same org to submit jobs.\n", + "* `site_b`: Client owned by b.org with a customized privacy policy. The policy defines\n", + "three scopes: `public`, `test`, and `private`. 
A custom filter is applied to `private`.\n", + "\n", + "### Users\n", + "\n", + "* `super@a.org`: Super user with role `project_admin` who can do everything\n", + "* `admin@a.org`: Admin for a.org with role `org_admin`\n", + "* `trainer@a.org`: Lead trainer for a.org with role `lead`\n", + "* `trainer@b.org`: Lead trainer for b.org with role `lead`\n", + "* `user@b.org`: Regular user for b.org with role `member`\n", + "\n", + "### Jobs\n", + "\n", + "All the jobs run the same app but have different scopes defined in `meta.json`.\n", + "\n", + "* job1: Scope is `public`. No filters.\n", + "* job2: Scope is `test`. Test filters are applied to data and result.\n", + "* job3: Scope is `private`. PercentilePrivacy filter is applied to result.\n", + "* job4: It has no scope defined.\n", + "* job5: It defines a non-existent scope `foo`\n", + "\n", + "\n", + "\n", + "### Test Cases\n", + "\n", + "#### Authorization\n", + "\n", + "We will demo some authorization behaviors.\n", + "\n", + "Since the authorization decision is determined by each site's authorization.json and each admin user's role,\n", + "we just use `job1` in all the following tests.\n", + "\n", + "| User | Command | Expected behavior |\n", + "|----------------|----------------------------------------------|-----------------------------------------------------------------------------------|\n", + "| trainer@a.org | submit_job /tmp/nvflare/jobs/workdir/job_1 | Job deployed and started on all sites |\n", + "| trainer@a.org | clone_job [the job ID that we previously submitted] | Job deployed and started on all sites |\n", + "| trainer@b.org | clone_job [the job ID that we previously submitted] | Rejected because submitter is in a different org |\n", + "| admin@a.org | submit_job /tmp/nvflare/jobs/workdir/job_1 | Rejected 
because role \"org_admin\" is not allowed to submit jobs |\n", + "| trainer@b.org | submit_job /tmp/nvflare/jobs/workdir/job_1 | site_a rejects the job because the submitter is in a different org, while site_b accepts it; the job still runs because meta.json specifies min_clients as 1 |\n", + "\n", + "#### Privacy\n", + "\n", + "site_a has no privacy policy defined.\n", + "So we will test the following cases on site_b.\n", + "\n", + "Each job's meta.json specifies its \"scope\", and each site's privacy.json file defines the\n", + "privacy filters that the site applies for that scope.\n", + "\n", + "Note that jobs with no scope defined are treated as being in the default scope, \"public\".\n", + "\n", + "Let's just use user trainer@b.org for the following tests.\n", + "\n", + "| Job | Expected behavior |\n", + "|------|--------------------|\n", + "| job1 | Job deployed with no filters |\n", + "| job2 | Job deployed with TestFilter applied |\n", + "| job3 | Job deployed with PercentilePrivacy filter applied to the result |\n", + "| job4 | Job deployed using default scope `public` |\n", + "| job5 | Job rejected by site_b because `foo` doesn't exist |\n" + ] + }, + { + "cell_type": "markdown", + "id": "8674b81f", + "metadata": {}, + "source": [ + "Set Up the FL System and Site Policies\n", + "----------" + ] + }, + { + "cell_type": "markdown", + "id": "5d44b2c4", + "metadata": {}, + "source": [ + "* Prepare POC with the given project.yml file" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "97c7bfa1", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "prepare poc at /tmp/nvflare/poc with code/federated-policies/project.yml\n", + "This will delete poc workspace directory: '/tmp/nvflare/poc' and create a new one. Is it OK to proceed? (y/N) provision at /tmp/nvflare/poc for 2 clients with code/federated-policies/project.yml\n", + "Generated results can be found under /tmp/nvflare/poc/fed_policy/prod_00. 
\n" + ] + } + ], + "source": [ + "! echo y | nvflare poc prepare -i code/federated-policies/project.yml" + ] + }, + { + "cell_type": "markdown", + "id": "37a97825", + "metadata": {}, + "source": [ + "* Set up policies for different sites " + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "50540db7", + "metadata": {}, + "outputs": [], + "source": [ + "Workspace = \"/tmp/nvflare/poc/fed_policy/prod_00\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f9c89760", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "! cp -r code/federated-policies/policies/site_a/* {Workspace}/site_a/local\n", + "! cp -r code/federated-policies/policies/site_b/* {Workspace}/site_b/local" + ] + }, + { + "cell_type": "markdown", + "id": "366b9749", + "metadata": {}, + "source": [ + "We can take a look at the policies for site_a\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "091c22fe", + "metadata": {}, + "outputs": [], + "source": [ + "!cat {Workspace}/site_a/local/authorization.json" + ] + }, + { + "cell_type": "markdown", + "id": "521b3607", + "metadata": {}, + "source": [ + "##### Site-a Security Policy\n", + "\n", + "For the **org_admin** user\n", + "\n", + "| Capability | Permission Scope |\n", + "|------------------|------------------|\n", + "| submit_job | none |\n", + "| clone_job | none |\n", + "| manage_job | o:submitter |\n", + "| download_job | o:submitter |\n", + "| view | any |\n", + "| operate | o:site |\n", + "| shell_commands | o:site |\n", + "| byoc | none |\n", + "\n", + "\n", + "This table sets the policy for the organization admin role \"org_admin\". A user with this role:\n", + "* cannot submit jobs \n", + "* cannot clone jobs\n", + "* can manage jobs (for example, abort a job) that were submitted by a \"job submitter\" from the same organization. 
The \"job submitter\" is the user who has the submit_job permission \n", + "* can download jobs submitted by users from the same organization \n", + "* can view any job\n", + "* can perform operate and shell commands (see the Command Category table for details) on sites that belong to the same organization \n", + "* cannot use byoc (bring your own code): customized code in the \"custom\" directory. \n", + "\n", + "Similarly, we have policies for the other user roles.\n", + "\n", + "\n", + "For the **lead** user\n", + "\n", + "| Capability | Permission Scope |\n", + "|------------------|------------------|\n", + "| submit_job | o:site |\n", + "| clone_job | n:submitter |\n", + "| manage_job | n:submitter |\n", + "| download_job | n:submitter |\n", + "| view | any |\n", + "| operate | o:site |\n", + "| shell_commands | o:site |\n", + "| byoc | any |\n", + "\n", + "A user with the \"lead\" role is a \"submitter\", but can submit jobs only to sites that belong to their own organization (an organization may have many sites).\n", + "\n", + "\n", + "\n", + "For the **member** user\n", + "\n", + "| Capability | Permission Scope |\n", + "|------------|------------------|\n", + "| view | any |\n", + "\n", + " " + ] + }, + { + "cell_type": "markdown", + "id": "deb08e63", + "metadata": {}, + "source": [ + "Next, we can take a look at the privacy policy for site_b.\n", + "\n", + "##### Site_b privacy policy" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "08461d9a", + "metadata": {}, + "outputs": [], + "source": [ + "!cat {Workspace}/site_b/local/privacy.json" + ] + }, + { + "cell_type": "markdown", + "id": "319eb7a5", + "metadata": {}, + "source": [ + "| scope | default_scope | task_data_filters | task_result_filters |\n", + "|---------|----------------|-----------------------------------------|--------------------------------------------------------------------|\n", + "| public | public | | |\n", + "| test | public | test_filter.TestFilter | test_filter.TestFilter |\n", + "| private | public | test_filter.TestFilter | 
nvflare.app_common.filters.percentile_privacy.PercentilePrivacy |\n" + ] + }, + { + "cell_type": "markdown", + "id": "7c84b47b", + "metadata": {}, + "source": [ + "In the privacy policy, notice that different scopes can be defined. The scope-specific policy is enforced by FLARE's filter mechanism. For each non-public scope, one can define different filters to enforce the desired behavior. For example, the organization administrator can add a filter to prevent certain types of data from being accidentally leaked by data scientists.\n", + "\n", + "The policy files and filters on site_b look like this: \n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ff5ac5f1", + "metadata": {}, + "outputs": [], + "source": [ + "!tree {Workspace}/site_b/local" + ] + }, + { + "cell_type": "markdown", + "id": "519dc42f", + "metadata": {}, + "source": [ + "Let's prepare the jobs. We have five different jobs, each with a different scope. \n", + "\n", + "##### Create Job Configs" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "26f9cf38", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/home/chester/projects/NVFlare/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy/code/federated-policies/jobs\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/home/chester/.local/lib/python3.10/site-packages/IPython/core/magics/osm.py:417: UserWarning: This is now an optional IPython functionality, setting dhist requires you to install the `pickleshare` library.\n", + " self.shell.db['dhist'] = compress_dhist(dhist)[-100:]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "job-config is at /tmp/nvflare/jobs/workdir/job_1\n", + "job-config is at /tmp/nvflare/jobs/workdir/job_2\n", + "job-config is at /tmp/nvflare/jobs/workdir/job_3\n", + "job-config is at 
/tmp/nvflare/jobs/workdir/job_4\n", + "job-config is at /tmp/nvflare/jobs/workdir/job_5\n", + "/home/chester/projects/NVFlare/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security_privacy_policy\n" + ] + } + ], + "source": [ + "%cd code/federated-policies/jobs\n", + "\n", + "! python fl_job.py\n", + "\n", + "%cd -" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c9cef14a", + "metadata": {}, + "outputs": [], + "source": [ + "!cat /tmp/nvflare/jobs/workdir/job_3/meta.json" + ] + }, + { + "cell_type": "markdown", + "id": "8cc01a6e", + "metadata": {}, + "source": [ + "### Start FL System and Run Jobs\n", + "\n", + "Start POC with \n", + "\n", + "```\n", + "nvflare poc start\n", + "```\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "89b58024", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/admin/local/custom/admin_auth.py b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/admin/local/custom/admin_auth.py new file mode 100644 index 0000000000..9c6a8852f5 --- /dev/null +++ 
b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/admin/local/custom/admin_auth.py @@ -0,0 +1,76 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import getpass +import json + +import requests + +from nvflare.fuel.hci.client.event import EventContext, EventHandler, EventPropKey, EventType + + +class AdminAuth(EventHandler): + def __init__(self, orgs: dict): + # orgs is a dict of name => endpoint of the org's auth service + self.orgs = orgs + self.auth_tokens = {} + self.authentication_done = False + self.passwords = {} + + def _get_passwords_to_all_sites(self): + # This example asks the admin user to type in the password to each org. + for org_name, _ in self.orgs.items(): + password = getpass.getpass(f"Password to {org_name}:") + self.passwords[org_name] = password + + def _auth_org(self, user_name: str, org_name: str, endpoint: str) -> str: + try: + # The access token query depending on the KeyCloak user and client set up. + # We set up the user using the same admin user name for demonstrating. 
+ payload = { + "client_id": "myclient", + "username": user_name, + "password": self.passwords[org_name], + "grant_type": "password", + } + response = requests.post(endpoint, data=payload) + token = json.loads(response.text).get("access_token") + except Exception: + token = None + # Raising an exception here would prevent the admin tool from connecting to the admin server + # and would terminate the admin tool, so fall back to an empty token instead. + return f"{user_name}:{token}" + + def _authenticate_user_to_all_sites(self, ctx: EventContext): + user_name = ctx.get_prop(EventPropKey.USER_NAME) + for org_name, ep in self.orgs.items(): + access_token = self._auth_org(user_name, org_name, ep) + self.auth_tokens[org_name] = access_token + + def handle_event(self, event_type: str, ctx: EventContext): + if event_type == EventType.LOGIN_SUCCESS: + # called after the user is logged in successfully + # print("got event: LOGIN_SUCCESS") + if not self.authentication_done: + # print("authenticating user to orgs ...") + self._authenticate_user_to_all_sites(ctx) + elif event_type == EventType.BEFORE_EXECUTE_CMD: + cmd_name = ctx.get_prop(EventPropKey.CMD_NAME) + # print(f"got event: BEFORE_EXECUTE_CMD for cmd {cmd_name}") + if cmd_name == "submit_job": + # print(f"adding auth_tokens: {self.auth_tokens}") + ctx.set_custom_prop("auth_tokens", self.auth_tokens) + # print("added custom prop!") + elif event_type == EventType.BEFORE_LOGIN: + self._get_passwords_to_all_sites() diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/admin/local/resources.json b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/admin/local/resources.json new file mode 100644 index 0000000000..bbd34fba77 --- /dev/null +++ 
b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/admin/local/resources.json @@ -0,0 +1,14 @@ +{ + "format_version": 2, + "handlers": [ + { + "id": "auth", + "path": "admin_auth.AdminAuth", + "args": { + "orgs": { + "site-1": "http://localhost:8080/realms/myrealm/protocol/openid-connect/token" + } + } + } + ] +} diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/code/data/download.py b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/code/data/download.py new file mode 100644 index 0000000000..ebd8cfdc41 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/code/data/download.py @@ -0,0 +1,60 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# This Dirichlet sampling strategy for creating a heterogeneous partition is adopted +# from FedMA (https://github.com/IBM/FedMA). 
+ +# MIT License + +# Copyright (c) 2020 International Business Machines + +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: + +# The above copyright notice and this permission notice shall be included in all +# copies or substantial portions of the Software. + +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. 
+import argparse + +import torchvision.datasets as datasets + +# default dataset path +CIFAR10_ROOT = "/tmp/nvflare/data/cifar10" + + +def define_parser(): + parser = argparse.ArgumentParser() + parser.add_argument("--dataset_path", type=str, default=CIFAR10_ROOT, nargs="?") + args = parser.parse_args() + return args + + +def main(args): + datasets.CIFAR10(root=args.dataset_path, train=True, download=True) + datasets.CIFAR10(root=args.dataset_path, train=False, download=True) + + +if __name__ == "__main__": + main(define_parser()) diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/code/fl_jobs.py b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/code/fl_jobs.py new file mode 100644 index 0000000000..e36a793ae5 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/code/fl_jobs.py @@ -0,0 +1,50 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ + +import os + +from src.fedavg import FedAvg +from src.network import SimpleNetwork + +from nvflare.job_config.api import FedJob +from nvflare.job_config.script_runner import ScriptRunner + +if __name__ == "__main__": + num_clients = 2 + num_rounds = 2 + job_name = "fedavg" + train_script = "src/client.py" + config_dir = "/tmp/nvflare/jobs/workdir" + + job = FedJob(name=job_name, min_clients=num_clients) + controller = FedAvg( + stop_cond="accuracy > 25", + save_filename="global_model.pt", + initial_model=SimpleNetwork(), + num_clients=num_clients, + num_rounds=num_rounds, + ) + + job.to_server(controller) + + # Add clients + for i in range(num_clients): + executor = ScriptRunner(script=train_script, script_args="") + job.to(executor, f"site-{i+1}") + + job_config_dir = os.path.join(config_dir, job_name) + print(f"job-config for {job_name} is at ", job_config_dir) + job.export_job(config_dir) + # job.simulator_run(config_dir) diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/code/requirements.txt b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/code/requirements.txt new file mode 100644 index 0000000000..57b4df2ed4 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/code/requirements.txt @@ -0,0 +1,3 @@ +torch +torchvision +tensorboard \ No newline at end of file diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/code/src/client.py 
b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/code/src/client.py new file mode 100644 index 0000000000..220559b3cf --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/code/src/client.py @@ -0,0 +1,193 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import torch +import torch.nn as nn +import torch.optim as optim +import torchvision +import torchvision.transforms as transforms +from network import SimpleNetwork + +# (1) import nvflare client API +import nvflare.client as flare +from nvflare.app_common.app_constant import ModelName + +# (optional) set a fixed location so we don't need to download every time +CIFAR10_ROOT = "/tmp/nvflare/data/cifar10" + +# (optional) We use the GPU, when available, to speed things up. 
+# if you want to use CPU, change DEVICE="cpu" +DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") + + +def define_parser(): + parser = argparse.ArgumentParser() + parser.add_argument("--dataset_path", type=str, default=CIFAR10_ROOT, nargs="?") + parser.add_argument("--batch_size", type=int, default=4, nargs="?") + parser.add_argument("--learning_rate", type=float, default=0.001, nargs="?") + parser.add_argument("--num_workers", type=int, default=1, nargs="?") + parser.add_argument("--local_epochs", type=int, default=2, nargs="?") + parser.add_argument("--model_path", type=str, default=f"{CIFAR10_ROOT}/cifar_net.pth", nargs="?") + return parser.parse_args() + + +def main(): + # define local parameters + args = define_parser() + + dataset_path = args.dataset_path + batch_size = args.batch_size + num_workers = args.num_workers + local_epochs = args.local_epochs + model_path = args.model_path + lr = args.learning_rate + + transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]) + trainset = torchvision.datasets.CIFAR10(root=dataset_path, train=True, download=True, transform=transform) + trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size, shuffle=True, num_workers=num_workers) + testset = torchvision.datasets.CIFAR10(root=dataset_path, train=False, download=True, transform=transform) + testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size, shuffle=False, num_workers=num_workers) + + net = SimpleNetwork() + best_accuracy = 0.0 + + # wraps evaluation logic into a method to re-use for + # evaluation on both trained and received model + def evaluate(input_weights): + net = SimpleNetwork() + net.load_state_dict(input_weights) + # (optional) use GPU to speed things up + net.to(DEVICE) + + correct = 0 + total = 0 + # since we're not training, we don't need to calculate the gradients for our outputs + with torch.no_grad(): + for data in testloader: + # 
(optional) use GPU to speed things up + images, labels = data[0].to(DEVICE), data[1].to(DEVICE) + # calculate outputs by running images through the network + outputs = net(images) + # the class with the highest energy is what we choose as prediction + _, predicted = torch.max(outputs.data, 1) + total += labels.size(0) + correct += (predicted == labels).sum().item() + + return 100 * correct // total + + # (2) initialize NVFlare client API + flare.init() + + # (3) run continuously when launch_once=true + while flare.is_running(): + + # (4) receive FLModel from NVFlare + input_model = flare.receive() + client_id = flare.get_site_name() + + # Based on different "task" we will do different things + # for "train" task (flare.is_train()) we use the received model to do training and/or evaluation + # and send back updated model and/or evaluation metrics, if the "train_with_evaluation" is specified as True + # in the config_fed_client we will need to do evaluation and include the evaluation metrics + # for "evaluate" task (flare.is_evaluate()) we use the received model to do evaluation + # and send back the evaluation metrics + # for "submit_model" task (flare.is_submit_model()) we just need to send back the local model + # (5) performing train task on received model + if flare.is_train(): + print(f"({client_id}) current_round={input_model.current_round}, total_rounds={input_model.total_rounds}") + + # (5.1) loads model from NVFlare + net.load_state_dict(input_model.params) + + criterion = nn.CrossEntropyLoss() + optimizer = optim.SGD(net.parameters(), lr=lr, momentum=0.9) + + # (optional) use GPU to speed things up + net.to(DEVICE) + # (optional) calculate total steps + steps = local_epochs * len(trainloader) + for epoch in range(local_epochs): # loop over the dataset multiple times + + running_loss = 0.0 + for i, data in enumerate(trainloader, 0): + # get the inputs; data is a list of [inputs, labels] + # (optional) use GPU to speed things up + inputs, labels = 
data[0].to(DEVICE), data[1].to(DEVICE) + + # zero the parameter gradients + optimizer.zero_grad() + + # forward + backward + optimize + outputs = net(inputs) + loss = criterion(outputs, labels) + loss.backward() + optimizer.step() + + # print statistics + running_loss += loss.item() + if i % 2000 == 1999: # print every 2000 mini-batches + print(f"({client_id}) [{epoch + 1}, {i + 1:5d}] loss: {running_loss / 2000:.3f}") + running_loss = 0.0 + break + + print(f"({client_id}) Finished Training") + + # (5.2) evaluation on local trained model to save best model + local_accuracy = evaluate(net.state_dict()) + print(f"({client_id}) Evaluating local trained model. Accuracy on the 10000 test images: {local_accuracy}") + if local_accuracy > best_accuracy: + best_accuracy = local_accuracy + torch.save(net.state_dict(), model_path) + + # (5.3) evaluate on received model for model selection + accuracy = evaluate(input_model.params) + print( + f"({client_id}) Evaluating received model for model selection. 
Accuracy on the 10000 test images: {accuracy}" + ) + + # (5.4) construct trained FL model + output_model = flare.FLModel( + params=net.cpu().state_dict(), + metrics={"accuracy": accuracy}, + meta={"NUM_STEPS_CURRENT_ROUND": steps}, + ) + + # (5.5) send model back to NVFlare + flare.send(output_model) + + # (6) performing evaluate task on received model + elif flare.is_evaluate(): + accuracy = evaluate(input_model.params) + print(f"({client_id}) accuracy: {accuracy}") + flare.send(flare.FLModel(metrics={"accuracy": accuracy})) + + # (7) performing submit_model task to obtain best local model + elif flare.is_submit_model(): + model_name = input_model.meta["submit_model_name"] + if model_name == ModelName.BEST_MODEL: + try: + weights = torch.load(model_path) + net = SimpleNetwork() + net.load_state_dict(weights) + flare.send(flare.FLModel(params=net.cpu().state_dict())) + except Exception as e: + raise ValueError("Unable to load best model") from e + else: + raise ValueError(f"Unknown model_type: {model_name}") + + +if __name__ == "__main__": + main() diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/code/src/fedavg.py b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/code/src/fedavg.py new file mode 100644 index 0000000000..a63f0005bf --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/code/src/fedavg.py @@ -0,0 +1,158 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +from typing import Callable, Dict, List, Optional + +import torch + +from nvflare.app_common.abstract.fl_model import FLModel +from nvflare.app_common.utils.math_utils import parse_compare_criteria +from nvflare.app_common.workflows.base_fedavg import BaseFedAvg +from nvflare.app_opt.pt.decomposers import TensorDecomposer +from nvflare.fuel.utils import fobs + + +class FedAvg(BaseFedAvg): + """Controller for FedAvg Workflow with Early Stopping and Model Selection. + + Args: + num_clients (int, optional): The number of clients. Defaults to 3. + num_rounds (int, optional): The total number of training rounds. Defaults to 5. + stop_cond (str, optional): early stopping condition based on metric. + string literal in the format of " " (e.g. 
"accuracy >= 80") + save_filename (str, optional): filename for saving model + initial_model (nn.Module, optional): initial PyTorch model + """ + + def __init__( + self, + *args, + stop_cond: str, + num_rounds: int, + save_filename: str = "FL_global_model.pt", + initial_model=None, + **kwargs, + ): + super().__init__(*args, **kwargs) + + self.stop_cond = stop_cond + self.num_rounds = num_rounds + + if stop_cond: + self.stop_condition = parse_compare_criteria(stop_cond) + else: + self.stop_condition = None + self.save_filename = save_filename + self.initial_model = initial_model + self.best_model: Optional[FLModel] = None + + def run(self) -> None: + self.info("Start FedAvg.") + + if self.initial_model: + # Use FOBS for serializing/deserializing PyTorch tensors (self.initial_model) + fobs.register(TensorDecomposer) + # PyTorch weights + initial_weights = self.initial_model.state_dict() + else: + initial_weights = {} + + model = FLModel(params=initial_weights) + + model.start_round = self.start_round + model.total_rounds = self.num_rounds + + for self.current_round in range(self.start_round, self.start_round + self.num_rounds): + self.info(f"Round {self.current_round} started.") + model.current_round = self.current_round + + clients = self.sample_clients(self.num_clients) + + results: List[FLModel] = self.send_model_and_wait(targets=clients, data=model) + aggregate_results = self.aggregate( + results, aggregate_fn=self.aggregate_fn + ) # using default aggregate_fn with `WeightedAggregationHelper`. 
Can overwrite self.aggregate_fn with signature Callable[List[FLModel], FLModel] + + model = self.update_model(model, aggregate_results) + + self.info(f"Round {self.current_round} global metrics: {model.metrics}") + + self.select_best_model(model) + + self.save_model(self.best_model, os.path.join(os.getcwd(), self.save_filename)) + + if self.should_stop(model.metrics, self.stop_condition): + self.info( + f"Stopping at round={self.current_round} out of total_rounds={self.num_rounds}. Early stop condition satisfied: {self.stop_condition}" + ) + break + + self.info("Finished FedAvg.") + + def should_stop(self, metrics: Optional[Dict] = None, stop_condition: Optional[str] = None): + if stop_condition is None or metrics is None: + return False + + key, target, op_fn = stop_condition + value = metrics.get(key, None) + + if value is None: + raise RuntimeError(f"stop criteria key '{key}' doesn't exists in metrics") + + return op_fn(value, target) + + def select_best_model(self, curr_model: FLModel): + if self.best_model is None: + self.best_model = curr_model + return + + if self.stop_condition: + metric, _, op_fn = self.stop_condition + if self.is_curr_model_better(self.best_model, curr_model, metric, op_fn): + self.info("Current model is new best model.") + self.best_model = curr_model + else: + self.best_model = curr_model + + def is_curr_model_better( + self, best_model: FLModel, curr_model: FLModel, target_metric: str, op_fn: Callable + ) -> bool: + curr_metrics = curr_model.metrics + if curr_metrics is None: + return False + if target_metric not in curr_metrics: + return False + + best_metrics = best_model.metrics + return op_fn(curr_metrics.get(target_metric), best_metrics.get(target_metric)) + + def save_model(self, model, filepath=""): + params = model.params + # PyTorch save + torch.save(params, filepath) + + # save FLModel metadata + model.params = {} + fobs.dumpf(model, filepath + ".metadata") + model.params = params + + def load_model(self, filepath=""): + # 
PyTorch load + params = torch.load(filepath) + + # load FLModel metadata + model = fobs.loadf(filepath + ".metadata") + model.params = params + return model diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/code/src/network.py b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/code/src/network.py new file mode 100644 index 0000000000..609b0b1581 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/code/src/network.py @@ -0,0 +1,37 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import torch +import torch.nn as nn +import torch.nn.functional as F + + +class SimpleNetwork(nn.Module): + def __init__(self): + super(SimpleNetwork, self).__init__() + self.conv1 = nn.Conv2d(3, 6, 5) + self.pool = nn.MaxPool2d(2, 2) + self.conv2 = nn.Conv2d(6, 16, 5) + self.fc1 = nn.Linear(16 * 5 * 5, 120) + self.fc2 = nn.Linear(120, 84) + self.fc3 = nn.Linear(84, 10) + + def forward(self, x): + x = self.pool(F.relu(self.conv1(x))) + x = self.pool(F.relu(self.conv2(x))) + x = torch.flatten(x, 1) # flatten all dimensions except batch + x = F.relu(self.fc1(x)) + x = F.relu(self.fc2(x)) + x = self.fc3(x) + return x diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/edit_site_local_resources.py b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/edit_site_local_resources.py new file mode 100644 index 0000000000..4b0ff4a76a --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/edit_site_local_resources.py @@ -0,0 +1,64 @@ +# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ + +import json +import os +import sys + + +def get_security_handler() -> dict: + return json.loads( + """ + { + "id": "security_handler", + "path": "keycloak_security_handler.CustomSecurityHandler" + } + """ + ) + + +def add_components_to_json( + input_file_path, output_file_path, site: str, receiving: bool = False, streaming_to_server: bool = False +): + try: + with open(input_file_path, "r") as file: + data = json.load(file) + except (FileNotFoundError, json.JSONDecodeError): + print(f"Error: Unable to read or parse JSON file: {input_file_path}") + return + + new_components = [get_security_handler()] + + # Append new components to the list + data["components"].extend(new_components) + + # Write the updated JSON back to the file + with open(output_file_path, "w") as file: + json.dump(data, file, indent=4) + + print(f"Successfully generate file: '{output_file_path}'.") + + +if __name__ == "__main__": + + site_name = sys.argv[1] + project_root_dir = sys.argv[2] + + print(site_name, project_root_dir) + + input_file_path = os.path.join(project_root_dir, site_name, "local", "resources.json.default") + output_file_path = os.path.join(project_root_dir, site_name, "local", "resources.json") + + add_components_to_json(input_file_path, output_file_path, site_name) diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/get_keycloak_access_token.py b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/get_keycloak_access_token.py new file mode 100644 index 0000000000..631de725ad --- /dev/null +++ 
b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/get_keycloak_access_token.py @@ -0,0 +1,69 @@ +# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + + +import os +import sys + +import requests + + +def save_access_token(access_token: str, destination_path): + + # Ensure the destination directory exists + os.makedirs(os.path.dirname(destination_path), exist_ok=True) + + with open(destination_path, "w") as f: + f.write(access_token) + print(f"Access token saved to {destination_path}") + + +def get_keycloak_acces_token(username, password, client_id, keycloak_url) -> str: + + # Request payload + data = {"username": username, "password": password, "grant_type": "password", "client_id": client_id} + + try: + # Make a POST request to get the access token + response = requests.post(keycloak_url, data=data, headers={"Content-Type": "application/x-www-form-urlencoded"}) + response_data = response.json() + + # Extract the access token + access_token = response_data.get("access_token") + + if not access_token: + print("Failed to retrieve access token.") + else: + return access_token + + except Exception as e: + print(f"Error fetching access token: {e}") + + +if __name__ == "__main__": + + # Define variables + keycloak_url = 
"http://localhost:8080/realms/master/protocol/openid-connect/token" + username = "admin" + password = "admin123" + client_id = "admin-cli" + destination_path = sys.argv[1] + + token = get_keycloak_acces_token( + username=username, password=password, client_id=client_id, keycloak_url=keycloak_url + ) + + print("token=", token) + + save_access_token(token, destination_path=destination_path) diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/keycloak-setup/docker-compose.yml b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/keycloak-setup/docker-compose.yml new file mode 100644 index 0000000000..26feecc5ec --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/keycloak-setup/docker-compose.yml @@ -0,0 +1,45 @@ +version: '3.8' + +services: + keycloak: + build: . 
+ container_name: keycloak + environment: + - KEYCLOAK_ADMIN=admin + - KEYCLOAK_ADMIN_PASSWORD=admin123 + ports: + - "8080:8080" + volumes: + - .:/opt/keycloak-setup + depends_on: + db: + condition: service_healthy + networks: + - bridge_network + + + db: + image: postgres:15 + container_name: keycloak-db + environment: + - POSTGRES_DB=keycloak + - POSTGRES_USER=keycloak + - POSTGRES_PASSWORD=keycloak + ports: + - "5432:5432" + volumes: + - pgdata:/var/lib/postgresql/data + healthcheck: + test: ["CMD-SHELL", "pg_isready -U keycloak"] + interval: 10s + retries: 5 + start_period: 10s + networks: + - bridge_network + +volumes: + pgdata: + +networks: + bridge_network: + driver: bridge \ No newline at end of file diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/keycloak-setup/dockerfile b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/keycloak-setup/dockerfile new file mode 100644 index 0000000000..cac41467cd --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/keycloak-setup/dockerfile @@ -0,0 +1,18 @@ +FROM bitnami/keycloak:24 + +USER root + +# Install jq (and any other necessary dependencies) +RUN apt-get update && apt-get install -y jq + +# Set working directory +WORKDIR /opt/keycloak-setup + +# Copy the setup scripts to the container +COPY ./init.sh /opt/keycloak-setup/init.sh + +# Set permissions to ensure init.sh is executable +RUN chmod +x /opt/keycloak-setup/init.sh + +# Set the entrypoint +ENTRYPOINT ["/bin/sh", "-c", "/opt/bitnami/scripts/keycloak/run.sh & sleep 10 && /opt/keycloak-setup/init.sh && tail -f /dev/null"] \ No 
newline at end of file diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/keycloak-setup/init.sh b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/keycloak-setup/init.sh new file mode 100755 index 0000000000..59708ffbf9 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/keycloak-setup/init.sh @@ -0,0 +1,70 @@ +#!/bin/bash + +# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Wait for Keycloak to be ready +echo "Waiting for Keycloak to be ready..." +until curl -sf http://keycloak:8080/realms/master > /dev/null; do + printf '.' + sleep 5 +done + + +echo "Keycloak is ready!" 
+ +# Get Admin Token +ACCESS_TOKEN=$(curl -s -X POST "http://keycloak:8080/realms/master/protocol/openid-connect/token" \ + -H "Content-Type: application/x-www-form-urlencoded" \ + -d "username=admin" \ + -d "password=admin123" \ + -d "grant_type=password" \ + -d "client_id=admin-cli" | jq -r .access_token) + +# Create Realm +curl -X POST "http://keycloak:8080/admin/realms" \ + -H "Content-Type: application/json" \ + -H "Authorization: Bearer $ACCESS_TOKEN" \ + -d '{"realm": "myrealm", "enabled": true}' + +# Create User +curl -X POST "http://keycloak:8080/admin/realms/myrealm/users" \ + -H "Content-Type: application/json" \ + -H "Authorization: Bearer $ACCESS_TOKEN" \ + -d '{ + "username": "myuser@example.com", + "enabled": true, + "email": "myuser@example.com", + "firstName": "My", + "lastName": "User", + "credentials": [{ + "type": "password", + "value": "password123", + "temporary": false + }] + }' + +# Create Client +curl -X POST "http://keycloak:8080/admin/realms/myrealm/clients" \ + -H "Content-Type: application/json" \ + -H "Authorization: Bearer $ACCESS_TOKEN" \ + -d '{ + "clientId": "myclient", + "enabled": true, + "protocol": "openid-connect", + "publicClient": true, + "redirectUris": ["http://localhost:8080/*"] + }' + +echo "Setup completed!" 
diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/keycloak_integration.ipynb b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/keycloak_integration.ipynb new file mode 100644 index 0000000000..1c02e7788f --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/keycloak_integration.ipynb @@ -0,0 +1,362 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "c8393461", + "metadata": {}, + "source": [ + "## Integration with external authentication system\n", + "\n", + "#### Overview\n", + "\n", + "In Federated Computing systems, many participating institutions already have their own in-house authentication systems. Instead of introducing a new authentication mechanism, we need to integrate with them. Note that this integration can be site-specific; in other words, each site may be different. Site-1 may use LDAP, site-2 may use OAuth, and a third site may use something else.\n", + "\n", + "In this example, we demonstrate NVIDIA FLARE's event-based plugin components, which can be used to integrate any type of authentication/authorization mechanism, using the open-source KeyCloak as an example.\n", + "\n", + "### Setup KeyCloak\n", + "\n", + "Before we start, we need to download and start the KeyCloak service. 
To do that, we create a [dockerfile](./examples/custom_client_side_auth_system_integration/keycloak-setup/dockerfile) and a [docker-compose](./examples/custom_client_side_auth_system_integration/keycloak-setup/docker-compose.yml) file.\n", + "\n", + "To start, cd to the ```custom_client_side_auth_system_integration/keycloak-setup``` directory and run the following command:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8dbd0c77", + "metadata": {}, + "outputs": [], + "source": [ + "%cd keycloak-setup\n", + "! docker compose up -d --build\n", + "%cd -\n" + ] + }, + { + "cell_type": "markdown", + "id": "bc6ccaa0", + "metadata": {}, + "source": [ + "Check whether KeyCloak is ready: \n", + "!docker ps\n", + "!docker logs keycloak \n", + "\n", + "You should see something like: \n", + "\n", + "```\n", + " Keycloak is ready!\n", + " ...\n", + "\n", + " Setup completed!\n", + "```\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "id": "dd304d80", + "metadata": {}, + "source": [ + "You can also open the website and log in with the credentials user = admin, password = admin123" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fd65e072", + "metadata": {}, + "outputs": [], + "source": [ + "http://localhost:8080\n" + ] + }, + { + "cell_type": "markdown", + "id": "be7f97c7", + "metadata": {}, + "source": [ + "### Setup KeyCloak Authentication Plugin\n", + "\n", + "This integration involves two plugins: \n", + "\n", + "* At the admin client, during job submission, we require a login for the given site (site-1). 
We also need to get the access token to pass to the job context\n", + "* At the local site, we need a plugin for job authorization\n", + "\n", + "\n", + "#### Set up FL Client Job Authorization Configuration\n", + "\n", + "First, we need to replace the default local resources.json.default with a resources.json that adds the custom security check component:\n", + "\n", + " {\n", + " \"id\": \"security_handler\",\n", + " \"path\": \"keycloak_security_handler.CustomSecurityHandler\"\n", + " }\n", + "\n", + "The \"keycloak_security_handler.CustomSecurityHandler\" is defined as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7d0da496", + "metadata": {}, + "outputs": [], + "source": [ + "!cat security/custom/keycloak_security_handler.py" + ] + }, + { + "cell_type": "markdown", + "id": "8cf35e71", + "metadata": {}, + "source": [ + "We also need to save the KeyCloak public_key in the `/tmp/nvflare/poc/example_project/prod_00/site-1/local/site-1/local/public_key.pem` file, with the following format:\n", + "\n", + "```\n", + "-----BEGIN PUBLIC KEY-----\n", + "MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAre3kQxqOfTJ7LLRwlpotw47goqSsuyFOg9Ihx5IXDMbO8HTGuGQcdDVJaYJQYphfhp2qdw+1o6qVN2yPBxwiBWju/XZQMPbCXRBu2bVDffWJVMoelLDbr3uY9hCgYgmB7qYpDdNOmxb2+xIlg/x0q+vrRRMtdd8SGicvjg0mQSEEF4a7QOSwuDnwBX8+bMOXfyB5qQJlakNVND1Bc+MjDENkHLtImVowX9XZcz8M6Ap9Eq1z2agl6lmFxTLtZroTE6IQS/dFYPVy4rZ1Zuy5cvs/3j+SYzlplH/iP3qZs8UiKrTJMmfIuLmDbP3hEAOsEmQ/M3lRxnE4wuGxvel5rwIDAQAB\n", + "-----END PUBLIC KEY-----\n", + "```\n", + "\n", + "The local/custom/resources.json config file contains the following additional security handler:\n", + "\n", + "```\n", + " {\n", + " \"id\": \"security_handler\",\n", + " \"path\": \"keycloak_security_handler.CustomSecurityHandler\"\n", + " }\n", + "```\n", + "\n", + "The CustomSecurityHandler in the custom/keycloak_security_handler.py contains the logic to validate the admin user's KeyCloak access token when the admin 
user submits a job, or 
the scheduler picks up an already submitted job from the admin user. If the access token is invalid, the job will not be authorized to run.\n", + "\n", + "We can set this up with the following commands. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8551b25b", + "metadata": {}, + "outputs": [], + "source": [ + "# prepare poc\n", + "! echo y | nvflare poc prepare -n 2\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "636b2727", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "!python get_keycloak_access_token.py /tmp/nvflare/poc/example_project/prod_00/site-1/local/site-1/local/public_key.pem\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1e8644a9", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "!python edit_site_local_resources.py site-1 /tmp/nvflare/poc/example_project/prod_00\n" + ] + }, + { + "cell_type": "markdown", + "id": "93dcf155", + "metadata": {}, + "source": [ + "\n", + "#### Set up Admin user authentication\n", + "\n", + "\n", + "The local/custom/resources.json config file contains the following admin event handler. The \"orgs\" arg maps each site name to its corresponding KeyCloak access_token URL:\n", + "\n", + "```\n", + " {\n", + " \"id\": \"auth\",\n", + " \"path\": \"admin_auth.AdminAuth\",\n", + " \"args\": {\n", + " \"orgs\": {\n", + " \"site-a\": \"http://localhost:8080/realms/myrealm/protocol/openid-connect/token\"\n", + " }\n", + " }\n", + " }\n", + "```\n", + "\n", + "The AdminAuth event handler in the custom/admin_auth.py has the logic to acquire the KeyCloak access tokens for each individual site. When the admin user submits a job, it will set the tokens in the FLContext.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "89ca8b71", + "metadata": {}, + "outputs": [], + "source": [ + "! mkdir -p /tmp/nvflare/poc/example_project/prod_00/admin@nvidia.com/local/\n", + "! 
cp -r admin/local/* /tmp/nvflare/poc/example_project/prod_00/admin@nvidia.com/local/" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "2044c070", + "metadata": {}, + "outputs": [], + "source": [ + "! cp -r site/local/* /tmp/nvflare/poc/example_project/prod_00/site-1/local/" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c48462ee", + "metadata": {}, + "outputs": [], + "source": [ + "! tree /tmp/nvflare/poc/example_project/prod_00/admin@nvidia.com/local/" + ] + }, + { + "cell_type": "markdown", + "id": "e6b36385", + "metadata": {}, + "source": [ + "### Test the setup \n", + "\n", + "Start POC.\n", + "\n", + "\n", + "#### Logging in with the Admin Console\n", + "\n", + "At the prompt, enter the user email `admin@nvidia.com`, and then provide the password for the `site-1` KeyCloak.\n", + "\n", + "\n", + "#### Require authenticated admin user when running jobs\n", + "\n", + "With this system set up, `site-1` allows only an authenticated admin user to submit and run a job. `site-2` does not have this additional security requirement. Any admin user can submit and run the job.\n", + "\n", + "\n", + "##### Authenticated admin user\n", + "\n", + "* `admin@nvidia.com` should be successfully authenticated by the `site-1` KeyCloak system. The job is successfully submitted and run on both `site-1` and `site-2`.\n", + "\n", + "##### Un-authenticated admin user\n", + "\n", + "* If the wrong password is provided, or for some reason the KeyCloak system is not available when starting the admin tool or submitting the job, the job won't be able to run on `site-1`. 
\n", + "* `site-1` will show \"ERROR - Authorization failed\", but the job can successfully run on `site-2`.\n", + "* The `list_jobs -d JOB_ID` command shows the \"job_deploy_detail\" information of this job.\n", + "\n", + "" + ] + }, + { + "cell_type": "markdown", + "id": "40073347", + "metadata": {}, + "source": [ + "Let's try this out" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "eadd3f5c", + "metadata": {}, + "outputs": [], + "source": [ + "# Prepare the data\n", + "\n", + "! python code/data/download.py" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "58688bcd", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "/home/chester/projects/NVFlare/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security/custom_client_side_auth_system_integration/code\n", + "job-config for fedavg is at /tmp/nvflare/jobs/workdir/fedavg\n", + "/home/chester/projects/NVFlare/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.3_site_security/custom_client_side_auth_system_integration\n" + ] + } + ], + "source": [ + "# create job config\n", + "%cd code/\n", + "\n", + "! python ./fl_jobs.py\n", + "\n", + "%cd -" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "355c3624", + "metadata": {}, + "outputs": [], + "source": [ + "! 
nvflare simulator /tmp/nvflare/jobs/workdir/fedavg -w /tmp/nvflare/workspace/tmp" + ] + }, + { + "cell_type": "markdown", + "id": "5f3de6a0", + "metadata": {}, + "source": [ + "* Start POC without admin console\n", + "\n", + " ``` nvflare poc start -ex admin@nvidia.com```" + ] + }, + { + "cell_type": "markdown", + "id": "4228c715", + "metadata": {}, + "source": [ + "* Start POC admin console in separate terminal\n", + "\n", + " ``` nvflare poc start -p admin@nvidia.com```" + ] + }, + { + "cell_type": "markdown", + "id": "f7f2dbc4", + "metadata": {}, + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/site/local/custom/keycloak_security_handler.py b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/site/local/custom/keycloak_security_handler.py new file mode 100644 index 0000000000..0deb745b57 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/site/local/custom/keycloak_security_handler.py @@ -0,0 +1,77 @@ +# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +from typing import Tuple + +import jwt + +from nvflare.apis.event_type import EventType +from nvflare.apis.fl_component import FLComponent +from nvflare.apis.fl_constant import FLContextKey +from nvflare.apis.fl_context import FLContext +from nvflare.apis.job_def import JobMetaKey + + +class CustomSecurityHandler(FLComponent): + def handle_event(self, event_type: str, fl_ctx: FLContext): + if event_type == EventType.AUTHORIZE_COMMAND_CHECK: + result, reason = self.authorize(fl_ctx=fl_ctx) + if not result: + fl_ctx.set_prop(FLContextKey.AUTHORIZATION_RESULT, False, sticky=False) + fl_ctx.set_prop(FLContextKey.AUTHORIZATION_REASON, reason, sticky=False) + + def _validate_token(self, token, fl_ctx: FLContext): + try: + workspace_root = fl_ctx.get_prop(FLContextKey.WORKSPACE_ROOT) + public_key_file = os.path.join(workspace_root, "local/public_key.pem") + with open(public_key_file, "r") as f: + public_key = f.read() + + # This JWT decode is depending on the KeyCloak set up, which uses the proper algorithm and audience for + # the access token decode. + access_token_json = jwt.decode( + token, public_key, algorithms=["RS256"], audience="account", options={"verify_signature": True} + ) + # access_token_json contains more information regarding the access token. The sample code here + # only extracts the "preferred_username" for demonstrating purpose to indicate token valid or not. 
+ user_name = access_token_json.get("preferred_username") + if user_name: + token_valid = True + else: + token_valid = False + except Exception: + token_valid = False + + # print(f"_validate_token: {token_valid}") + return token_valid + + def authorize(self, fl_ctx: FLContext) -> Tuple[bool, str]: + command = fl_ctx.get_prop(FLContextKey.COMMAND_NAME) + if command in ["check_resources", "submit_job"]: + security_items = fl_ctx.get_prop(FLContextKey.SECURITY_ITEMS) + job_meta = security_items.get(FLContextKey.JOB_META) + auth_tokens = job_meta.get(JobMetaKey.CUSTOM_PROPS, {}).get("auth_tokens") + if not auth_tokens: + return False, f"Not authorized to execute command: {command}" + + site_name = fl_ctx.get_identity_name() + site_auth_token = auth_tokens.get(site_name).split(":")[1] + + if not self._validate_token(site_auth_token, fl_ctx): + return False, f"Not authorized to execute command: {command}" + else: + return True, "" + else: + return True, "" diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/site/local/resources.json b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/site/local/resources.json new file mode 100644 index 0000000000..2281f704d8 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_auth_system_integration/site/local/resources.json @@ -0,0 +1,26 @@ +{ + "format_version": 2, + "client": { + "retry_timeout": 30, + "compression": "Gzip" + }, + "components": [ + { + "id": "resource_manager", + "path": "nvflare.app_common.resource_managers.gpu_resource_manager.GPUResourceManager", + "args": { + "num_of_gpus": 0, + "mem_per_gpu_in_GiB": 0 
+ } + }, + { 
+ "id": "resource_consumer", + "path": "nvflare.app_common.resource_consumers.gpu_resource_consumer.GPUResourceConsumer", + "args": {} + }, + { + "id": "security_handler", + "path": "keycloak_security_handler.CustomSecurityHandler" + } + ] +} diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_job_level_authorization/client_side_security_check.ipynb b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_job_level_authorization/client_side_security_check.ipynb new file mode 100644 index 0000000000..27696ea45c --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_job_level_authorization/client_side_security_check.ipynb @@ -0,0 +1,213 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "e0053b6f", + "metadata": {}, + "source": [ + "## Client Side: Customized Job-level Authorization\n", + "\n", + "Let's take a look at authorization on the client side.\n", + "\n", + "**Setup**\n", + "\n", + "* `server`: NVFlare server\n", + "* `site-1`: Site-1 has a CustomSecurityHandler set up which does not allow the job \"secret-job\" to run. All other jobs will be able to deploy and run on site-1.\n", + "* `site-2`: Site-2 allows any job to be deployed and run.\n", + "\n", + "**Expectation**\n", + "* \"secret-job\" will be deployed and run on site-2 but not on site-1\n", + "\n", + "\n", + "What we will do: \n", + "\n", + "* install dependencies\n", + "* download data\n", + "* generate two job configs using fl_jobs.py\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2a9cd2a1", + "metadata": {}, + "outputs": [], + "source": [ + "# install dependencies\n", + "\n", + "! 
pip install -r code/requirements.txt\n", + "\n", + "# download data\n", + "\n", + "! python code/data/download.py" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c49a6ea6", + "metadata": {}, + "outputs": [], + "source": [ + "%cd code\n", + "\n", + "! python fl_jobs.py\n", + "\n", + "# change back\n", + "%cd - \n" + ] + }, + { + "cell_type": "markdown", + "id": "b06fce58", + "metadata": {}, + "source": [ + "Next we:\n", + "* create a POC workspace,\n", + "* install the customized security handler on site-1,\n", + "* edit site-1/local/resources.json to add the security handler component.\n", + "> Note: \n", + " to simplify, we just copy the pre-edited resources.json to that location\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "15ec4aee", + "metadata": {}, + "outputs": [], + "source": [ + "# prepare poc\n", + "! echo y | nvflare poc prepare -n 2\n" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "c1856aff", + "metadata": {}, + "outputs": [], + "source": [ + "# cp security handler and component config\n", + "!cp -r security/site-1/* /tmp/nvflare/poc/example_project/prod_00/site-1/local/.\n" + ] + }, + { + "cell_type": "markdown", + "id": "b1699805", + "metadata": {}, + "source": [ + "Now we are ready to run the job. \n", + "\n", + "\n", + "* Start POC\n", + "\n", + " Use a terminal (not a notebook cell) to start the POC with the following command:\n", + "\n", + " ```\n", + " nvflare poc start -ex admin@nvidia.com \n", + "\n", + " ```\n", + "\n", + " This brings up the FL system. \n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "79e8d608", + "metadata": {}, + "outputs": [], + "source": [ + "# Submit jobs\n", + "# Assuming at this point FL system is already running via poc start command\n", + "\n", + "! 
nvflare job submit -j /tmp/nvflare/jobs/workdir/fedavg\n", + "\n", + "# The job should finish as expected" + ] + }, + { + "cell_type": "markdown", + "id": "c82c3ba6", + "metadata": {}, + "source": [ + "The fedavg job completed well. Now let's submit \"secret-job\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "e1872ca6", + "metadata": {}, + "outputs": [], + "source": [ + "# Submit jobs\n", + "# Assuming at this point FL system is already running via poc start command\n", + "\n", + "! nvflare job submit -j /tmp/nvflare/jobs/workdir/secret-job" + ] + }, + { + "cell_type": "markdown", + "id": "97480fa8", + "metadata": {}, + "source": [ + "you should get something like\n", + "\n", + "```\n", + "2025-02-02 20:31:03,494 - site_security - ERROR - Authorization failed. Reason: Not authorized to execute: check_resources\n", + "2025-02-02 20:31:03,496 - ServerEngine - ERROR - Client reply error: Not authorized to execute: check_resources\n", + "\n", + "```\n", + "\n", + "* Cleanup " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "9b0a4532", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "! nvflare poc stop" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8cf83674", + "metadata": {}, + "outputs": [], + "source": [ + "! 
nvflare poc clean" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_job_level_authorization/code/data/download.py b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_job_level_authorization/code/data/download.py new file mode 100644 index 0000000000..ebd8cfdc41 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_job_level_authorization/code/data/download.py @@ -0,0 +1,60 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# This Dirichlet sampling strategy for creating a heterogeneous partition is adopted +# from FedMA (https://github.com/IBM/FedMA). 
+ +# MIT License + +# Copyright (c) 2020 International Business Machines + +# Permission is hereby granted, free of charge, to any person obtaining a copy +# of this software and associated documentation files (the "Software"), to deal +# in the Software without restriction, including without limitation the rights +# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +# copies of the Software, and to permit persons to whom the Software is +# furnished to do so, subject to the following conditions: + +# The above copyright notice and this permission notice shall be included in all +# copies or substantial portions of the Software. + +# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +# SOFTWARE. 
+import argparse + +import torchvision.datasets as datasets + +# default dataset path +CIFAR10_ROOT = "/tmp/nvflare/data/cifar10" + + +def define_parser(): + parser = argparse.ArgumentParser() + parser.add_argument("--dataset_path", type=str, default=CIFAR10_ROOT, nargs="?") + args = parser.parse_args() + return args + + +def main(args): + datasets.CIFAR10(root=args.dataset_path, train=True, download=True) + datasets.CIFAR10(root=args.dataset_path, train=False, download=True) + + +if __name__ == "__main__": + main(define_parser()) diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_job_level_authorization/code/fl_jobs.py b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_job_level_authorization/code/fl_jobs.py new file mode 100644 index 0000000000..b574730490 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_job_level_authorization/code/fl_jobs.py @@ -0,0 +1,51 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ + +import os + +from src.fedavg import FedAvg +from src.network import SimpleNetwork + +from nvflare.job_config.api import FedJob +from nvflare.job_config.script_runner import ScriptRunner + +if __name__ == "__main__": + num_clients = 2 + num_rounds = 2 + job_names = ["fedavg", "secret-job"] + train_script = "src/client.py" + config_dir = "/tmp/nvflare/jobs/workdir" + + for job_name in job_names: + job = FedJob(name=job_name, min_clients=num_clients) + controller = FedAvg( + stop_cond="accuracy > 25", + save_filename="global_model.pt", + initial_model=SimpleNetwork(), + num_clients=num_clients, + num_rounds=num_rounds, + ) + + job.to_server(controller) + + # Add clients + for i in range(num_clients): + executor = ScriptRunner(script=train_script, script_args="") + job.to(executor, f"site-{i+1}") + + job_config_dir = os.path.join(config_dir, job_name) + print(f"job-config for {job_name} is at ", job_config_dir) + job.export_job(config_dir) + # job.simulator_run(config_dir) diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_job_level_authorization/code/requirements.txt b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_job_level_authorization/code/requirements.txt new file mode 100644 index 0000000000..57b4df2ed4 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_job_level_authorization/code/requirements.txt @@ -0,0 +1,3 @@ +torch +torchvision +tensorboard \ No newline at end of file diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_job_level_authorization/code/src/client.py 
b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_job_level_authorization/code/src/client.py new file mode 100644 index 0000000000..220559b3cf --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_job_level_authorization/code/src/client.py @@ -0,0 +1,193 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import argparse + +import torch +import torch.nn as nn +import torch.optim as optim +import torchvision +import torchvision.transforms as transforms +from network import SimpleNetwork + +# (1) import nvflare client API +import nvflare.client as flare +from nvflare.app_common.app_constant import ModelName + +# (optional) set a fix place so we don't need to download everytime +CIFAR10_ROOT = "/tmp/nvflare/data/cifar10" + +# (optional) We change to use GPU to speed things up. 
+# if you want to use CPU, change DEVICE="cpu" +DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") + + +def define_parser(): + parser = argparse.ArgumentParser() + parser.add_argument("--dataset_path", type=str, default=CIFAR10_ROOT, nargs="?") + parser.add_argument("--batch_size", type=int, default=4, nargs="?") + parser.add_argument("--learning_rate", type=float, default=0.001, nargs="?") + parser.add_argument("--num_workers", type=int, default=1, nargs="?") + parser.add_argument("--local_epochs", type=int, default=2, nargs="?") + parser.add_argument("--model_path", type=str, default=f"{CIFAR10_ROOT}/cifar_net.pth", nargs="?") + return parser.parse_args() + + +def main(): + # define local parameters + args = define_parser() + + dataset_path = args.dataset_path + batch_size = args.batch_size + num_workers = args.num_workers + local_epochs = args.local_epochs + model_path = args.model_path + lr = args.learning_rate + + transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))]) + trainset = torchvision.datasets.CIFAR10(root=dataset_path, train=True, download=True, transform=transform) + trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size, shuffle=True, num_workers=num_workers) + testset = torchvision.datasets.CIFAR10(root=dataset_path, train=False, download=True, transform=transform) + testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size, shuffle=False, num_workers=num_workers) + + net = SimpleNetwork() + best_accuracy = 0.0 + + # wraps evaluation logic into a method to re-use for + # evaluation on both trained and received model + def evaluate(input_weights): + net = SimpleNetwork() + net.load_state_dict(input_weights) + # (optional) use GPU to speed things up + net.to(DEVICE) + + correct = 0 + total = 0 + # since we're not training, we don't need to calculate the gradients for our outputs + with torch.no_grad(): + for data in testloader: + # 
(optional) use GPU to speed things up + images, labels = data[0].to(DEVICE), data[1].to(DEVICE) + # calculate outputs by running images through the network + outputs = net(images) + # the class with the highest energy is what we choose as prediction + _, predicted = torch.max(outputs.data, 1) + total += labels.size(0) + correct += (predicted == labels).sum().item() + + return 100 * correct // total + + # (2) initialize NVFlare client API + flare.init() + + # (3) run continuously when launch_once=true + while flare.is_running(): + + # (4) receive FLModel from NVFlare + input_model = flare.receive() + client_id = flare.get_site_name() + + # Based on different "task" we will do different things + # for "train" task (flare.is_train()) we use the received model to do training and/or evaluation + # and send back updated model and/or evaluation metrics, if the "train_with_evaluation" is specified as True + # in the config_fed_client we will need to do evaluation and include the evaluation metrics + # for "evaluate" task (flare.is_evaluate()) we use the received model to do evaluation + # and send back the evaluation metrics + # for "submit_model" task (flare.is_submit_model()) we just need to send back the local model + # (5) performing train task on received model + if flare.is_train(): + print(f"({client_id}) current_round={input_model.current_round}, total_rounds={input_model.total_rounds}") + + # (5.1) loads model from NVFlare + net.load_state_dict(input_model.params) + + criterion = nn.CrossEntropyLoss() + optimizer = optim.SGD(net.parameters(), lr=lr, momentum=0.9) + + # (optional) use GPU to speed things up + net.to(DEVICE) + # (optional) calculate total steps + steps = local_epochs * len(trainloader) + for epoch in range(local_epochs): # loop over the dataset multiple times + + running_loss = 0.0 + for i, data in enumerate(trainloader, 0): + # get the inputs; data is a list of [inputs, labels] + # (optional) use GPU to speed things up + inputs, labels = 
data[0].to(DEVICE), data[1].to(DEVICE) + + # zero the parameter gradients + optimizer.zero_grad() + + # forward + backward + optimize + outputs = net(inputs) + loss = criterion(outputs, labels) + loss.backward() + optimizer.step() + + # print statistics + running_loss += loss.item() + if i % 2000 == 1999: # print every 2000 mini-batches + print(f"({client_id}) [{epoch + 1}, {i + 1:5d}] loss: {running_loss / 2000:.3f}") + running_loss = 0.0 + break + + print(f"({client_id}) Finished Training") + + # (5.2) evaluation on local trained model to save best model + local_accuracy = evaluate(net.state_dict()) + print(f"({client_id}) Evaluating local trained model. Accuracy on the 10000 test images: {local_accuracy}") + if local_accuracy > best_accuracy: + best_accuracy = local_accuracy + torch.save(net.state_dict(), model_path) + + # (5.3) evaluate on received model for model selection + accuracy = evaluate(input_model.params) + print( + f"({client_id}) Evaluating received model for model selection. 
Accuracy on the 10000 test images: {accuracy}" + ) + + # (5.4) construct trained FL model + output_model = flare.FLModel( + params=net.cpu().state_dict(), + metrics={"accuracy": accuracy}, + meta={"NUM_STEPS_CURRENT_ROUND": steps}, + ) + + # (5.5) send model back to NVFlare + flare.send(output_model) + + # (6) performing evaluate task on received model + elif flare.is_evaluate(): + accuracy = evaluate(input_model.params) + print(f"({client_id}) accuracy: {accuracy}") + flare.send(flare.FLModel(metrics={"accuracy": accuracy})) + + # (7) performing submit_model task to obtain best local model + elif flare.is_submit_model(): + model_name = input_model.meta["submit_model_name"] + if model_name == ModelName.BEST_MODEL: + try: + weights = torch.load(model_path) + net = SimpleNetwork() + net.load_state_dict(weights) + flare.send(flare.FLModel(params=net.cpu().state_dict())) + except Exception as e: + raise ValueError("Unable to load best model") from e + else: + raise ValueError(f"Unknown model_type: {model_name}") + + +if __name__ == "__main__": + main() diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_job_level_authorization/code/src/fedavg.py b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_job_level_authorization/code/src/fedavg.py new file mode 100644 index 0000000000..a63f0005bf --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_job_level_authorization/code/src/fedavg.py @@ -0,0 +1,158 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import os +from typing import Callable, Dict, List, Optional + +import torch + +from nvflare.app_common.abstract.fl_model import FLModel +from nvflare.app_common.utils.math_utils import parse_compare_criteria +from nvflare.app_common.workflows.base_fedavg import BaseFedAvg +from nvflare.app_opt.pt.decomposers import TensorDecomposer +from nvflare.fuel.utils import fobs + + +class FedAvg(BaseFedAvg): + """Controller for FedAvg Workflow with Early Stopping and Model Selection. + + Args: + num_clients (int, optional): The number of clients. Defaults to 3. + num_rounds (int, optional): The total number of training rounds. Defaults to 5. + stop_cond (str, optional): early stopping condition based on metric. + string literal in the format of "<key> <op> <value>" (e.g. 
"accuracy >= 80") + save_filename (str, optional): filename for saving model + initial_model (nn.Module, optional): initial PyTorch model + """ + + def __init__( + self, + *args, + stop_cond: str, + num_rounds: int, + save_filename: str = "FL_global_model.pt", + initial_model=None, + **kwargs, + ): + super().__init__(*args, **kwargs) + + self.stop_cond = stop_cond + self.num_rounds = num_rounds + + if stop_cond: + self.stop_condition = parse_compare_criteria(stop_cond) + else: + self.stop_condition = None + self.save_filename = save_filename + self.initial_model = initial_model + self.best_model: Optional[FLModel] = None + + def run(self) -> None: + self.info("Start FedAvg.") + + if self.initial_model: + # Use FOBS for serializing/deserializing PyTorch tensors (self.initial_model) + fobs.register(TensorDecomposer) + # PyTorch weights + initial_weights = self.initial_model.state_dict() + else: + initial_weights = {} + + model = FLModel(params=initial_weights) + + model.start_round = self.start_round + model.total_rounds = self.num_rounds + + for self.current_round in range(self.start_round, self.start_round + self.num_rounds): + self.info(f"Round {self.current_round} started.") + model.current_round = self.current_round + + clients = self.sample_clients(self.num_clients) + + results: List[FLModel] = self.send_model_and_wait(targets=clients, data=model) + aggregate_results = self.aggregate( + results, aggregate_fn=self.aggregate_fn + ) # using default aggregate_fn with `WeightedAggregationHelper`. 
Can overwrite self.aggregate_fn with signature Callable[[List[FLModel]], FLModel] + + model = self.update_model(model, aggregate_results) + + self.info(f"Round {self.current_round} global metrics: {model.metrics}") + + self.select_best_model(model) + + self.save_model(self.best_model, os.path.join(os.getcwd(), self.save_filename)) + + if self.should_stop(model.metrics, self.stop_condition): + self.info( + f"Stopping at round={self.current_round} out of total_rounds={self.num_rounds}. Early stop condition satisfied: {self.stop_condition}" + ) + break + + self.info("Finished FedAvg.") + + def should_stop(self, metrics: Optional[Dict] = None, stop_condition: Optional[str] = None): + if stop_condition is None or metrics is None: + return False + + key, target, op_fn = stop_condition + value = metrics.get(key, None) + + if value is None: + raise RuntimeError(f"stop criteria key '{key}' doesn't exist in metrics") + + return op_fn(value, target) + + def select_best_model(self, curr_model: FLModel): + if self.best_model is None: + self.best_model = curr_model + return + + if self.stop_condition: + metric, _, op_fn = self.stop_condition + if self.is_curr_model_better(self.best_model, curr_model, metric, op_fn): + self.info("Current model is new best model.") + self.best_model = curr_model + else: + self.best_model = curr_model + + def is_curr_model_better( + self, best_model: FLModel, curr_model: FLModel, target_metric: str, op_fn: Callable + ) -> bool: + curr_metrics = curr_model.metrics + if curr_metrics is None: + return False + if target_metric not in curr_metrics: + return False + + best_metrics = best_model.metrics + return op_fn(curr_metrics.get(target_metric), best_metrics.get(target_metric)) + + def save_model(self, model, filepath=""): + params = model.params + # PyTorch save + torch.save(params, filepath) + + # save FLModel metadata + model.params = {} + fobs.dumpf(model, filepath + ".metadata") + model.params = params + + def load_model(self, filepath=""): + # 
PyTorch load + params = torch.load(filepath) + + # load FLModel metadata + model = fobs.loadf(filepath + ".metadata") + model.params = params + return model diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_job_level_authorization/code/src/network.py b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_job_level_authorization/code/src/network.py new file mode 100644 index 0000000000..609b0b1581 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_job_level_authorization/code/src/network.py @@ -0,0 +1,37 @@ +# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +import torch +import torch.nn as nn +import torch.nn.functional as F + + +class SimpleNetwork(nn.Module): + def __init__(self): + super(SimpleNetwork, self).__init__() + self.conv1 = nn.Conv2d(3, 6, 5) + self.pool = nn.MaxPool2d(2, 2) + self.conv2 = nn.Conv2d(6, 16, 5) + self.fc1 = nn.Linear(16 * 5 * 5, 120) + self.fc2 = nn.Linear(120, 84) + self.fc3 = nn.Linear(84, 10) + + def forward(self, x): + x = self.pool(F.relu(self.conv1(x))) + x = self.pool(F.relu(self.conv2(x))) + x = torch.flatten(x, 1) # flatten all dimensions except batch + x = F.relu(self.fc1(x)) + x = F.relu(self.fc2(x)) + x = self.fc3(x) + return x diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_job_level_authorization/security/site-1/custom/security_handler.py b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_job_level_authorization/security/site-1/custom/security_handler.py new file mode 100644 index 0000000000..84a8f58a8b --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_job_level_authorization/security/site-1/custom/security_handler.py @@ -0,0 +1,42 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + +from typing import Tuple + +from nvflare.apis.event_type import EventType +from nvflare.apis.fl_component import FLComponent +from nvflare.apis.fl_constant import FLContextKey +from nvflare.apis.fl_context import FLContext +from nvflare.apis.job_def import JobMetaKey + + +class CustomSecurityHandler(FLComponent): + def handle_event(self, event_type: str, fl_ctx: FLContext): + if event_type == EventType.AUTHORIZE_COMMAND_CHECK: + result, reason = self.authorize(fl_ctx=fl_ctx) + if not result: + fl_ctx.set_prop(FLContextKey.AUTHORIZATION_RESULT, False, sticky=False) + fl_ctx.set_prop(FLContextKey.AUTHORIZATION_REASON, reason, sticky=False) + + def authorize(self, fl_ctx: FLContext) -> Tuple[bool, str]: + command = fl_ctx.get_prop(FLContextKey.COMMAND_NAME) + if command in ["check_resources"]: + security_items = fl_ctx.get_prop(FLContextKey.SECURITY_ITEMS) + job_meta = security_items.get(FLContextKey.JOB_META) + if job_meta.get(JobMetaKey.JOB_NAME) == "secret-job": + return False, f"Not authorized to execute: {command}" + else: + return True, "" + else: + return True, "" diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_job_level_authorization/security/site-1/resources.json b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_job_level_authorization/security/site-1/resources.json new file mode 100644 index 0000000000..14c465df84 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_client_side_job_level_authorization/security/site-1/resources.json @@ -0,0 +1,36 @@ +{ + "format_version": 2, + "client": { + 
"retry_timeout": 30, + "compression": "Gzip" + }, + "components": [ + { + "id": "resource_manager", + "path": "nvflare.app_common.resource_managers.gpu_resource_manager.GPUResourceManager", + "args": { + "num_of_gpus": 0, + "mem_per_gpu_in_GiB": 0 + } + }, + { + "id": "resource_consumer", + "path": "nvflare.app_common.resource_consumers.gpu_resource_consumer.GPUResourceConsumer", + "args": {} + }, + { + "id": "process_launcher", + "path": "nvflare.app_common.job_launcher.client_process_launcher.ClientProcessJobLauncher", + "args": {} + }, + { + "id": "error_log_sender", + "path": "nvflare.app_common.logging.log_sender.ErrorLogSender", + "args": {} + }, + { + "id": "security_handler", + "path": "security_handler.CustomSecurityHandler" + } + ] +} diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_server_side_authentication/edit_site_local_resources.py b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_server_side_authentication/edit_site_local_resources.py new file mode 100644 index 0000000000..d5faa325ca --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_server_side_authentication/edit_site_local_resources.py @@ -0,0 +1,64 @@ +# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. + + +import json +import os +import sys + + +def get_security_handler() -> dict: + return json.loads( + """ + { + "id": "security_handler", + "path": "security_handler.ServerCustomSecurityHandler" + } + """ + ) + + +def add_components_to_json(input_file_path, output_file_path, site: str): + + try: + with open(input_file_path, "r") as file: + data = json.load(file) + except (FileNotFoundError, json.JSONDecodeError): + print("Error: Unable to read or parse JSON file.") + return + + if "components" not in data or not isinstance(data["components"], list): + print("Error: 'components' key not found or is not a list.") + return + + new_components = [get_security_handler()] + + # Append new components to the list + data["components"].extend(new_components) + + # Write the updated JSON back to the file + with open(output_file_path, "w") as file: + json.dump(data, file, indent=4) + + print(f"Successfully generate file: '{output_file_path}'.") + + +if __name__ == "__main__": + + site_name = sys.argv[1] + project_root_dir = sys.argv[2] + + input_file_path = os.path.join(project_root_dir, site_name, "local", "resources.json.default") + output_file_path = os.path.join(project_root_dir, site_name, "local", "resources.json") + add_components_to_json(input_file_path, output_file_path, site_name) diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_server_side_authentication/security/server/custom/security_handler.py b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_server_side_authentication/security/server/custom/security_handler.py new file mode 100644 index 0000000000..33348e971a --- /dev/null +++ 
b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_server_side_authentication/security/server/custom/security_handler.py @@ -0,0 +1,30 @@ +# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from nvflare.apis.event_type import EventType +from nvflare.apis.fl_component import FLComponent +from nvflare.apis.fl_context import FLContext +from nvflare.apis.fl_exception import NotAuthenticated + + +class ServerCustomSecurityHandler(FLComponent): + def handle_event(self, event_type: str, fl_ctx: FLContext): + if event_type == EventType.CLIENT_REGISTER_RECEIVED: + self.authenticate(fl_ctx=fl_ctx) + + def authenticate(self, fl_ctx: FLContext): + peer_ctx: FLContext = fl_ctx.get_peer_context() + client_name = peer_ctx.get_identity_name() + if client_name == "site-2": + raise NotAuthenticated("site-2 not allowed to register") diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_server_side_authentication/server_side_security_plugin.ipynb b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_server_side_authentication/server_side_security_plugin.ipynb new file mode 100644 index 0000000000..a8c5aa48d6 --- /dev/null 
+++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/custom_server_side_authentication/server_side_security_plugin.ipynb @@ -0,0 +1,163 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "c1f37174", + "metadata": {}, + "source": [ + "# Server Side Customized Authentication\n", + "\n", + "In this example, we will see how to design a custom plugin with an additional authentication check.\n", + "As a result, of the two sites in POC, site-2 is NOT able to start and register with the server. It is blocked by the ServerCustomSecurityHandler logic during client registration.\n", + "\n", + "## Define a server side security handler\n", + "\n", + "Notice that the customized handler raises NotAuthenticated(\"site-2 not allowed to register\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "235ed0fa", + "metadata": {}, + "outputs": [], + "source": [ + "!cat security/server/custom/security_handler.py" + ] + }, + { + "cell_type": "markdown", + "id": "58c8f45a", + "metadata": {}, + "source": [ + "To register this plugin handler, we need to add this component to the server site's local configuration\n", + "\n", + "by adding it to the components array \n", + "\n", + "```\n", + " components: [\n", + " ...\n", + " {\n", + " \"id\": \"security_handler\",\n", + " \"path\": \"security_handler.ServerCustomSecurityHandler\"\n", + " }\n", + " ] \n", + "``` \n", + "\n", + "In this example, we will copy the \"custom\" folder and \n", + "\n", + "use the Python script ```edit_site_local_resources.py``` to create \"resources.json\" in the site's local directory\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d5bfe22f", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "! 
echo y | nvflare poc prepare" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4c63cb5a", + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "\n", + "!cp -r security/server/* /tmp/nvflare/poc/example_project/prod_00/server/local/.\n", + "!python edit_site_local_resources.py server /tmp/nvflare/poc/example_project/prod_00" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "7d425be8", + "metadata": {}, + "outputs": [], + "source": [ + "# double check\n", + "! tree /tmp/nvflare/poc/example_project/prod_00/server/local/ " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8a3eeae9", + "metadata": {}, + "outputs": [], + "source": [ + "!cat /tmp/nvflare/poc/example_project/prod_00/server/local/resources.json" + ] + }, + { + "cell_type": "markdown", + "id": "eb6e964e", + "metadata": {}, + "source": [ + "Now, go to a terminal and try to start FL system with \n", + "\n", + "```\n", + "\n", + "nvflare poc start -ex admin@nvidia.com\n", + "```\n", + "\n", + "See what happens" + ] + }, + { + "cell_type": "markdown", + "id": "f33e482c", + "metadata": {}, + "source": [ + "You should see something like this: \n", + "\n", + "\n", + "```\n", + "2025-02-02 16:35:40,059 - Communicator - INFO - Trying to register with server ...\n", + "2025-02-02 16:35:40,060 - ServerCustomSecurityHandler - ERROR - [identity=server, run=?, peer=site-2, peer_run=?] - Exception when handling event \"_client_register_received\": NotAuthenticated: site-2 not allowed to register\n", + "\n", + "\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "84ab1ecd", + "metadata": {}, + "outputs": [], + "source": [ + "# Clean up\n", + "! nvflare poc stop\n", + "! 
nvflare poc clean" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/customized_site_security.ipynb b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/customized_site_security.ipynb new file mode 100644 index 0000000000..13f321b533 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/customized_site_security.ipynb @@ -0,0 +1,73 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "0e725a2d-af59-43ae-8ef8-aeabe5581d4b", + "metadata": {}, + "source": [ + "# Site-specific Authentication and Federated Job-level Authorization\n", + "\n", + "Site-specific authentication and authorization allows users to inject their own authentication and authorization methods into the NVFlare system. 
This includes the FL server / clients registration, authentication, and the job deployment and run authorization.\n", + "\n", + "NVFlare provides a general purpose event based pluggable authentication and authorization framework to allow for expanding functionality such as:\n", + "\n", + "* exposing the app through a WAF (Web Application Firewall) or any other network element enforcing Mutual Transport Layer Security (mTLS)\n", + "\n", + "* using a confidential certification authority to ensure the identity of each participating site and to ensure that they meet the computing requirements for confidential computing\n", + "\n", + "* defining additional roles to manage who can submit which kind of jobs to execute within NVFlare, identify who submits jobs and which dataset can be accessed\n", + "\n", + "Users can write their own FLComponents, listening to the NVFlare system events at different points of their workflow, then easily plug in their authentication and authorization logic as needed.\n", + "\n", + "### Assumptions and Risks\n", + "By enabling the customized site-specific authentication and authorization, NVFlare will make several security-related data items available to the external FL components, e.g. IDENTITY_NAME, PUBLIC_KEY, CERTIFICATE, etc. In order to protect them from being compromised, that data needs to be made read-only.\n", + "\n", + "Because the authentication and authorization processes are external and pluggable, their results could prevent jobs from being deployed or run. When configuring and using these functions, the users need to be aware of the impact and know where to plug in the authentication and authorization check.\n", + "\n", + "### Event based pluggable authentication and authorization\n", + "The NVFlare event based solution supports site-specific authentication and federated job-level authorization. 
Users can provide and implement any sort of additional security checks by building and plugging in FLComponents which listen to the appropriate events and provide custom authentication and authorization functions.\n" + ] + }, + { + "cell_type": "markdown", + "id": "928772df", + "metadata": {}, + "source": [ + "Let's look at these mechanisms via the following examples:\n", + "\n", + "* [Customized server-side security check](./custom_server_side_authentication/server_side_security_plugin.ipynb)\n", + "* [Customized client-side job-level check](./custom_client_side_job_level_authorization/client_side_security_check.ipynb)\n", + "* [Client-side 3rd-party authentication integration](./custom_client_side_auth_system_integration/keycloak_integration.ipynb)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "fa94cace", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/unsafe_component_detection/unsafe_detection.ipynb b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_customized_site_security/unsafe_component_detection/unsafe_detection.ipynb new file mode 100644 index 0000000000..e69de29bb2 diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.5_communition_security/communication_security.ipynb 
b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.5_communition_security/communication_security.ipynb new file mode 100644 index 0000000000..f3c9fd76b9 --- /dev/null +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.5_communition_security/communication_security.ipynb @@ -0,0 +1,195 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Connection and Message Security\n", + "\n", + "Currently Flare's message security comes from mutual TLS: server and client authenticate each other when making direct connections. This means that only clients that have the right startup kits can make a connection to the server.\n", + "\n", + "However this solution may be hard for some customers due to their IT infrastructure policies. To enable customers to use their own connectivity solution, Flare now supports **BYOConn** (bring your own connectivity). \n", + "\n", + "Customer provided connectivity solution must meet the following requirements:\n", + "\n", + "* A client must be able to reach the server. The client and the server do not have to be connected directly. 
Any number of proxies can be used to provide the communication path between the client and the server.\n", + "* The communication path between the client and the server must ensure the confidentiality and integrity of messages.\n", + "\n", + "Since the connectivity solution may expose the server to the internet and allow any one to connect to the server, it is important to ensure that messages sent to the server are explicitly authenticated.\n", + "\n", + "Flare 2.5 and above support explicit message authentication: messages received by the server must have an auth token, and the token must be validated successfully to prove that it was issued by the server.\n", + "\n", + "\n", + "Here is how it works:\n", + "\n", + "* The very first thing after the client is started is to login to the server. The server and the client authenticate each other explicitly with the credentials in their startup kits. This step is independent of how the client and the server are connected (mTLS or normal TLS).\n", + "\n", + "* If the client credential is validated successfully, the server issues a token and a signature that binds the client name and token together. The signature is generated with the server's private key to prove that the signature can only be issued by the server.\n", + "\n", + "* When sending a message to the server, the client adds its name, token and the signature as headers to the message.\n", + "\n", + "* When the message is received, the server validates the token and the signature. Messages that are missing these headers or fail to validate will be rejected.\n", + "\n", + "\n", + "Note that this mechanism is based on the security of the startup kits: login processing and token signature are performed using the PKI credentials in the startup kits of the sites. 
All sites must protect their startup kits securely, and never share the received token and signature with others.\n", + "\n", + "\n", + "Message authentication applies to ALL messages that go through the Server, regardless of how the message is initiated (by FL clients, 3rd-party systems using client API, and even by the server itself). \n", + "\n", + "-------\n", + "## Provision Connection Security\n", + "\n", + "\n", + "The Provision system now allows you to specify connection security explicitly for each site. \n", + "\n", + "\n", + "TLS\n", + "\n", + "\n", + "This is normal TLS (i.e. 1-way SSL). Client certs are not required for establishing the connection, but a Root Cert is required to validate the server. You can provide a custom root cert for validating the server (usually the endpoint that represents the server, and the server is actually behind that endpoint). If you do not provide a custom root certificate, the root cert generated by the Provision tool will be used.\n", + "\n", + "\n", + "\n", + "\n", + "mTLS\n", + "\n", + "\n", + "This is mutual TLS (i.e. 2-way SSL). PKI credentials in the startup kits will be used for client/server connections.\n", + "\n", + "\n", + ">TLS and mTLS use two different ways for authenticating the peers. In TLS, the client authenticates the server, but the server does not authenticate the client. TLS is also called one-way SSL. During the SSL handshake, the server has to present evidence signed with its private key, and the client has to be able to validate the evidence with the server's public key (usually contained in the server's certificate). 
But the client does not need to present anything to the server.\n", + "\n", + "In mTLS, client and server authenticate each other, meaning that each must present evidence signed with its private key and each must be able to validate the other's evidence with the other's public key.\n", + "\n", + "So both sites using TLS does not mean mTLS. 
In fact, both sites must use the same mode (mTLS or TLS) at the same time\n", + "\n", + "\n", + "Clear\n", + "\n", + "\n", + "Messages are not encrypted. This is usually used when the server is deployed behind a proxy, and the communication between the proxy and the server is in clear text.\n", + "\n", + "\n", + "### Configuration\n", + "\n", + "\n", + "To configure connection security, you need to define the connection_security\n", + "and the custom_ca_cert property. You can define them at the project level and participant level. \n", + "\n", + "Here is a snippet that shows how to use these properties:\n", + "\n", + "```yaml\n", + "\n", + "api_version: 3\n", + "name: test25\n", + "description: NVIDIA FLARE sample project yaml file\n", + "connection_security: tls\n", + "custom_ca_cert: /path/of/customRoot.pem\n", + "\n", + "\n", + "participants:\n", + " # change example.com to the FQDN of the server\n", + " - name: server\n", + " type: server\n", + " org: nvidia\n", + " fed_learn_port: 8002\n", + " admin_port: 8003\n", + " connection_security: clear\n", + " - name: site-1\n", + " type: client\n", + " org: nvidia\n", + "\n", + "```\n", + "\n", + "In this example, the default connection security for the project is “tls” since it’s specified at the project level. However the server is configured to be “clear” since it is behind a secure proxy. All clients will use “tls” to connect to the proxy. The “custom_ca_cert” is actually the cert for the clients to validate the proxy server.\n", + "\n", + "The custom_ca_cert property is only used for server authentication when making a TLS connection. If not specified, the root CA cert generated by the Provision System will be used.\n", + "\n", + "If connection_security is not specified, the default will be mTLS.\n", + "\n", + "### When to Use Custom CA Certificates\n", + "\n", + "Customers may implement their connectivity using some proxies that will sit between FL Clients and the FL Server. 
FL Clients only directly connect to the proxy server, which then connects to the FL Server. Typically FL Clients will use one-way SSL to connect to the proxy server. In this case, the custom_ca_cert is the CA cert used by the proxy server.\n", + "\n", + "\n", + "-----------\n", + "## Multi-Address Support of FL Server\n", + "\n", + "Currently, the FL Server has a single address that must be used for all FL clients to access. Some customers find it to be very limiting, and they want to be able to use different addresses for different clients. Another limitation is that the server address must be specified as a domain name - IP addresses are not supported. Furthermore, the domain name also has to be specified as the “name” of the server in Provision config. Since the name is set to the “common name” of the server’s certificate, it cannot exceed 63 characters. This makes it impossible to use domain names longer than 63 characters.\n", + "\n", + "Flare 2.6 will support multiple addresses for the FL Server. In this case, the FL Server can expose multiple addresses that can be used for FL clients to connect. Depending on the customer’s IT policies, different FL clients may use different addresses. 
For example, the FL Server may provide two addresses, one accessible from the internet, another accessible only from the internal network.\n", + "\n", + "Server address can be specified as a domain name or an IP address.\n", + "\n", + "Example\n", + "\n", + "```yaml\n", + "\n", + "participants:\n", + " # change example.com to the FQDN of the server\n", + " - name: server\n", + " type: server\n", + " org: nvidia\n", + " fed_learn_port: 8002\n", + " admin_port: 8003\n", + " host_names: [localhost, 127.0.0.1]\n", + " default_host: localhost\n", + " # connection_security: insecure\n", + " - name: red\n", + " type: client\n", + " org: nvidia\n", + " connect_to: 127.0.0.1\n", + " - name: blue\n", + " type: client\n", + " org: nvidia\n", + " connect_to: localhost\n", + " - name: silver\n", + " type: client\n", + " org: nvidia\n", + " - name: admin@nvidia.com\n", + " type: admin\n", + " org: nvidia\n", + " role: project_admin\n", + " connect_to: 127.0.0.1\n", + "\n", + "```\n", + "\n", + "In this example, the FL Server defines additional two host names using the host_names property: localhost (a domain name) and 127.0.0.1 (an IP address). To be backward compatible, the name of the server is treated as the default address if the “default_host” property is not explicitly defined. In this example, the default address is explicitly defined as “localhost”.\n", + "\n", + "Addresses specified with the host_names property are limited to 255 characters.\n", + "\n", + "Three clients are defined here: red, blue, and silver. The “connect_to” property specifies the address to use for the client. 
Of course, the specified address must be available from the server.\n", + "\n", + "In this example, client “red” will connect to 127.0.0.1; client “blue” will connect to “localhost”; client “silver” will connect to the default address of the server, which is “localhost”.\n", + "\n", + "The admin client will connect to 127.0.0.1.\n", + "\n", + "For this configuration to work, the IT Administrator of the FL Server must ensure that the specified addresses are actually accessible.\n", + "\n", + "## Connection Security and secure_train Flag\n", + "\n", + "Connection Security specifies how the connection will be made (Clear, TLS or mTLS), which is independent of whether training is secure mode or not. If conn sec is not explicitly specified for a site, then its conn sec will be decided by the “secure_train” flag. If secure_train is True, then use mTLS, otherwise use clear. This is to be backward compatible. In the past, these two things were treated as one.\n", + "\n", + "The secure_train flag applies to the whole project - you cannot have the case that this flag is True for some sites and False for another.\n", + "\n", + "Connection Security is site specific - different sites can use different settings as long as connections can be made properly. For example, the Server may define conn sec to be “clear”, whereas Site-1 may define it to be TLS, and Site-2 may define it to be mTLS. You may wonder how this could ever work. It of course won’t work if the two sites connect to the Server directly. But in the case of BYOConn, you never know how the customer will set up their communication network. For example, if they use proxies between the Server and the sites, the sites won’t directly connect to the Server. The customer will have to set up the conn sec properly so that the sites can connect to their proxies properly.\n", + "\n", + "The secure_train flag also triggers privacy protection features (e.g. loading privacy resources from the site’s local folder). 
This flag is always set to True, except when using the Simulator.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] + } + ], + "metadata": { + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_recap/recap.ipynb b/examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.8_recap/recap.ipynb similarity index 100% rename from examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.4_recap/recap.ipynb rename to examples/tutorials/self-paced-training/part-3_security_and_privacy/chapter-6_Security_in_federated_compute_system/06.8_recap/recap.ipynb diff --git a/examples/tutorials/self-paced-training/part-3_security_and_privacy/part-3_introduction.ipynb b/examples/tutorials/self-paced-training/part-3_security_and_privacy/part-3_introduction.ipynb index 3d062d9e51..ca7c151d09 100644 --- a/examples/tutorials/self-paced-training/part-3_security_and_privacy/part-3_introduction.ipynb +++ b/examples/tutorials/self-paced-training/part-3_security_and_privacy/part-3_introduction.ipynb @@ -11,9 +11,10 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "[Chapter 3.1 Privacy in Federated Learning](./chapter-3.1_Privacy_In_Federated_Learning/03.1.0_introduction.ipynb)\n", + "[Chapter 3.1 Privacy in Federated Learning](./chapter-5_Privacy_In_Federated_Learning/05.0_introduction/introduction.ipynb)\n", + "\n", + "[Chapter 3.2 Security in Federated Computing System](./chapter-6_Security_in_federated_compute_system/06.0_introduction/introduction.ipynb)\n", "\n", - "[Chapter 3.2 Security in Federated Computing System](chapter-3.2_Security_in_federated_compute_system/03.2.0_introduction.ipynb)\n", "\n" ] }, @@ -21,9 +22,153 @@ "cell_type": "markdown", 
"metadata": {}, "source": [ - "Federated Learning (FL) has emerged as a groundbreaking approach to distributed machine learning, enabling collaborative model training without sharing raw data. This paradigm is particularly vital for sensitive domains like healthcare, finance, and smart cities, where data privacy is paramount. However, the distributed nature of FL introduces unique security and privacy challenges, such as safeguarding against data leakage, adversarial attacks, and ensuring the integrity of model updates. NVIDIA FLARE addresses these concerns by providing a robust, extensible framework designed for secure and privacy-preserving FL workflows. By incorporating advanced cryptographic techniques, secure aggregation protocols, and role-based access control, NVIDIA FLARE empowers organizations to harness the full potential of FL while mitigating risks associated with data and model vulnerabilities. This ensures that collaborative machine learning remains not only effective but also trustworthy.\n", + "Federated Learning (FL) has emerged as a groundbreaking approach to distributed machine learning, enabling collaborative model training without sharing raw data. This paradigm is particularly vital for sensitive domains like healthcare, finance, and smart cities, where data privacy is paramount. However, the distributed nature of FL introduces unique security and privacy challenges, such as safeguarding against data leakage, adversarial attacks, and ensuring the integrity of model updates. NVIDIA FLARE addresses these concerns by providing a robust, extensible framework designed for secure and privacy-preserving FL workflows. By incorporating advanced cryptographic techniques, secure aggregation protocols, and role-based access control, NVIDIA FLARE empowers organizations to harness the full potential of FL while mitigating risks associated with data and model vulnerabilities. 
This ensures that collaborative machine learning remains not only effective but also trustworthy.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "---\n", + "\n", + "# **Introduction to Federated Learning Security and Privacy** \n", + "\n", + "Federated Learning (FL) is a decentralized machine learning paradigm that enables multiple participants to collaboratively train a global model without sharing raw data. This approach enhances privacy and efficiency but also introduces security and privacy challenges unique to distributed learning environments. Ensuring robust FL deployments requires addressing both **privacy protection** and **security mechanisms**, as well as mitigating potential attacks. \n", + "\n", + "This discussion is structured as follows: \n", + "1. **Security and privacy protection in FL** (general overview). \n", + "2. **Privacy attacks and protection approaches** (threats and defenses). \n", + "3. **Security aspects of FL** (authentication, authorization, communication, and trust mechanisms). \n", + "\n", + "--- \n", + "\n", + "## **1. Security and Privacy Protection in Federated Learning** \n", + "\n", + "FL improves data privacy by keeping raw data localized on client devices or within institutional boundaries. However, it is still vulnerable to privacy leaks through model updates, and security threats that may compromise the integrity and trustworthiness of the learning process. \n", + "\n", + "### **1.1 Privacy in FL** \n", + "- **Privacy-Preserving Nature**: Unlike centralized learning, FL ensures that sensitive user data remains local, reducing exposure risks. \n", + "- **Threats to Privacy**: Even though raw data isn't shared, model updates (gradients, weights) can still reveal private information through reconstruction attacks. 
\n", + "- **Privacy-Preserving Techniques**: Differential privacy, secure aggregation, and homomorphic encryption are commonly employed to mitigate risks. \n", + "\n", + "### **1.2 Security in FL** \n", + "- **Threat Landscape**: FL is vulnerable to adversarial attacks, model poisoning, and communication threats that can compromise model performance and security. \n", + "- **Trust Management**: Since multiple untrusted clients contribute to the global model, FL requires robust mechanisms for authentication, authorization, and trust evaluation. \n", + "\n", + "---\n", + "\n", + "## **2. Privacy Attacks and Protection Approaches in FL** \n", + "\n", + "Despite keeping raw data local, FL is vulnerable to privacy leaks through indirect means. Below are the major privacy attacks and their respective protection strategies. \n", + "\n", + "### **2.1 Privacy Attacks** \n", + "\n", + "#### **2.1.1 Gradient Leakage & Model Inversion** \n", + "- Attackers analyze model gradients to reconstruct original training data. \n", + "- **Example**: A malicious server infers personal images or text from gradient updates. \n", + "- **Protection**: Differential privacy (adds noise to gradients), homomorphic encryption (encrypts updates before sharing). \n", + "\n", + "#### **2.1.2 Membership Inference Attacks** \n", + "- Adversaries determine whether a specific data sample was used in model training. \n", + "- **Protection**: Differential privacy, adversarial regularization, and dropout techniques. \n", + "\n", + "#### **2.1.3 Property Inference Attacks** \n", + "- Attackers infer sensitive attributes about the training data, even if they cannot fully reconstruct it. \n", + "- **Protection**: Private set intersection (PSI) to limit exposure, feature obfuscation. 
\n", + "\n", + "### **2.2 Privacy Protection Approaches** \n", + "\n", + "#### **2.2.1 Differential Privacy (DP)** \n", + "- Introduces controlled noise to training updates to prevent individual data points from being distinguishable. \n", + "- **Common Methods**: Local DP (applied at the client level), Global DP (applied at the server). \n", + "\n", + "#### **2.2.2 Secure Multi-Party Computation (SMPC)** \n", + "- Allows multiple participants to jointly compute a function without revealing their inputs. \n", + "- **Example**: Clients split each update into secret shares so that the aggregate can be computed without any party seeing an individual update. \n", + "\n", + "#### **2.2.3 Homomorphic Encryption (HE)** \n", + "- Enables computations on encrypted data without decryption. \n", + "- **Challenge**: High computational overhead on edge devices. \n", + "\n", + "#### **2.2.4 Secure Aggregation** \n", + "- Ensures that individual updates remain hidden by aggregating encrypted updates from multiple participants before decryption. \n", + "- **Example**: Federated averaging with secure aggregation to mask individual updates. \n", + "\n", + "---\n", + "\n", + "## **3. Security Aspects of Federated Learning Systems** \n", + "\n", + "FL requires robust security mechanisms to ensure that only legitimate and trusted participants contribute, while also protecting communication channels and enforcing authorization policies. Below are the critical security components of an FL system. \n", + "\n", + "### **3.1 Authentication Mechanisms** \n", + "Ensures that only verified clients and servers participate in the FL process. \n", + "\n", + "#### **3.1.1 Public Key Infrastructure (PKI) & Digital Signatures** \n", + "- Each participant has a cryptographic key pair for identity verification. \n", + "- Prevents impersonation attacks. \n", + " \n", + "---\n", + "\n", + "### **3.2 Authorization & Access Control** \n", + "Ensures that only authorized participants can contribute to or access the FL model. 
\n", + "\n", + "#### **3.2.1 Role-Based Access Control (RBAC)** \n", + "- Assigns permissions based on predefined roles (e.g., model trainer, auditor). \n", + "- Prevents unauthorized modification of the global model. \n", + "\n", + "#### **3.2.2 Attribute-Based Access Control (ABAC)** \n", + "- Extends RBAC by dynamically evaluating client attributes such as reputation or past behavior. \n", + " \n", + "---\n", + "\n", + "### **3.3 Secure Communication Protocols** \n", + "Protects FL updates from eavesdropping, interception, and tampering. \n", + "\n", + "#### **3.3.1 End-to-End Encryption (E2EE)** \n", + "- Ensures that model updates remain encrypted during transmission. \n", + "- Prevents man-in-the-middle (MitM) attacks. \n", + "\n", + "#### **3.3.2 Transport Layer Security (TLS) & Secure Channels** \n", + "- Encrypts communication channels between FL participants. \n", + "- **gRPC with TLS**: Secure, efficient communication for FL. \n", + "\n", + "---\n", + "\n", + "### **3.4 Trust and Reputation Mechanisms** \n", + "FL relies on trust-based mechanisms to handle the participation of potentially untrusted clients. \n", + "\n", + "#### **3.4.1 Trust-Based Client Selection** \n", + "- Assigns reputation scores based on previous behavior. \n", + "- Malicious or unreliable clients are gradually excluded. \n", + "\n", + "#### **3.4.2 Federated Auditing and Verifiable Training** \n", + "- Verifies that clients follow the protocol and do not inject poisoned updates. \n", + "\n", + "#### **3.4.3 TEE-Based Trust Management in Federated Learning**\n", + "- A Trusted Execution Environment (TEE) is a secure VM or process that isolates sensitive computations from the rest of the system. 
It provides:\n", + "* Confidentiality: Prevents unauthorized access to sensitive data.\n", + "* Integrity: Ensures code and data within the TEE cannot be tampered with.\n", + "* Remote Attestation: Allows verification that computations are performed inside a trusted environment.\n", + "\n", + "---\n", + "\n", + "## **Conclusion** \n", + "\n", + "Federated Learning introduces significant security and privacy challenges, requiring a multi-layered approach to protection. \n", + "\n", + "1. **Privacy Protection**: Techniques like differential privacy, secure aggregation, and homomorphic encryption mitigate privacy risks. \n", + "2. **Security Measures**: Authentication, authorization, encrypted communication, and trust mechanisms secure the FL ecosystem against adversarial threats. \n", + "3. **Resilience to Attacks**: Byzantine-resilient aggregation, anomaly detection, and blockchain-based trust management improve FL security. \n", + "\n", + "As FL adoption expands in industries like healthcare, finance, and edge AI, addressing these concerns will be crucial for its long-term success. \n", "\n", - "In this part, we will have two chapters with one focused on privacy and another focused on security. " + "In this part, we will discuss how NVIDIA FLARE implements many of the aspects discussed here.\n" ] }, {