
Commit 3d446fa

Cohort-parallel Federated Learning

81 files changed: 8,514 additions & 0 deletions

.gitignore

Lines changed: 14 additions & 0 deletions
.idea/**
.DS_Store
data/**

__pycache__/
*.py[cod]
simulations/data/**
simulations/dfl/data/**
simulations/dl/data/**
simulations/gl/data/**

scripts/data/**
*.log
*.txt

.gitmodules

Lines changed: 3 additions & 0 deletions
[submodule "pyipv8"]
	path = pyipv8
	url = https://github.com/devos50/py-ipv8

README.md

Lines changed: 69 additions & 0 deletions
# Cohort-Parallel Federated Learning (CPFL)

Repository for the source code of our paper *[Harnessing Increased Client Participation with Cohort-Parallel Federated Learning](https://arxiv.org/pdf/2405.15644)*, published at the [5th Workshop on Machine Learning and Systems (EuroMLSys 2025)](https://euromlsys.eu/#).

## Abstract

Federated learning (FL) is a machine learning approach where nodes collaboratively train a global model.
As more nodes participate in a round of FL, the effectiveness of each individual model update diminishes.
In this study, we increase the effectiveness of client updates by dividing the network into smaller partitions, or _cohorts_.
We introduce Cohort-Parallel Federated Learning (CPFL): a novel learning approach in which each cohort independently trains a global model using FL until convergence, after which the models produced by the cohorts are unified using knowledge distillation.
The insight behind CPFL is that smaller, isolated networks converge more quickly than a single network in which all nodes participate.
Through exhaustive experiments involving realistic traces and non-IID data distributions on the CIFAR-10 and FEMNIST image classification tasks, we investigate the balance between the number of cohorts, model accuracy, training time, and compute resources.
Compared to traditional FL, CPFL with four cohorts, a non-IID data distribution, and CIFAR-10 yields a 1.9x reduction in train time and a 1.3x reduction in resource usage, with a minimal drop in test accuracy.

## Installation

Start by cloning the repository recursively (since CPFL depends on the PyIPv8 networking library):

```
git clone git@github.com:sacs-epfl/cpfl.git --recursive
```

Install the required dependencies (preferably in a virtual environment to avoid conflicts with existing libraries):

```
pip install -r requirements.txt
```

In our paper, we evaluate CPFL using the CIFAR-10 and FEMNIST datasets.
For CIFAR-10 we use `torchvision`. The FEMNIST dataset has to be downloaded manually; we refer the reader to the [decentralizepy framework](https://github.com/sacs-epfl/decentralizepy), which uses the same dataset.

## Running CPFL

Training with CPFL can be done by invoking the following scripts from the root of the repository:

```
# Running with the CIFAR-10 dataset
bash scripts/cohorts/run_e2e_cifar10.sh <number_of_cohorts> <seed> <alpha> <peers>

# Running with the FEMNIST dataset
bash scripts/cohorts/run_e2e_femnist.sh <number_of_cohorts> <seed>
```

We refer to the respective bash scripts for more configuration options, such as the number of local steps, the number of participants, and other learning parameters.

The script first splits the data across participants and the participants across cohorts. These assignments are used during the distillation process; a minimal sketch of such a cohort assignment is shown below.
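
For illustration, here is a minimal sketch of how participants could be split across cohorts. The function name, round-robin strategy, and parameters below are assumptions for exposition, not the repository's exact implementation:

```
import random

def assign_cohorts(num_participants: int, num_cohorts: int, seed: int) -> dict:
    # Hypothetical sketch: shuffle the participant indices with a fixed seed
    # and deal them round-robin into cohorts of near-equal size.
    rng = random.Random(seed)
    participants = list(range(num_participants))
    rng.shuffle(participants)
    cohorts = {c: [] for c in range(num_cohorts)}
    for idx, participant in enumerate(participants):
        cohorts[idx % num_cohorts].append(participant)
    return cohorts

# Example: 200 participants split across 10 cohorts with seed 24082
cohorts = assign_cohorts(200, 10, seed=24082)
print({c: len(members) for c, members in cohorts.items()})  # 10 cohorts of 20 peers
```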

Then, during FL training, each cohort periodically checkpoints the current global model, as well as the best model so far (based on the loss on a validation set). The output of this experiment can be found in a separate folder in the `data` directory.

After training, the checkpointed models can be distilled into a single model using the following command:

```
python3 scripts/distill.py $PWD/data n_200_cifar10_dirichlet0.100000_sd24082_ct10_dfl cifar10 stl10 --cohort-file cohorts/cohorts_cifar10_n200_c10.txt --public-data-dir <path_to_public_data> --learning-rate 0.001 --momentum 0.9 --partitioner dirichlet --alpha 0.1 --weighting-scheme label --check-teachers-accuracy > output_distill.log 2>&1
```

The above command invokes the `distill.py` script, which scans the models in the `n_200_cifar10_dirichlet0.100000_sd24082_ct10_dfl` directory (created by the previous experiment) and merges them.
The command also requires the path to the cohort information file created during the previous steps.
The `distill.py` script automatically determines the accuracy attained by the resulting model after distillation.
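
To illustrate the merging step, below is a minimal, self-contained sketch of ensemble knowledge distillation in PyTorch: a student model is trained on public data to match the averaged softened predictions of the per-cohort teacher models. The uniform teacher weighting, temperature, and training loop here are illustrative assumptions that simplify the `--weighting-scheme label` option used above; this is not the exact logic of `distill.py`:

```
import torch
import torch.nn.functional as F

def distill(student, teachers, public_loader, epochs=1, lr=0.001, temperature=3.0):
    # Hypothetical sketch: distill an ensemble of cohort models (teachers)
    # into a single student model using an unlabeled public dataset.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    student.to(device).train()
    for teacher in teachers:
        teacher.to(device).eval()
    optimizer = torch.optim.SGD(student.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for inputs, _ in public_loader:  # labels of the public data are unused
            inputs = inputs.to(device)
            with torch.no_grad():
                # Average the teachers' temperature-softened class probabilities
                # (uniform weighting; the paper also weights teachers by label counts)
                teacher_probs = torch.stack(
                    [F.softmax(t(inputs) / temperature, dim=1) for t in teachers]
                ).mean(dim=0)
            student_log_probs = F.log_softmax(student(inputs) / temperature, dim=1)
            loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student
```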

## Reference

If you find our work useful, you can cite us as follows:

```
@inproceedings{dhasade2025cpfl,
  title={Harnessing Increased Client Participation with Cohort-Parallel Federated Learning},
  author={Dhasade, Akash and Kermarrec, Anne-Marie and Nguyen, Tuan-Anh and Pires, Rafael and de Vos, Martijn},
  booktitle={Proceedings of the 5th Workshop on Machine Learning and Systems},
  year={2025}
}
```

cpfl/__init__.py

Lines changed: 3 additions & 0 deletions
"""
Contains code related to the DFL framework.
"""

cpfl/core/__init__.py

Lines changed: 10 additions & 0 deletions
from enum import Enum


class TransmissionMethod(Enum):
    EVA = 0


class NodeMembershipChange(Enum):
    JOIN = 0
    LEAVE = 1

cpfl/core/community.py

Lines changed: 160 additions & 0 deletions
import asyncio
import time
from asyncio import Future, ensure_future
from binascii import unhexlify, hexlify
from typing import Optional, Callable, Dict, List

from cpfl.core import TransmissionMethod
from cpfl.core.model_manager import ModelManager
from cpfl.core.peer_manager import PeerManager
from cpfl.core.session_settings import SessionSettings
from cpfl.util.eva.protocol import EVAProtocol
from cpfl.util.eva.result import TransferResult

from ipv8.community import Community
from ipv8.requestcache import RequestCache
from ipv8.types import Peer


class LearningCommunity(Community):
    community_id = unhexlify('d5889074c1e4c60423cdb6e9307ba0ca5695ead7')

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.request_cache = RequestCache()
        self.my_id = self.my_peer.public_key.key_to_bin()
        self.round_complete_callback: Optional[Callable] = None
        self.aggregate_complete_callback: Optional[Callable] = None

        self.peers_list: List[Peer] = []

        # Settings
        self.settings: Optional[SessionSettings] = None

        # State
        self.is_active = False
        self.did_setup = False
        self.shutting_down = False

        # Components
        self.peer_manager: PeerManager = PeerManager(self.my_id, 100000)
        self.model_manager: Optional[ModelManager] = None  # Initialized when the process is set up

        # Model exchange parameters
        self.eva = EVAProtocol(self, self.on_receive, self.on_send_complete, self.on_error)

        # Availability traces
        self.traces: Optional[Dict] = None
        self.traces_count: int = 0

        self.logger.info("The %s started with peer ID: %s", self.__class__.__name__,
                         self.peer_manager.get_my_short_id())

    def start(self):
        """
        Start to participate in the training process.
        """
        assert self.did_setup, "Process has not been set up - call setup() first"
        self.is_active = True

    def set_traces(self, traces: Dict) -> None:
        self.traces = traces
        events: int = 0

        # Schedule the join/leave events
        for active_timestamp in self.traces["active"]:
            if active_timestamp == 0:
                continue  # We assume peers will be online at t=0

            self.register_anonymous_task("join", self.go_online, delay=active_timestamp)
            events += 1

        for inactive_timestamp in self.traces["inactive"]:
            self.register_anonymous_task("leave", self.go_offline, delay=inactive_timestamp)
            events += 1

        self.logger.info("Scheduled %d join/leave events for peer %s (trace length in sec: %d)", events,
                         self.peer_manager.get_my_short_id(), traces["finish_time"])

        # Schedule the next call to set_traces so the trace repeats after it finishes
        self.register_task("reapply-trace-%s-%d" % (self.peer_manager.get_my_short_id(), self.traces_count),
                           self.set_traces, self.traces, delay=self.traces["finish_time"])
        self.traces_count += 1

    def go_online(self):
        self.is_active = True
        cur_time = asyncio.get_event_loop().time() if self.settings.is_simulation else time.time()
        self.logger.info("Participant %s comes online (t=%d)", self.peer_manager.get_my_short_id(), cur_time)

    def go_offline(self, graceful: bool = True):
        self.is_active = False
        cur_time = asyncio.get_event_loop().time() if self.settings.is_simulation else time.time()
        self.logger.info("Participant %s will go offline (t=%d)", self.peer_manager.get_my_short_id(), cur_time)

    def setup(self, settings: SessionSettings):
        self.settings = settings
        for participant in settings.participants:
            self.peer_manager.add_peer(unhexlify(participant))

        # Initialize the model
        participant_index = settings.all_participants.index(hexlify(self.my_id).decode())
        self.model_manager = ModelManager(None, settings, participant_index)

        # Set up the model transmission
        if self.settings.transmission_method == TransmissionMethod.EVA:
            self.logger.info("Setting up EVA protocol")
            self.eva.settings.block_size = settings.eva_block_size
            self.eva.settings.max_simultaneous_transfers = settings.eva_max_simultaneous_transfers
        else:
            raise RuntimeError("Unsupported transmission method %s" % self.settings.transmission_method)

        self.did_setup = True

    def get_peers(self):
        if self.peers_list:
            return self.peers_list
        return super().get_peers()

    def get_peer_by_pk(self, target_pk: bytes):
        peers = list(self.get_peers())
        for peer in peers:
            if peer.public_key.key_to_bin() == target_pk:
                return peer
        return None

    def on_eva_send_done(self, future: Future, peer: Peer, serialized_response: bytes, binary_data: bytes, start_time: float):
        if future.cancelled():  # Do not reschedule if the future was cancelled
            return

        if future.exception():
            peer_id = self.peer_manager.get_short_id(peer.public_key.key_to_bin())
            self.logger.warning("Transfer to participant %s failed, scheduling it again (Exception: %s)",
                                peer_id, future.exception())
            # The transfer failed - try it again after some delay
            ensure_future(asyncio.sleep(self.settings.model_send_delay)).add_done_callback(
                lambda _: self.schedule_eva_send_model(peer, serialized_response, binary_data, start_time))
        else:
            # The transfer seems to be completed - record the transfer time
            end_time = asyncio.get_event_loop().time() if self.settings.is_simulation else time.time()

    def schedule_eva_send_model(self, peer: Peer, serialized_response: bytes, binary_data: bytes, start_time: float) -> Future:
        # Schedule the transfer
        future = ensure_future(self.eva.send_binary(peer, serialized_response, binary_data))
        future.add_done_callback(lambda f: self.on_eva_send_done(f, peer, serialized_response, binary_data, start_time))
        return future

    async def on_receive(self, result: TransferResult):
        raise NotImplementedError()

    async def on_send_complete(self, result: TransferResult):
        peer_id = self.peer_manager.get_short_id(result.peer.public_key.key_to_bin())
        my_peer_id = self.peer_manager.get_my_short_id()
        self.logger.info(f'Outgoing transfer {my_peer_id} -> {peer_id} has completed: {result.info.decode()}')

    async def on_error(self, peer, exception):
        self.logger.error(f'An error has occurred in transfer to peer {peer}: {exception}')

    async def unload(self):
        self.shutting_down = True
        await self.request_cache.shutdown()
        await super().unload()
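
For context, `set_traces` expects a dictionary with `active`, `inactive`, and `finish_time` keys (as read in the code above). A minimal, hypothetical trace for a set-up `LearningCommunity` instance could look as follows; the values are illustrative:

```
# Hypothetical availability trace: online at t=0, offline after 600 s,
# back online at 1200 s; the trace repeats every 3600 s (its finish_time).
trace = {
    "active": [0, 1200],    # timestamps (s) at which the peer comes online
    "inactive": [600],      # timestamps (s) at which the peer goes offline
    "finish_time": 3600,    # trace length (s); set_traces re-applies itself after this
}

community.set_traces(trace)  # 'community' is an instantiated LearningCommunity
```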
