Too many pings and one client always disconnects #4300

Open
ajulyav opened this issue Oct 7, 2024 · 7 comments
Labels
bug (Something isn't working) · part: communication (Issues/PRs that affect federated communication, e.g. gRPC) · stale (If issue/PR hasn't been updated within 3 weeks)

Comments

@ajulyav

ajulyav commented Oct 7, 2024

Describe the bug

grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Too many pings"
	debug_error_string = "UNKNOWN:Error received from peer ipv4:192.168.229.99:5040 {grpc_message:"Too many pings", grpc_status:14, created_time:"2024-10-07T15:40:46.164225255+02:00"}"
>

I've got my grpc server settings as:

        ("grpc.http2.max_pings_without_data", 0),
        # Is it permissible to send keepalive pings from the client without
        # any outstanding streams. More explanation here:
        # https://github.com/adap/flower/pull/2197
        ("grpc.keepalive_permit_without_calls", 0),

but it did not help.

Later, I added two more options:

        ("grpc.http2.max_ping_strikes", 0),
        ("grpc.http2.min_ping_interval_without_data_ms", 10)

This allowed me to escape the initial error, but then I get:

    raise GrpcBridgeClosed()
flwr.server.superlink.fleet.grpc_bidi.grpc_bridge.GrpcBridgeClosed
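
For reference, here is one way those keepalive/ping options could be combined when building a gRPC server by hand with the standard grpc Python API. Flower constructs its gRPC server internally, so treat this as a sketch of the option set rather than the exact integration point; build_server and the worker count are placeholders:

    from concurrent import futures

    import grpc

    # Consolidated option set; the values mirror the ones tried above.
    GRPC_SERVER_OPTIONS = [
        # No limit on the number of pings sent when there is no data to send.
        ("grpc.http2.max_pings_without_data", 0),
        # Is it permissible to send keepalive pings from the client without
        # any outstanding streams (0 = no).
        ("grpc.keepalive_permit_without_calls", 0),
        # 0 = the server tolerates any number of "bad" pings before closing
        # the connection with GOAWAY ("Too many pings").
        ("grpc.http2.max_ping_strikes", 0),
        # Minimum interval (ms) the server expects between pings that carry
        # no data; 10 ms is very permissive.
        ("grpc.http2.min_ping_interval_without_data_ms", 10),
    ]

    def build_server() -> grpc.Server:
        # grpc.server() accepts the same (key, value) option tuples.
        return grpc.server(
            futures.ThreadPoolExecutor(max_workers=10),
            options=GRPC_SERVER_OPTIONS,
        )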

Steps/Code to Reproduce

I use a basic FedAvg strategy, except that I send an additional round of evaluation to each client during aggregate_fit:

    EvaluateRes = client_proxy.evaluate(ins=evaluate_ins, timeout=None, group_id=rnd)

Sometimes, when I rerun the clients and server, the error happens after one successful round, so it does not always happen at the same moment.
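
For context, here is a rough sketch of the strategy override described above. MyFedAvg is a placeholder name, the per-client evaluate call inside aggregate_fit mirrors the line quoted above rather than an official Flower recipe, and the group_id argument assumes a Flower version whose ClientProxy.evaluate accepts it:

    from typing import Dict, List, Optional, Tuple, Union

    import flwr as fl
    from flwr.common import EvaluateIns, FitRes, Parameters, Scalar
    from flwr.server.client_proxy import ClientProxy

    class MyFedAvg(fl.server.strategy.FedAvg):
        """FedAvg plus an extra evaluation round on each client during aggregate_fit."""

        def aggregate_fit(
            self,
            server_round: int,
            results: List[Tuple[ClientProxy, FitRes]],
            failures: List[Union[Tuple[ClientProxy, FitRes], BaseException]],
        ) -> Tuple[Optional[Parameters], Dict[str, Scalar]]:
            # Standard FedAvg aggregation first.
            aggregated, metrics = super().aggregate_fit(server_round, results, failures)

            if aggregated is not None:
                evaluate_ins = EvaluateIns(parameters=aggregated, config={})
                for client_proxy, _ in results:
                    # Extra per-client evaluation, as described in the report.
                    _ = client_proxy.evaluate(
                        ins=evaluate_ins, timeout=None, group_id=server_round
                    )

            return aggregated, metrics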

Expected Results

Client stays alive

Actual Results

Client disconnects

ajulyav added the bug label on Oct 7, 2024
@oabuhamdan

Did you come to a solution?

ajulyav closed this as completed on Oct 24, 2024
ajulyav reopened this on Oct 24, 2024
ajulyav closed this as completed on Oct 24, 2024
@ajulyav
Author

ajulyav commented Oct 24, 2024

> Did you come to a solution?

Hello, I am still encountering this problem, and it occurs quite randomly. A few things have helped me reduce the frequency of this issue:

  1. Run the server and clients on the same machine, so you can use "localhost" as the server address.
  2. If you're using loops that send messages to the clients one by one, try replacing the loop with non-loop code (see the sketch below for one interpretation).
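
For what it is worth, here is a minimal sketch of one reading of point 2, assuming the loop in question is a sequential per-client evaluate loop inside the strategy. evaluate_all_clients is a hypothetical helper, and dispatching the calls concurrently via a thread pool is an interpretation of "non-loop code", not a confirmed fix:

    from concurrent.futures import ThreadPoolExecutor

    def evaluate_all_clients(client_proxies, evaluate_ins, server_round):
        """Dispatch the evaluate calls concurrently instead of one client at a time."""
        with ThreadPoolExecutor(max_workers=max(1, len(client_proxies))) as pool:
            futures = [
                pool.submit(
                    proxy.evaluate,
                    ins=evaluate_ins,
                    timeout=None,
                    group_id=server_round,
                )
                for proxy in client_proxies
            ]
            # Collect the results; an exception from any client surfaces here.
            return [f.result() for f in futures]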

ajulyav reopened this on Oct 24, 2024
WilliamLindskog added the stale and part: communication labels on Dec 11, 2024
@WilliamLindskog
Contributor

Hi @ajulyav,

Thanks for raising this. Are you still experiencing this issue?

@WilliamLindskog
Contributor

Hi @ajulyav,

Could you please paste the code that you used to produce this error?

@ajulyav
Author

ajulyav commented Mar 7, 2025

@WilliamLindskog Hello! I'll try to send you the code, though it's not really code-specific. What I've noticed is that even with the same code/dataset, running it on multiple nodes can cause issues after a few rounds (or sometimes just one round). However, running it on a single node seems to be more stable with some tricks.

I'll try to come back with more details, including the code and sbatch script. Thank you!

ajulyav closed this as completed on Mar 7, 2025
ajulyav reopened this on Mar 7, 2025
@ajulyav
Author

ajulyav commented Mar 12, 2025

So, yesterday I ran a simple experiment with 4 clients and 1 server.

I got this error on only 1 client after 4 global rounds:

 File "main_cnn.py", line 77, in <module>
    Main(args)
  File "main_cnn.py", line 67, in Main
    fl.client.start_client(server_address="localhost:8005", client=trainer.to_client())  
  File "/home/user/.conda/envs/Flower/lib/python3.8/site-packages/flwr/client/app.py", line 157, in start_client
    _start_client_internal(
  File "/home/user/.conda/envs/Flower/lib/python3.8/site-packages/flwr/client/app.py", line 333, in _start_client_internal
    message = receive()
  File "/home/user/.conda/envs/Flower/lib/python3.8/site-packages/flwr/client/grpc_client/connection.py", line 144, in receive
    proto = next(server_message_iterator)
  File "/home/user/.conda/envs/Flower/lib/python3.8/site-packages/grpc/_channel.py", line 543, in __next__
    return self._next()
  File "/home/user/.conda/envs/Flower/lib/python3.8/site-packages/grpc/_channel.py", line 952, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "Too many pings"
	debug_error_string = "UNKNOWN:Error received from peer ipv4:127.0.0.1:8005 {created_time:"2025-03-11T18:13:54.91980787+01:00", grpc_status:14, grpc_message:"Too many pings"}"
>

My training code is quite simple and identical across all clients, yet it did not cause the same issue for the other clients:

def train_epoch(self):

    train_loss = 0.0
    train_ph_acc = 0.0

    self.model.train()
    for bidx, batch in enumerate(self.train_loader):
        self.optimizer.zero_grad()

        batch_input, batch_ph = batch
        batch_input, batch_ph = batch_input.cuda(), batch_ph.cuda()

        with torch.cuda.amp.autocast(enabled=self.use_amp):
            pred_ph = self.model(batch_input, None)
            loss = self.loss_func(batch_ph, pred_ph)

        train_loss += loss.item()
        self.scaler.scale(loss).backward()
        self.scaler.step(self.optimizer)
        self.scaler.update()

        torch.cuda.synchronize()
        # some code for logging and computing some metrics on the client side

    return train_loss, train_ph_acc

So, I assume that the problem is not in the user code.

ajulyav closed this as completed on Mar 12, 2025
ajulyav reopened this on Mar 12, 2025
@WilliamLindskog
Contributor

Hi @ajulyav,

From what I can see, this code is based on a version of Flower that is no longer supported. Have you tested newer examples like https://github.com/adap/flower/tree/main/examples/quickstart-pytorch?

You can reproduce your set-up by changing the number of supernodes in pyproject.toml.

options.num-supernodes = 4
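
For reference, the relevant section of the quickstart's pyproject.toml looks roughly like this; the federation name local-simulation matches the example project, so adjust it if yours differs:

    [tool.flwr.federations]
    default = "local-simulation"

    [tool.flwr.federations.local-simulation]
    options.num-supernodes = 4

The app is then started from the project directory with the flwr run command.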
