Too many pings and one client always disconnects #4300
Comments
Did you come to a solution?
Hello, I am still encountering this problem, and it occurs quite randomly. A few things have helped me reduce the frequency of this issue:
Hi @ajulyav, thanks for raising this. Are you still experiencing this issue?
Hi @ajulyav, could you please paste the code that you used to produce this error?
@WilliamLindskog Hello! I'll try to send you the code, though it's not really code-specific. What I've noticed is that even with the same code/dataset, running it on multiple nodes can cause issues after a few rounds (or sometimes just one round). However, running it on a single node seems to be more stable with some tricks. I'll try to come back with more details, including the code and sbatch script. Thank you!
So, yesterday I ran a simple experiment: 4 clients, 1 server. I got this error on only 1 client, after 4 global rounds:
My training code is quite simple and identical across all clients, yet the other clients did not hit the same issue:
So I assume that the problem is not in the user code.
Hi @ajulyav, from what I can see, this code is based on a no-longer-supported version of Flower. Have you tested newer examples like https://github.com/adap/flower/tree/main/examples/quickstart-pytorch? You can reproduce your set-up by changing the number of supernodes in
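For reference, in recent Flower quickstart examples the number of clients in a local simulation is set via the federation config in `pyproject.toml`. A sketch based on the linked quickstart (federation name and exact fields may differ by Flower version):

```toml
# Hypothetical federation config matching the 4-client set-up described above
[tool.flwr.federations.local-simulation]
options.num-supernodes = 4
```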
Describe the bug
I've got my gRPC server settings as:
but it does not help.
Later, I added two options:
This allowed me to escape the initial error, but then I got:
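For context, the "too many pings" GOAWAY is governed by gRPC's keepalive enforcement. A minimal sketch of the channel options typically involved in this error (the option names are standard gRPC arguments, but the values here are illustrative assumptions, not the settings used above):

```python
# Sketch of gRPC channel options related to "too_many_pings" GOAWAYs.
# Option names are standard gRPC channel arguments; values are illustrative.
GRPC_OPTIONS = [
    # How often the client sends HTTP/2 keepalive pings (ms).
    ("grpc.keepalive_time_ms", 60_000),
    # How long to wait for a ping ack before closing the transport (ms).
    ("grpc.keepalive_timeout_ms", 20_000),
    # Allow keepalive pings even when there are no active RPCs.
    ("grpc.keepalive_permit_without_calls", 1),
    # Max pings allowed without data frames before the server sends
    # GOAWAY ("too_many_pings"); 0 disables the limit.
    ("grpc.http2.max_pings_without_data", 0),
]
```

A mismatch between the client's ping frequency and the server's enforcement settings (e.g. a client pinging faster than the server permits) is a common cause of this GOAWAY.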
Steps/Code to Reproduce
I use the basic FedAvg strategy, except that I send an additional round of evaluation to each client during `aggregate_fit`:

```python
EvaluateRes = client_proxy.evaluate(ins=evaluate_ins, timeout=None, group_id=rnd)
```
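The pattern of triggering an extra evaluation from inside aggregation can be sketched without the Flower internals as follows; `ClientStub` and `SimpleFedAvg` are hypothetical stand-ins for Flower's `ClientProxy` and strategy classes, not real Flower APIs:

```python
# Hypothetical stand-ins illustrating "evaluate every client during
# aggregation"; real code would call Flower's ClientProxy.evaluate(...).

class ClientStub:
    """Pretend client that returns a fixed evaluation loss."""

    def __init__(self, loss):
        self._loss = loss

    def evaluate(self, ins=None, timeout=None, group_id=None):
        # A real client would run local evaluation here.
        return {"loss": self._loss}


class SimpleFedAvg:
    """Toy strategy: after aggregation, re-evaluate on every client."""

    def aggregate_fit(self, rnd, clients):
        # Mirrors client_proxy.evaluate(ins=..., timeout=None, group_id=rnd)
        results = [c.evaluate(timeout=None, group_id=rnd) for c in clients]
        return sum(r["loss"] for r in results) / len(results)


# Usage: average loss over two stub clients (~0.3 for losses 0.2 and 0.4)
avg = SimpleFedAvg().aggregate_fit(rnd=1, clients=[ClientStub(0.2), ClientStub(0.4)])
```

Note that `timeout=None` waits indefinitely on a slow or disconnected client; passing a finite timeout is one way to keep a single bad client from stalling the round.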
Sometimes when I rerun the clients and server, the error happens after 1 successful round, so it does not always happen at the same moment.

Expected Results
Client stays alive
Actual Results
Client disconnects