
Scaling issues #156

Open · 6 of 8 tasks
eigenraven opened this issue Oct 14, 2021 · 3 comments
Comments

eigenraven (Collaborator) commented Oct 14, 2021

This is a grab-bag of issues encountered when scaling Faasm and Faabric to many requests per second.

Solutions to most of these problems are implemented in this fork: https://github.com/auto-ndp/faabric

There is a corresponding issue in Faasm: faasm/faasm#504

  • Executor initialization runs synchronously on the scheduler thread before the child executor thread is spawned, blocking the scheduler. It can instead be moved into the executor thread itself, just before the thread loop (see the first sketch after this list).
  • ZeroMQ socket explosion - for each Executor we open several sockets to handle the different client/server pairs, so we open n_executors * n_ports_per_executor sockets in total, which hits system-wide limits once there are many Executors. Instead we could multiplex more (or all) calls through a single socket. Because Executors run on separate threads, it would not be possible to cache the clients (0MQ sockets cannot be shared between threads). AFAICT Distributed Tests for MPI #260 should address this for MPI, which was the main culprit.
  • ZeroMQ lacks thread safety - there are a few places where the transport code is (potentially unnecessarily) complex because ZeroMQ sockets are not thread-safe. nng, the successor to nanomsg (itself a ZeroMQ successor), could be used as a thread-safe replacement. (Done in ZeroMQ->NNG change for thread-safe sockets #286)
  • The current HTTP endpoint implementation can quickly become a bottleneck when serving many external requests. An alternative async HTTP implementation based on Boost.Beast has worked well and can handle thousands of concurrent connections per worker thread. (Done in Asio+Beast-based endpoint server #274)
  • Using Redis to return function results is unnecessary when the call is synchronous, as the result only needs to reach the calling host. This can be done with a direct ZeroMQ message, similar to how thread results are handled (sketch below).
  • Using Redis for discovery may also be unnecessary if we run a proxy somewhere in the cluster, or use some form of broadcast (In-house host membership accounting #300).
  • Executor shutdown doesn't clean up resources; it just moves the Executor to a vector of dead Executors. This is done to avoid some kind of deadlock (according to the comments), but it causes a serious memory leak if the Executors load a lot of data. There should be a way to prune these dead Executors, based either on time or on the overall function lifecycle (sketch below). Scheduler-controlled executor shutdown #252
  • Execution does not time out, even if the calling client does. A full fix is difficult: we would need some sort of monitoring thread to kill long-running tasks, and it would be impossible to tell whether a task had hung or was just genuinely long-running. Instead, we could add a check to the scheduler that skips any task whose timeout has already elapsed before execution starts (sketch below). This would avoid the traffic-jam problem, but not solve the original lack of a timeout.
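
A minimal sketch of the initialization change from the first item, using illustrative names (`Executor`, `initialise`, `threadPoolLoop`) rather than the actual Faabric API: the scheduler only pays for a thread spawn, and the expensive setup runs on the worker thread before its task loop.

```cpp
#include <thread>

// Illustrative sketch only, not the actual Faabric Executor class.
class Executor
{
  public:
    void start()
    {
        // Cheap from the scheduler's point of view: just spawn the thread.
        workerThread = std::thread([this] {
            initialise();     // expensive setup now runs on the worker thread
            threadPoolLoop(); // then enter the normal task loop
        });
    }

    ~Executor()
    {
        if (workerThread.joinable()) {
            workerThread.join();
        }
    }

  private:
    void initialise() { /* e.g. load module state, restore snapshots */ }
    void threadPoolLoop() { /* pull and execute tasks until shutdown */ }

    std::thread workerThread;
};
```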
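For the Redis result-return item, a rough sketch of the direct approach using cppzmq; the port and the serialised payload format are assumptions, not the actual Faabric wire protocol.

```cpp
#include <string>
#include <zmq.hpp>

// Push a synchronous call's result straight back to the calling host,
// bypassing Redis entirely. Port 8006 is an assumed, illustrative port.
void returnFunctionResult(zmq::context_t& ctx,
                          const std::string& callingHost,
                          const std::string& serialisedResult)
{
    zmq::socket_t sock(ctx, zmq::socket_type::push);
    sock.connect("tcp://" + callingHost + ":8006");

    zmq::message_t msg(serialisedResult.data(), serialisedResult.size());
    if (!sock.send(msg, zmq::send_flags::none)) {
        // Would only fail with EAGAIN on a non-blocking send
    }
}
```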
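For the dead-Executor item, one possible shape of time-based pruning; names and data structures are illustrative (the real vector and its locking live in the Scheduler).

```cpp
#include <algorithm>
#include <chrono>
#include <memory>
#include <mutex>
#include <vector>

struct Executor; // stand-in for the real class

// Record when each executor was retired so it can be freed later.
struct DeadExecutor
{
    std::shared_ptr<Executor> executor;
    std::chrono::steady_clock::time_point diedAt;
};

std::mutex deadMx;
std::vector<DeadExecutor> deadExecutors;

// Called periodically (or on function lifecycle events) to release
// executors that have been dead longer than the grace period.
void pruneDeadExecutors(std::chrono::seconds gracePeriod)
{
    auto now = std::chrono::steady_clock::now();
    std::lock_guard<std::mutex> lock(deadMx);
    deadExecutors.erase(
      std::remove_if(deadExecutors.begin(),
                     deadExecutors.end(),
                     [&](const DeadExecutor& d) {
                         return now - d.diedAt > gracePeriod;
                     }),
      deadExecutors.end());
}
```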
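And for the last item, the proposed pre-execution check is essentially a one-liner; the field names (`timestampMs`, `timeoutMs`) are assumptions about what the call message carries.

```cpp
#include <chrono>
#include <cstdint>

// Skip a queued call whose timeout elapsed before execution even started:
// the client has given up, so running it would only add to the traffic jam.
bool shouldSkipCall(int64_t timestampMs, int64_t timeoutMs)
{
    auto nowMs = std::chrono::duration_cast<std::chrono::milliseconds>(
                   std::chrono::system_clock::now().time_since_epoch())
                   .count();
    return timeoutMs > 0 && (nowMs - timestampMs) > timeoutMs;
}
```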

Shillaker (Collaborator) commented Oct 14, 2021

Awesome, thanks @eigenraven. I'll rearrange the raw text above and split it between this issue and faasm/faasm#504 for the Faasm-specific stuff.

eigenraven (Collaborator, Author) commented

Thinking about the protobuf inefficiencies: now that gRPC is gone, I don't think there's any reason not to convert everything to flatbuffers at this point.

Shillaker (Collaborator) commented Oct 14, 2021

Yes, this was the ultimate aim when we started using FB, as it would remove many of the serialisation issues (all the TODOs sprinkled about the place regarding copies). My gut feel is that this would be non-trivial, but it could be done object-by-object (leaving the big Message class till last).
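
A rough sketch of what one object-by-object step might look like: a small FlatBuffers table (shown here as a comment; it would live in a .fbs schema compiled by flatc) plus the zero-copy build. `FunctionResult` and its fields are invented for illustration, not an existing Faabric type.

```cpp
// Illustrative schema (message.fbs), compiled with flatc --cpp:
//
//   table FunctionResult {
//     id: int;
//     return_value: int;
//     output_data: [ubyte];
//   }
//   root_type FunctionResult;

#include <cstdint>
#include <vector>
#include <flatbuffers/flatbuffers.h>
#include "message_generated.h" // generated from the schema above

void buildResult(int32_t callId,
                 int32_t returnValue,
                 const std::vector<uint8_t>& outputBytes)
{
    flatbuffers::FlatBufferBuilder builder;
    auto output = builder.CreateVector(outputBytes);
    auto result = CreateFunctionResult(builder, callId, returnValue, output);
    builder.Finish(result);

    // builder.GetBufferPointer() / builder.GetSize() give a contiguous
    // buffer that can be handed straight to the transport with no copy,
    // which is exactly the class of copies the protobuf TODOs are about.
}
```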

Shillaker changed the title from "Faasm/Faabric/WAVM scaling issues" to "Scaling issues" on Oct 14, 2021
eigenraven added commits to auto-ndp/faabric that referenced this issue (two on Oct 15, one on Oct 18, one on Oct 29, 2021)