
Scaling issues #156

Open · 6 of 8 tasks
eigenraven opened this issue Oct 14, 2021 · 3 comments
Comments

eigenraven (Collaborator) commented Oct 14, 2021

This is a grab-bag of issues encountered when scaling Faasm and Faabric to many requests per second.

Solutions to most of these problems are implemented in this fork: https://github.com/auto-ndp/faabric

There is a corresponding issue in Faasm: faasm/faasm#504

  • Executor initialization runs synchronously on the scheduler thread before the child executor thread is spawned, blocking the scheduler. It can instead be moved into the executor thread itself, just before the thread loop (see the first sketch after this list).
  • ZeroMQ socket explosion - for each Executor we open several sockets to handle the different client/server pairs, so we open n_executors * n_ports_per_executor sockets in total, which hits system-wide limits once there are many Executors. Instead we could multiplex more (or all) calls through a single socket. Because Executors run on separate threads, it would not be possible to cache the clients (0MQ sockets cannot be shared between threads). AFAICT Distributed Tests for MPI #260 should address this for MPI, which was the main culprit.
  • ZeroMQ lacks thread safety - there are a few places where the transport code is (potentially unnecessarily) complex because ZeroMQ sockets are not thread-safe. nng, the successor to nanomsg (itself a ZeroMQ successor), could be used as a thread-safe replacement. (Done in ZeroMQ->NNG change for thread-safe sockets #286)
  • The current HTTP endpoint implementation can quickly become a bottleneck when serving many external requests. An alternative async HTTP implementation based on Boost.Beast has worked well and can handle thousands of concurrent connections per worker thread. (Done in Asio+Beast-based endpoint server #274)
  • Using Redis to return function results is unnecessary when the call is synchronous, as the result only needs to reach the calling host. This can be done with a direct ZeroMQ message, similar to how thread results are handled (sketch below).
  • Using Redis for discovery may also be unnecessary if we run a proxy somewhere in the cluster, or use some form of broadcast (In-house host membership accounting #300).
  • Executor shutdown doesn't clean up resources; it just moves the Executor to a vector of dead Executors. This is done to avoid some kind of deadlock (according to the comments), but it causes a serious memory leak if the Executors load a lot of data. There should be a way to prune these dead Executors, based either on time or on the overall function lifecycle (sketch below). Scheduler-controlled executor shutdown #252
  • Execution does not time out, even if the calling client does. A full fix is difficult: we would need some sort of monitoring thread to kill long-running tasks, and it would be impossible to tell whether a task had hung or was just genuinely long-running. Instead, we could add a check to the scheduler that skips any task whose timeout has already elapsed before execution starts (sketch below). This would avoid the traffic-jam problem, but not solve the original lack of a timeout.
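
A minimal sketch of the initialization change from the first item, using illustrative names (`Executor`, `initialise`, `threadPoolLoop`) rather than the actual Faabric API: the scheduler only pays for a thread spawn, and the expensive setup runs on the worker thread before its task loop.

```cpp
#include <thread>

// Illustrative sketch only, not the actual Faabric Executor class.
class Executor
{
  public:
    void start()
    {
        // Cheap from the scheduler's point of view: just spawn the thread.
        workerThread = std::thread([this] {
            initialise();     // expensive setup now runs on the worker thread
            threadPoolLoop(); // then enter the normal task loop
        });
    }

    ~Executor()
    {
        if (workerThread.joinable()) {
            workerThread.join();
        }
    }

  private:
    void initialise() { /* e.g. load module state, restore snapshots */ }
    void threadPoolLoop() { /* pull and execute tasks until shutdown */ }

    std::thread workerThread;
};
```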
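For the Redis result-return item, a rough sketch of the direct approach using cppzmq; the port and the serialised payload format are assumptions, not the actual Faabric wire protocol.

```cpp
#include <string>
#include <zmq.hpp>

// Push a synchronous call's result straight back to the calling host,
// bypassing Redis entirely. Port 8006 is an assumed, illustrative port.
void returnFunctionResult(zmq::context_t& ctx,
                          const std::string& callingHost,
                          const std::string& serialisedResult)
{
    zmq::socket_t sock(ctx, zmq::socket_type::push);
    sock.connect("tcp://" + callingHost + ":8006");

    zmq::message_t msg(serialisedResult.data(), serialisedResult.size());
    if (!sock.send(msg, zmq::send_flags::none)) {
        // Would only fail with EAGAIN on a non-blocking send
    }
}
```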
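For the dead-Executor item, one possible shape of time-based pruning; names and data structures are illustrative (the real vector and its locking live in the Scheduler).

```cpp
#include <algorithm>
#include <chrono>
#include <memory>
#include <mutex>
#include <vector>

struct Executor; // stand-in for the real class

// Record when each executor was retired so it can be freed later.
struct DeadExecutor
{
    std::shared_ptr<Executor> executor;
    std::chrono::steady_clock::time_point diedAt;
};

std::mutex deadMx;
std::vector<DeadExecutor> deadExecutors;

// Called periodically (or on function lifecycle events) to release
// executors that have been dead longer than the grace period.
void pruneDeadExecutors(std::chrono::seconds gracePeriod)
{
    auto now = std::chrono::steady_clock::now();
    std::lock_guard<std::mutex> lock(deadMx);
    deadExecutors.erase(
      std::remove_if(deadExecutors.begin(),
                     deadExecutors.end(),
                     [&](const DeadExecutor& d) {
                         return now - d.diedAt > gracePeriod;
                     }),
      deadExecutors.end());
}
```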
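And for the last item, the proposed pre-execution check is essentially a one-liner; the field names (`timestampMs`, `timeoutMs`) are assumptions about what the call message carries.

```cpp
#include <chrono>
#include <cstdint>

// Skip a queued call whose timeout elapsed before execution even started:
// the client has given up, so running it would only add to the traffic jam.
bool shouldSkipCall(int64_t timestampMs, int64_t timeoutMs)
{
    auto nowMs = std::chrono::duration_cast<std::chrono::milliseconds>(
                   std::chrono::system_clock::now().time_since_epoch())
                   .count();
    return timeoutMs > 0 && (nowMs - timestampMs) > timeoutMs;
}
```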

Shillaker (Collaborator) commented Oct 14, 2021

Awesome, thanks @eigenraven. I'll rearrange the raw text above and split it between this issue and faasm/faasm#504 for the Faasm-specific stuff.

eigenraven (Collaborator, Author) commented

Thinking about the protobuf inefficiencies: now that gRPC is gone, I don't think there's any reason not to convert everything to flatbuffers at this point.

Shillaker (Collaborator) commented Oct 14, 2021

Yes, this was the ultimate aim when we started using FB, as it would remove many of the serialisation issues (all the TODOs sprinkled about the place regarding copies). My gut feel is that this would be non-trivial, but it could be done object-by-object (leaving the big Message class till last).
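
A rough sketch of what one object-by-object step might look like: a small FlatBuffers table (shown here as a comment; it would live in a .fbs schema compiled by flatc) plus the zero-copy build. `FunctionResult` and its fields are invented for illustration, not an existing Faabric type.

```cpp
// Illustrative schema (message.fbs), compiled with flatc --cpp:
//
//   table FunctionResult {
//     id: int;
//     return_value: int;
//     output_data: [ubyte];
//   }
//   root_type FunctionResult;

#include <cstdint>
#include <vector>
#include <flatbuffers/flatbuffers.h>
#include "message_generated.h" // generated from the schema above

void buildResult(int32_t callId,
                 int32_t returnValue,
                 const std::vector<uint8_t>& outputBytes)
{
    flatbuffers::FlatBufferBuilder builder;
    auto output = builder.CreateVector(outputBytes);
    auto result = CreateFunctionResult(builder, callId, returnValue, output);
    builder.Finish(result);

    // builder.GetBufferPointer() / builder.GetSize() give a contiguous
    // buffer that can be handed straight to the transport with no copy,
    // which is exactly the class of copies the protobuf TODOs are about.
}
```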

Shillaker changed the title from "Faasm/Faabric/WAVM scaling issues" to "Scaling issues" on Oct 14, 2021
eigenraven added commits to auto-ndp/faabric that referenced this issue (two on Oct 15, one on Oct 18, one on Oct 29, 2021)