Distributed sharding and cache agnosticism #317

Closed
khionu opened this issue Sep 2, 2021 · 3 comments

Comments

@khionu
Contributor

khionu commented Sep 2, 2021

With Elixir/BEAM, we have a big advantage over most other languages: almost no need for external orchestration. There are still hurdles, but ideally those would be handled by tools like Horde or Swarm. However, there are currently blockers preventing Nostrum from running distributed.

There are three primary aspects of Nostrum that need to be extensible or replaceable in order for this to be reasonable: supervision, shard management, and caching.

Supervision

To future-proof this issue, the best suggestion I have is to delegate as much supervision to the end user as possible. This means taking the prep work out of the Application and moving it elsewhere, perhaps into a GenServer meant for coordination.

The line can be drawn at Sessions. Sessions are a fairly thin wrapper around the WS connection, so they can remain children of the Shard. The only thing I think is missing at this point is persisting enough information to enable the Session to resume after a crash. I'd recommend using an Agent that holds the increment; if we close the Session intentionally, we can set the value to nil.
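A minimal sketch of that Agent, assuming the "increment" is the last gateway sequence number; the module name, function names, and naming scheme are illustrative only:

defmodule MyBot.SessionState do
  use Agent

  # One Agent per shard, outliving the Session so a restarted Session can
  # RESUME with the stored session id and sequence number.
  def start_link(shard_id) do
    Agent.start_link(fn -> %{session_id: nil, seq: nil} end, name: name(shard_id))
  end

  def record(shard_id, session_id, seq),
    do: Agent.update(name(shard_id), fn _ -> %{session_id: session_id, seq: seq} end)

  # Intentional close: a nil value tells the next Session to IDENTIFY fresh
  # instead of attempting a RESUME.
  def clear(shard_id),
    do: Agent.update(name(shard_id), fn _ -> %{session_id: nil, seq: nil} end)

  def fetch(shard_id), do: Agent.get(name(shard_id), & &1)

  # Naming is kept simple here; in practice this is where the name-mapping
  # hook described below would plug in.
  defp name(shard_id), do: :"session_state_#{shard_id}"
end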

This migration will let the user apply whatever spawning/supervision techniques they desire. It enables Horde support, since Horde requires you to use its own DynamicSupervisor. We can still allow the user to have us supervise everything. We can also take a function that accepts our intended process name and returns the actual process name to be used, which makes supporting Swarm much simpler.
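A rough sketch of what that hook could look like from the user's side. Neither a Nostrum.Supervisor started this way nor a name_fn option exists today; this only illustrates the proposed shape, using Swarm's documented {:via, :swarm, name} registration as the target:

# Hypothetical: the user starts nostrum's tree themselves and passes a function
# that maps every name nostrum intends to register onto the name actually used,
# here rerouting all registrations through Swarm.
children = [
  {Nostrum.Supervisor, name_fn: fn intended -> {:via, :swarm, intended} end}
]

Supervisor.start_link(children, strategy: :one_for_one)

A default of name_fn: & &1 would preserve today's local naming for users who don't run distributed.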

Caching

Some people prefer per-shard caching and some prefer a global cache. Both approaches have valid reasons and trade-offs, so both should be supported in order to cover all use cases.

A cache Protocol should be written to enable swapping backends. CRUD operations aside, I think the simplest way to handle the divergence of approaches is to have a function in the Protocol that takes the shard ID and returns a PID. This lets the cache implementation decide whether each shard gets its own cache or they all share one. Note that implementations will need to use that function to resolve the process name, or they won't be able to take advantage of all the benefits described in the previous section.
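A minimal sketch of that contract, written here as a behaviour rather than a literal Protocol (Elixir protocols dispatch on a value's type, which doesn't fit a swappable backend); all module and callback names are illustrative:

defmodule MyCache.Backend do
  @moduledoc false

  # The key proposal above: the backend decides whether every shard shares one
  # cache process or each shard gets its own, purely by what it returns here.
  @callback cache_for_shard(shard_id :: non_neg_integer()) :: pid()

  # Plain CRUD callbacks, addressed at whatever cache_for_shard/1 returned.
  @callback get(cache :: pid(), key :: term()) :: {:ok, term()} | :error
  @callback put(cache :: pid(), key :: term(), value :: term()) :: :ok
  @callback delete(cache :: pid(), key :: term()) :: :ok
end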

This work should probably get adopted by #143.

Shard Management

Shard management is mostly solved by the above proposals. There is nuance in breaking out the Shard Supervisor and Connector, however. The Connector can become part of the aforementioned coordination process. To avoid blocking this new process, we can put the sleep in a Task that then uses GenServer.reply/2. The Supervisor needs to become a Protocol, converting the concrete module into a series of helper functions and a use macro. This should allow people to compose their own shard management with whatever rules they wish, or pick our standard implementation via the use macro.
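A minimal sketch of that Task-based reply, assuming a calculate_delay/2 helper on the coordination process:

def handle_call({:ratelimit, bucket_pattern}, from, state) do
  {delay, state} = calculate_delay(bucket_pattern, state)

  # Sleep in a throwaway Task so the coordination process stays responsive,
  # then answer the caller from the Task via GenServer.reply/2.
  Task.start(fn ->
    Process.sleep(delay)
    GenServer.reply(from, :ok)
  end)

  {:noreply, state}
end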

@khionu
Contributor Author

khionu commented Sep 2, 2021

Alternative for the Connector, rough concept:

def handle_call({:ratelimit, bucket_pattern}, {to, tag} = _from, state) do
  {delay, state} = calculate_delay(bucket_pattern, state)

  # Reply once the delay has elapsed, without blocking this process: send the
  # caller the same {tag, reply} message GenServer.reply/2 would send. This is
  # how it's done internally in GenServers [1]
  Process.send_after(to, {tag, :ok}, delay)

  {:noreply, state}
end

[1]

@khionu
Contributor Author

khionu commented Sep 16, 2021

Summarizing from DMs with @jchristgit:

  • Ratelimiting needs some form of distribution; a single process with failovers on each node should work (a rough sketch of the failover idea follows this list)
    • Assuming multiple availability zones in the same region, latency should be fine
    • Passive updates on a regular frequency should be sufficient; any lost updates can be recovered by handling the resulting 429s
  • Extra emphasis on providing a default root Supervisor for all of Nostrum
    • Ergonomic overhead of these changes should be minimal
    • Documentation of things that need to be supervised if not using default is a must
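
A rough sketch of the "single process with failovers on each node" idea using :global registration. The module and the registered name are hypothetical; the winning node's process is merely a stand-in for where the actual ratelimiter would run:

defmodule MyBot.RatelimiterFailover do
  @moduledoc false
  use GenServer

  # Illustrative cluster-wide name; every node runs one of these, but only the
  # one that wins :global registration acts as the ratelimiter owner.
  @name :nostrum_ratelimiter

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(_opts), do: {:ok, try_takeover(%{})}

  @impl true
  def handle_info({:DOWN, _ref, :process, _pid, _reason}, state) do
    # The current owner died; race the other standbys to take over.
    {:noreply, try_takeover(state)}
  end

  defp try_takeover(state) do
    case :global.register_name(@name, self()) do
      :yes ->
        # We are now the cluster-wide owner; the real ratelimiter logic would
        # live here or be started and linked from here.
        Map.put(state, :role, :owner)

      :no ->
        case :global.whereis_name(@name) do
          pid when is_pid(pid) ->
            # Someone else owns the name; watch them so we can fail over later.
            Process.monitor(pid)
            Map.put(state, :role, :standby)

          :undefined ->
            # Owner vanished between the two calls; try again.
            try_takeover(state)
        end
    end
  end
end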

Next steps:

Work on supervisors and shards is still to be finalized.

@jchristgit
Collaborator

Distributed caching and cache agnosticism have been implemented, with the cache being held as generic as possible and nostrum performing internal query operations cache-agnostically via QLC. For the distributed sharding, which is planned to be the headlining feature for 1.0, I will make a separate issue.
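
For illustration only (not nostrum's actual internals), this is roughly what a cache-agnostic query via QLC looks like against a plain ETS table standing in for a cache backend:

# Stand-in cache: an ETS table of {id, guild} pairs.
tab = :ets.new(:example_guild_cache, [:set, :public])
:ets.insert(tab, {123, %{id: 123, name: "Example"}})

# Build a QLC handle from an Erlang query string, binding Tab to our table,
# then evaluate it. Any backend that can expose a QLC table works the same way.
handle =
  :qlc.string_to_handle(
    ~c"[Guild || {_Id, Guild} <- ets:table(Tab)].",
    [],
    [{:Tab, tab}]
  )

:qlc.e(handle)
#=> [%{id: 123, name: "Example"}]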

@jchristgit closed this as not planned on Apr 19, 2024