Distributed sharding and cache agnosticism #317

Closed
khionu opened this issue Sep 2, 2021 · 3 comments

Comments

@khionu
Contributor

khionu commented Sep 2, 2021

With Elixir/BEAM, we have a big advantage over most other languages: almost no need for external orchestration. There are still hurdles, but ideally those would be handled by tools like Horde or Swarm. However, there are currently blockers preventing Nostrum from running distributed.

There are three primary aspects of Nostrum that need to be extensible or replaceable in order for this to be reasonable: supervision, shard management, and caching.

Supervision

To future-proof this issue, the best suggestion I have is to delegate as much supervision to the end user as possible. This means taking the prep work out of the Application and moving it elsewhere, perhaps into a GenServer meant for coordination.

The line can be drawn at Sessions. Sessions are a fairly thin wrapper around the WS connection, so they can remain children of the Shard. The only thing I think is missing at this point is persisting enough information to enable the Session to resume after a crash. I'd recommend using an Agent that holds the increment; if we close the Session intentionally, we can set the value to nil.
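A minimal sketch of that Agent, assuming the "increment" is the last gateway sequence number; the module name, function names, and naming scheme are illustrative only:

defmodule MyBot.SessionState do
  use Agent

  # One Agent per shard, outliving the Session so a restarted Session can
  # RESUME with the stored session id and sequence number.
  def start_link(shard_id) do
    Agent.start_link(fn -> %{session_id: nil, seq: nil} end, name: name(shard_id))
  end

  def record(shard_id, session_id, seq),
    do: Agent.update(name(shard_id), fn _ -> %{session_id: session_id, seq: seq} end)

  # Intentional close: a nil value tells the next Session to IDENTIFY fresh
  # instead of attempting a RESUME.
  def clear(shard_id),
    do: Agent.update(name(shard_id), fn _ -> %{session_id: nil, seq: nil} end)

  def fetch(shard_id), do: Agent.get(name(shard_id), & &1)

  # Naming is kept simple here; in practice this is where the name-mapping
  # hook described below would plug in.
  defp name(shard_id), do: :"session_state_#{shard_id}"
end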

This migration will let the user apply whatever spawning/supervision techniques they desire. It enables Horde support, since Horde requires you to use its own DynamicSupervisor. We can still allow the user to have us supervise everything. We can also take a function that accepts our intended process name and returns the actual process name to be used, which makes supporting Swarm much simpler.
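A rough sketch of what that hook could look like from the user's side. Neither a Nostrum.Supervisor started this way nor a name_fn option exists today; this only illustrates the proposed shape, using Swarm's documented {:via, :swarm, name} registration as the target:

# Hypothetical: the user starts nostrum's tree themselves and passes a function
# that maps every name nostrum intends to register onto the name actually used,
# here rerouting all registrations through Swarm.
children = [
  {Nostrum.Supervisor, name_fn: fn intended -> {:via, :swarm, intended} end}
]

Supervisor.start_link(children, strategy: :one_for_one)

A default of name_fn: & &1 would preserve today's local naming for users who don't run distributed.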

Caching

Some people prefer per-shard caching and some prefer a global cache. Both approaches have valid reasons and trade-offs, so both should be supported in order to cover all use cases.

A cache Protocol should be written to enable swapping backends. CRUD operations aside, I think the simplest way to handle the divergence of approaches is to have a function in the Protocol that takes the shard ID and returns a PID. This lets the cache implementation decide whether each shard gets its own cache or they all share one. Note that implementations will need to use that function to resolve the process name, or they won't be able to take advantage of all the benefits described in the previous section.
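A minimal sketch of that contract, written here as a behaviour rather than a literal Protocol (Elixir protocols dispatch on a value's type, which doesn't fit a swappable backend); all module and callback names are illustrative:

defmodule MyCache.Backend do
  @moduledoc false

  # The key proposal above: the backend decides whether every shard shares one
  # cache process or each shard gets its own, purely by what it returns here.
  @callback cache_for_shard(shard_id :: non_neg_integer()) :: pid()

  # Plain CRUD callbacks, addressed at whatever cache_for_shard/1 returned.
  @callback get(cache :: pid(), key :: term()) :: {:ok, term()} | :error
  @callback put(cache :: pid(), key :: term(), value :: term()) :: :ok
  @callback delete(cache :: pid(), key :: term()) :: :ok
end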

This work should probably get adopted by #143.

Shard Management

Shard management is mostly solved by the above proposals. There is nuance in breaking out the Shard Supervisor and Connector, however. The Connector can become part of the aforementioned coordination process. To avoid blocking this new process, we can put the sleep in a Task that then uses GenServer.reply/2. The Supervisor needs to become a Protocol, converting the concrete module into a series of helper functions and a use macro. This should allow people to compose their own shard management with whatever rules they wish, or pick our standard implementation via the use macro.
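A minimal sketch of that Task-based reply, assuming a calculate_delay/2 helper on the coordination process:

def handle_call({:ratelimit, bucket_pattern}, from, state) do
  {delay, state} = calculate_delay(bucket_pattern, state)

  # Sleep in a throwaway Task so the coordination process stays responsive,
  # then answer the caller from the Task via GenServer.reply/2.
  Task.start(fn ->
    Process.sleep(delay)
    GenServer.reply(from, :ok)
  end)

  {:noreply, state}
end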

@khionu
Contributor Author

khionu commented Sep 2, 2021

Alternative for the Connector, rough concept:

def handle_call({:ratelimit, bucket_pattern}, {to, tag} = _from, state) do
  {delay, state} = calculate_delay(bucket_pattern, state)

  # Reply once the delay has elapsed, without blocking this process: send the
  # caller the same {tag, reply} message GenServer.reply/2 would send. This is
  # how it's done internally in GenServers [1]
  Process.send_after(to, {tag, :ok}, delay)

  {:noreply, state}
end

[1]

@khionu
Contributor Author

khionu commented Sep 16, 2021

Summarizing from DMs with @jchristgit:

  • Ratelimiting needs some form of distribution; a single process with failovers on each node should work (a rough sketch of the failover idea follows this list)
    • Assuming multiple availability zones in the same region, latency should be fine
    • Passive updates on a regular frequency should be sufficient; any lost updates can be recovered by handling the resulting 429s
  • Extra emphasis on providing a default root Supervisor for all of Nostrum
    • Ergonomic overhead of these changes should be minimal
    • Documentation of things that need to be supervised if not using default is a must
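
A rough sketch of the "single process with failovers on each node" idea using :global registration. The module and the registered name are hypothetical; the winning node's process is merely a stand-in for where the actual ratelimiter would run:

defmodule MyBot.RatelimiterFailover do
  @moduledoc false
  use GenServer

  # Illustrative cluster-wide name; every node runs one of these, but only the
  # one that wins :global registration acts as the ratelimiter owner.
  @name :nostrum_ratelimiter

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(_opts), do: {:ok, try_takeover(%{})}

  @impl true
  def handle_info({:DOWN, _ref, :process, _pid, _reason}, state) do
    # The current owner died; race the other standbys to take over.
    {:noreply, try_takeover(state)}
  end

  defp try_takeover(state) do
    case :global.register_name(@name, self()) do
      :yes ->
        # We are now the cluster-wide owner; the real ratelimiter logic would
        # live here or be started and linked from here.
        Map.put(state, :role, :owner)

      :no ->
        case :global.whereis_name(@name) do
          pid when is_pid(pid) ->
            # Someone else owns the name; watch them so we can fail over later.
            Process.monitor(pid)
            Map.put(state, :role, :standby)

          :undefined ->
            # Owner vanished between the two calls; try again.
            try_takeover(state)
        end
    end
  end
end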

Next steps:

Work on supervisors and shards is still to be finalized.

@jchristgit
Collaborator

Distributed caching and cache agnosticism have been implemented, with the cache being held as generic as possible and nostrum performing internal query operations cache-agnostically via QLC. For the distributed sharding, which is planned to be the headlining feature for 1.0, I will make a separate issue.
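
For illustration only (not nostrum's actual internals), this is roughly what a cache-agnostic query via QLC looks like against a plain ETS table standing in for a cache backend:

# Stand-in cache: an ETS table of {id, guild} pairs.
tab = :ets.new(:example_guild_cache, [:set, :public])
:ets.insert(tab, {123, %{id: 123, name: "Example"}})

# Build a QLC handle from an Erlang query string, binding Tab to our table,
# then evaluate it. Any backend that can expose a QLC table works the same way.
handle =
  :qlc.string_to_handle(
    ~c"[Guild || {_Id, Guild} <- ets:table(Tab)].",
    [],
    [{:Tab, tab}]
  )

:qlc.e(handle)
#=> [%{id: 123, name: "Example"}]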

@jchristgit closed this as not planned on Apr 19, 2024