Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ratelimiter seemingly ending up in a bad state after HTTP 502 response #533

Open
recursiveGecko opened this issue Mar 7, 2024 · 2 comments
Assignees

Comments

@recursiveGecko
Copy link

recursiveGecko commented Mar 7, 2024

Hi, I've run into an odd issue where Nostrum.Api.create_message call never returns.

Here's a relevant log message from the library itself that happens just before this issue occurs:

[warning] No ratelimits received on bucket /channels/xxxxxxxxx/messages, likely due to a server error. Holding off request queue pipelining until next client request.

Here's a message from my application, immediately following it. This indicates that this particular Api call returned:

[error] Failed to post Discord message: %Nostrum.Error.ApiError{status_code: 502, response: "error code: 502"}

But then the following Nostrum.Api.create_message call from the same process seems to hang in this state indefinitely:

iex(f1bot@3ee06c9c3ce1)21> Process.whereis(F1Bot.Output.Discord) |> Process.info(:current_stacktrace)
{:current_stacktrace,
 [
   {:gen, :do_call, 4, [file: 'gen.erl', line: 240]},
   {:gen_statem, :call, 3, [file: 'gen_statem.erl', line: 923]},
   {Nostrum.Api, :create_message, 2, [file: 'lib/nostrum/api.ex', line: 284]},
   {F1Bot.ExternalApi.Discord.Live, :"-post_message/1-fun-0-", 3,
    [file: 'lib/f1_bot/external_api/discord/live.ex', line: 23]},
   {Enum, :"-reduce/3-lists^foldl/2-0-", 3, [file: 'lib/enum.ex', line: 2468]},
   {F1Bot.ExternalApi.Discord.Live, :post_message, 1,
    [file: 'lib/f1_bot/external_api/discord/live.ex', line: 22]},
   {F1Bot.Output.Discord, :handle_info, 2,
    [file: 'lib/f1_bot/output/discord.ex', line: 213]},
   {:gen_server, :try_handle_info, 3, [file: 'gen_server.erl', line: 1095]}
 ]}

...While Ratelimiter doesn't seem to be doing any work:

iex(f1bot@3ee06c9c3ce1)24> Process.whereis(Nostrum.Api.Ratelimiter) |> Process.info(:current_stacktrace)
{:current_stacktrace,
 [
   {:gen_statem, :loop_receive, 3, [file: 'gen_statem.erl', line: 1250]},
   {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 241]}
 ]}

iex(f1bot@3ee06c9c3ce1)27> Process.whereis(Nostrum.Api.Ratelimiter) |> :sys.get_state
{:disconnected,
 %{
   running: %{},
   remaining_in_window: 50,
   outstanding: %{},
   conn: nil,
   inflight: %{}
 }}

...but Nostrum.Api.create_message calls from an attached console continue to work just fine:

iex(f1bot@3ee06c9c3ce1)25> Nostrum.Api.create_message 0123456789, "test"
{:ok,
 %Nostrum.Struct.Message{
   activity: nil,
   application: nil,
   application_id: nil,
   attachments: [],
   # ...
  }
}

For context, F1Bot.Output.Discord is a long-running process that receives messages from PubSub and publishes them as Discord messages, but it ends up blocked when Nostrum.Api doesn't return.

I'm using Nostrum from commit 4fabfc5bf59878fdde118acd686f6a5e075b5f8e due to #522. Happy to provide more info!

@recursiveGecko
Copy link
Author

recursiveGecko commented Mar 8, 2024

Here are results of my production trace (Rexbug.start(["Nostrum.Api :: return", "Nostrum.Api.Ratelimiter :: return"])) with every function call and return value logged, so it should be possible to follow through the internal state of the state machine.

nostrum-bug-trace-redacted.log

Line 3166 is where the send_message function call gets stuck. A lot of the information has been redacted, but hopefully in a way that doesn't impede understanding.

I'm using Elixir 1.14 with OTP 26. I hope this helps!

@jchristgit
Copy link
Collaborator

First of all, sorry for the long delay in getting back to you. I've read this
multiple times now but unfortunately I'm still unsure about the actual issue.

The original issue seems mostly clear to me, and I believe there might be a bug
in there. Specifically the only way I see how that can happen is if the
ratelimiter would not go into :disconnected state before the second request
is sent, and the request is then somehow silently dropped in the ratelimiter. I
can't really see why so far, so I looked into the log you attached.

This is where things get a bit confusing to me. You say line 3166 is where it
gets stuck - do you mean the call in IEx or the outstanding call to channel
aaaaaaaa? Because for the outstanding request to create a message in channel
aaaaaa, it just always sits at :initial, which means nostrum is still
waiting for a reply, and for a long time judging by the length of your bug
trace. I see two possibilities for that to happen:

  1. gun never sends us a reply: this shouldn't happen, but we should
    probably add a reasonably long timeout to outbound requests to exclude the
    possibility. There is also a ticket upstream at Connection process stopped handling requests ninenines/gun#334, which I
    wonder if it's related.

  2. There is a bug in the ratelimiter for handling of :initial requests:
    Certainly possible, but I don't see how so far, since it's a special atom
    value we match against and we do integer arithmetic to deal with things
    which would crash...

But what confuses me is that the issue in your log and the issue you originally
opened seemed to be two separate problems - or am I misunderstanding it?

Either way, I don't think it will solve this problem, I think we should maybe
introduce some reasonably long timeouts on all outbound requests, and maybe set
up a debug monitor for them somehow. Such that we can say, "this thing was sent
but never finished". @jb3 would be happy for your thoughts on this as well.

By the way, thanks for the super useful bug trace. FWIW, if you need to debug
this again, the ratelimiter also has :sys.trace support, which might yield
some other information.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Bugs
Development

No branches or pull requests

2 participants