
OutOfMemory on API call to get Access Token. #4362

Open · ismaelagra opened this issue Mar 21, 2025 · 9 comments

@ismaelagra

Hello,

We have an ejabberd instance with 1,590 potential users, but on average approximately 750 users are connected.

We want to implement automated tasks that send messages to MUC groups through the API. When we call the API to request the access token, the request freezes and the server gets an out-of-memory error from Erlang:

Crash dump is being written to: /opt/ejabberd/logs/erl_crash_YYYYMMDD-hhmmss.dump...heart:: Erlang is crashing .. (waiting for crash dump file)
heart:  Waiting for dump - timeout set to 60 seconds.
done
[os_mon] memory supervisor port (memsup): Erlang has closed
heart: Would reboot. Terminating.
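
For context, the request that freezes is the access-token call against mod_http_api (the oauth_issue_token command). A minimal sketch of such a call follows; the host, port, account, password, and scope are placeholders, not our real values, and the exact argument names and scopes format of oauth_issue_token can differ between ejabberd versions (see "ejabberdctl help oauth_issue_token"):

# Sketch only: all values below are placeholders; adjust to your listener and account.
curl -k -X POST 'https://xmpp.example.com:5443/api/oauth_issue_token' \
     -u 'api_user@example.com:PASSWORD' \
     -H 'Content-Type: application/json' \
     -d '{"jid": "api_user@example.com", "ttl": 3600, "scopes": "status_list"}'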

I used the https://github.com/ferd/recon/blob/master/script/erl_crashdump_analyzer.sh script to analyze the dump and got this:


Slogan: eheap_alloc: Cannot allocate 20310580872 bytes of memory (of type "heap").

Memory:
===
  processes: 18108 Mb
  processes_used: 18108 Mb
  system: 159 Mb
  atom: 1 Mb
  atom_used: 1 Mb
  binary: 87 Mb
  code: 23 Mb
  ets: 21 Mb
  ---
  total: 18268 Mb

Different message queue lengths (5 largest different):
===
      1 36
      1 31
      1 25
      2 2
      2 1

Error logger queue length:
===

File descriptors open:
===
  UDP:  0
  TCP:  227
  Files:  0
  ---
  Total:  227

Number of processes:
===
855

Processes Heap+Stack memory sizes (words) used in the VM (5 largest different):
===
      1 850246556
     59 196650
      8 121536
      7 75113
     45 46422

Processes OldHeap memory sizes (words) used in the VM (5 largest different):
===
      1 1469226048
      2 999631
      2 833026
      3 514838
     45 318187

Process States when crashing (sum): 
===
      1 CONNECTED
      1 CONNECTED|BINARY_IO
     46 CONNECTED|BINARY_IO|PORT_LOCK
      1 CONNECTED|BINARY_IO|SOFT_EOF|PORT_LOCK
    214 CONNECTED|PORT_LOCK
      1 Current Process Garbing
      1 Current Process Internal ACT_PRIO_NORMAL | USR_PRIO_NORMAL | PRQ_PRIO_NORMAL | ACTIVE | GC | DIRTY_ACTIVE_SYS | DIRTY_RUNNING_SYS
      1 Garbing
      1 Internal ACT_PRIO_HIGH | USR_PRIO_HIGH | PRQ_PRIO_HIGH
      3 Internal ACT_PRIO_HIGH | USR_PRIO_HIGH | PRQ_PRIO_HIGH | OFF_HEAP_MSGQ
      1 Internal ACT_PRIO_LOW | USR_PRIO_LOW | PRQ_PRIO_LOW
      5 Internal ACT_PRIO_MAX | USR_PRIO_MAX | PRQ_PRIO_MAX
      1 Internal ACT_PRIO_MAX | USR_PRIO_MAX | PRQ_PRIO_MAX | OFF_HEAP_MSGQ
    830 Internal ACT_PRIO_NORMAL | USR_PRIO_NORMAL | PRQ_PRIO_NORMAL
      1 Internal ACT_PRIO_NORMAL | USR_PRIO_NORMAL | PRQ_PRIO_NORMAL | ACTIVE | GC | DIRTY_ACTIVE_SYS | DIRTY_RUNNING_SYS
      2 Internal ACT_PRIO_NORMAL | USR_PRIO_NORMAL | PRQ_PRIO_NORMAL | IN_PRQ_NORMAL | ACTIVE | IN_RUNQ
      1 Internal ACT_PRIO_NORMAL | USR_PRIO_NORMAL | PRQ_PRIO_NORMAL | IN_PRQ_NORMAL | MSG_SIG_IN_Q | ACTIVE | IN_RUNQ | NMSG_SIG_IN_Q | ACTIVE_SYS
     10 Internal ACT_PRIO_NORMAL | USR_PRIO_NORMAL | PRQ_PRIO_NORMAL | OFF_HEAP_MSGQ
      3 Scheduled
    851 Waiting

We have a scheduled task every Saturday that extracts some information through the API and it works fine, so we thought it was a capacity problem. But we increased memory from 24 GB to 32 GB and got the same problem; we also tried 48 GB with the same result.

At this moment our configuration is:

ejabberd version: 23.10
CPU: 30 cores
RAM: 32 GB

Is our instance undersized, or is there a bug in this ejabberd version? We couldn't find any information about it.

@prefiks
Member

prefiks commented Mar 21, 2025

Could you open that crash log in an editor, look for the line containing 'Stack+heap: 850246556', and then copy here the lines starting from the '=proc:' just before that line up to the next '=proc:'?
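
If it helps, a rough awk sketch like the one below should print that whole =proc: section in one go (the dump path is the one from your log; 850246556 is the Stack+heap value from the analyzer output):

awk '
  /^=proc:/ { if (hit) { printf "%s", sec; exit }   # next section reached: print the matching one
              sec = "" }                            # otherwise start buffering a new section
  { sec = sec $0 ORS }                              # accumulate the current =proc: section
  $0 == "Stack+heap: 850246556" { hit = 1 }         # this is the oversized process
  END { if (hit) printf "%s", sec }                 # in case it is the last section in the dump
' /opt/ejabberd/logs/erl_crash_YYYYMMDD-hhmmss.dump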

@ismaelagra
Author

Here is the requested info:

=proc:<0.1175.0>
State: Garbing
Spawned as: proc_lib:init_p/5
Last scheduled in for: acl:match_acl/3
Spawned by: <0.1002.0>
Message queue length: 0
Number of heap fragments: 0
Heap fragment data: 0
Link list: [#Port<0.248>, <0.1002.0>]
Reductions: 751065605
Stack+heap: 850246556
OldHeap: 1469226048
Heap unused: 129
OldHeap unused: 24443340
BinVHeap: 0
OldBinVHeap: 76
BinVHeap unused: 46422
OldBinVHeap unused: 46346
Memory: 18555782936
New heap start: 7FF8E7DD6CB0
New heap top: 7FF8E7DD6CE0
Stack top: 7FF8E7DD6E28
Stack end: 7FF8E7DD70E8
Old heap start: 7FF567882028
Old heap top: 7FF8187533C8
Old heap end: 7FF8241D0228
Program counter: 0x00007ff90f4136c8 (acl:match_acl/3 + 48)
Internal State: ACT_PRIO_NORMAL | USR_PRIO_NORMAL | PRQ_PRIO_NORMAL | ACTIVE | GC | DIRTY_ACTIVE_SYS | DIRTY_RUNNING_SYS
=proc:<0.1176.0>
State: Waiting
Spawned as: proc_lib:init_p/5
Spawned by: <0.1001.0>
Message queue length: 0
Number of heap fragments: 0
Heap fragment data: 24
Link list: [<0.1001.0>, #Port<0.249>]
Reductions: 33506108
Stack+heap: 121536
OldHeap: 0
Heap unused: 52430
OldHeap unused: 0
BinVHeap: 1118
OldBinVHeap: 0
BinVHeap unused: 45304
OldBinVHeap unused: 46422
Memory: 973504
Program counter: 0x00007ff90f3b69bc (p1_server:collect_messages/3 + 724)
arity = 0
Internal State: ACT_PRIO_NORMAL | USR_PRIO_NORMAL | PRQ_PRIO_NORMAL
=proc:<0.1177.0>

@prefiks
Member

prefiks commented Mar 21, 2025

Could you also please copy-paste the section for =proc:<0.1002.0>?

@ismaelagra
Author

Sure, here it is:

=proc:<0.1002.0>
State: Waiting
Name: ejabberd_http_sup
Spawned as: proc_lib:init_p/5
Spawned by: <0.609.0>
Message queue length: 0
Number of heap fragments: 0
Heap fragment data: 0
Link list: [<0.1175.0>, <0.609.0>]
Reductions: 138
Stack+heap: 233
OldHeap: 0
Heap unused: 30
OldHeap unused: 0
BinVHeap: 0
OldBinVHeap: 0
BinVHeap unused: 46422
OldBinVHeap unused: 46422
Memory: 2800
Program counter: 0x00007ff90ec6ab44 (gen_server:loop/7 + 564)
arity = 0
Internal State: ACT_PRIO_NORMAL | USR_PRIO_NORMAL | PRQ_PRIO_NORMAL
=proc:<0.1003.0>
State: Waiting
Spawned as: proc_lib:init_p/5
Spawned by: <0.1001.0>
Message queue length: 0
Number of heap fragments: 0
Heap fragment data: 943
Link list: [#Port<0.54>, <0.1001.0>]
Reductions: 1993842
Stack+heap: 196650
OldHeap: 0
Heap unused: 112916
OldHeap unused: 0
BinVHeap: 29840
OldBinVHeap: 0
BinVHeap unused: 16582
OldBinVHeap unused: 75110
Memory: 1581768
Program counter: 0x00007ff90f3b69bc (p1_server:collect_messages/3 + 724)
arity = 0
Internal State: ACT_PRIO_NORMAL | USR_PRIO_NORMAL | PRQ_PRIO_NORMAL

@prefiks
Member

prefiks commented Mar 21, 2025

Hm, so it looks like the process that overflows memory is one spawned to handle an HTTP API call, and it possibly did that while trying to check whether you have permission to call that command. Quite frankly, I've never seen anything like that.

It could be something causing a stack overflow. Could you check whether there is a '=proc_stack:<0.1175.0>' section in the crash file and copy-paste it here if there is? (If it's not super long; if it is, please copy just the initial part.)

Would you be able to share your ejabberd.yml config? Or at least the acl:, access_rules:, and api_permissions: parts of it?

@ismaelagra
Author

The section for =proc_stack:<0.1175.0> is (with the heap part removed):

=proc_stack:<0.1175.0>
0x00007ff8e7dd6e28:SReturn addr 0xECE179C (lists:any/2 + 188)
y0:N
y1:H7FF567882028
0x00007ff8e7dd6e40:SReturn addr 0xECE14EC (lists:all/2 + 188)
y0:N
y1:H7FF567882060
0x00007ff8e7dd6e58:SReturn addr 0xF414624 (acl:match_rules/4 + 188)
y0:A5:allow
y1:N
y2:A4:deny
y3:H7FF567882098
y4:A6:global
0x00007ff8e7dd6e88:SReturn addr 0xF41CA18 (ejabberd_access_permissions:'-matches_definition/6-fun-2-'/4 + 608)
0x00007ff8e7dd6e90:SReturn addr 0xECE18F4 (lists:any_1/2 + 164)
y0:H7FF5678820C8
y1:H7FF5678820D8
0x00007ff8e7dd6ea8:SReturn addr 0xF41CFC4 (ejabberd_access_permissions:'-can_access/2-fun-0-'/7 + 196)
y0:H7FF567882118
y1:H7FF567882098
y2:A11:oauth_issue_token
0x00007ff8e7dd6ec8:SReturn addr 0xECE1FEC (lists:foldl_1/3 + 164)
y0:H7FF567882130
y1:H7FF567882140
0x00007ff8e7dd6ee0:SReturn addr 0xF418C3C (ejabberd_access_permissions:can_access/2 + 564)
y0:N
y1:N
y2:N
y3:H7FF567882098
y4:A11:oauth_issue_token
0x00007ff8e7dd6f10:SReturn addr 0xFB7ABD8 (ejabberd_commands:execute_command2/4 + 176)
y0:H7FF567882190
y1:H7FF5678821A0
0x00007ff8e7dd6f28:SReturn addr 0xF44C114 (mod_http_api:handle2/4 + 324)
y0:N
y1:I0
y2:N
y3:H7FF567882098
y4:A11:oauth_issue_token
0x00007ff8e7dd6f58:SReturn addr 0xF44B794 (mod_http_api:handle/4 + 228)
y0:N
y1:H7FF567882248
y2:N
y3:A11:oauth_issue_token
y4:SCatch 0xF44B7BE (mod_http_api:handle/4 + 270)
0x00007ff8e7dd6f88:SReturn addr 0xF44B234 (mod_http_api:perform_call/4 + 388)
y0:A11:oauth_issue_token
0x00007ff8e7dd6f98:SReturn addr 0xF44A870 (mod_http_api:process/2 + 744)
y0:N
y1:N
y2:N
y3:N
y4:N
y5:SCatch 0xF44A89A (mod_http_api:process/2 + 786)
0x00007ff8e7dd6fd0:SReturn addr 0xF43DEB8 (ejabberd_http:process/2 + 1160)
y0:N
y1:N
y2:H7FF567882258
y3:H7FF567882268
y4:SCatch 0xF43DED3 (ejabberd_http:process/2 + 1187)
0x00007ff8e7dd7000:SReturn addr 0xF43F08C (ejabberd_http:process_request/1 + 1820)
y0:N
y1:N
y2:N
y3:N
y4:H7FF567882300
y5:N
y6:N
y7:N
y8:H7FF567882300
0x00007ff8e7dd7050:SReturn addr 0xF43D088 (ejabberd_http:process_header/2 + 3944)
y0:N
y1:A8:fast_tls
y2:H7FF567882300
0x00007ff8e7dd7070:SReturn addr 0xF43C030 (ejabberd_http:parse_headers/1 + 552)
y0:N
y1:N
0x00007ff8e7dd7088:SReturn addr 0xF43AFB0 (ejabberd_http:init/3 + 1576)
y0:N
y1:N
y2:N
y3:N
y4:H7FF5678823B8
y5:SCatch 0xF43AFDA (ejabberd_http:init/3 + 1618)
0x00007ff8e7dd70c0:SReturn addr 0xEC36DAC (proc_lib:init_p_do_apply/3 + 220)
y0:N
y1:N
y2:SCatch 0xEC36DD6 (proc_lib:init_p_do_apply/3 + 262)
0x00007ff8e7dd70e0:SReturn addr 0xEB87198 (<terminate process normally>)
=proc_heap:<0.1175.0>

The ACL for the API is defined using a shared roster group (mod_shared_roster):

acl:
...
  api_users:
    shared_group:
      - "[email protected]"

access_rules:
  local:
    allow: local
  c2s:
    deny: blocked
    allow: all
  announce:
    allow: admin
  configure:
    allow: admin
  muc_create:
    allow: local
  pubsub_createnode:
    allow: local
  trusted_network:
    allow: loopback

api_permissions:
...
  "scheduled tasks":
    from:
      - mod_http_api
    who:
      ip:
        - <scheduler-host-ip-1>
        - <scheduler-host-ip-2>
      access:
        allow:
          - acl: api_users
      oauth:
        scope:
          - "status_list"
          - "oauth_issue_token"
        access:
          allow:
            - acl: api_users
    what:
      - "status_list"
      - "oauth_issue_token"
      - "send_message"
      - "set_room_affiliation"

@prefiks
Member

prefiks commented Mar 21, 2025

The only thing there that could possibly have that kind of impact is the use of shared_group in the acl rule: depending on how that group is defined, fetching data about it could somehow trigger allocation of a large amount of data. I would try switching that to an explicit list of users and see if it helps. The other possible cause of excess allocations would be the API call itself carrying a lot of data in the request, but I think that would cause memory problems earlier in the process's lifetime.
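
For example, just as a sketch (the JIDs below are placeholders), the acl could list the API accounts explicitly instead of matching the shared group:

acl:
  api_users:
    user:
      - "scheduler1@your.domain"
      - "scheduler2@your.domain"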

@ismaelagra
Author

We have the same configuration on a staging server and we don't get that error there; that's why I think it could be a capacity problem. We only have the problem in production; the same config on staging works fine. What is the approximate sizing for a server with an average of 750 connected users? Are the resources we configured adequate?

@prefiks
Member

prefiks commented Mar 21, 2025

As I said, I've never seen anything like that before, and I think your current setup should in general be OK. But shared rosters can have an impact on bigger installations: they have rules that can use "all users registered on the server" or "all currently online users", and as you can guess, with a larger user base this can become an issue. That's why I mentioned shared_roster. The fact that it works on the staging server, which I guess has fewer users, may not be directly comparable.
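
One thing worth checking is how that shared roster group is actually defined, for example with the srg_* commands (assuming mod_admin_extra is enabled; the group and host names below are placeholders). Depending on the ejabberd version, the special "@all@" / "@online@" markers may show up in the group info or in its member list:

# Placeholders: replace the host and group name with your own.
ejabberdctl srg_list xmpp.example.com
ejabberdctl srg_get_info api_group xmpp.example.com
ejabberdctl srg_get_members api_group xmpp.example.com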
