To deal with sn-miner's OOM #230

Open
jing-git opened this issue Apr 20, 2023 · 5 comments
Labels
BDT bucky data transfer protocol Performance about performance issues SN SN Server task This is a task

Comments

@jing-git
Collaborator

jing-git commented Apr 20, 2023

During the sn-call stress test (#164), simulating 1200 client requests each time, I found that sn-miner exited intermittently after multiple attempts. By checking the process status with 'top' and reviewing the system logs, I found that this was caused by the kernel OOM (Out Of Memory) killer terminating sn-miner:
kernel: 0 pages HighMem/MovableOnly
kernel: 38158 pages reserved
kernel: 0 pages cma reserved
kernel: 0 pages hwpoisoned
kernel: [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
kernel: [ 253] 0 253 126024 3216 1044480 0 0 systemd-journal
kernel: [ 281] 0 281 8599 413 114688 0 -1000 systemd-udevd
kernel: [ 496] 100 496 20010 168 180224 0 0 systemd-network
kernel: [ 497] 101 497 17656 162 184320 0 0 systemd-resolve
kernel: [ 516] 102 516 65761 486 167936 0 0 rsyslogd
kernel: [ 518] 0 518 7085 52 102400 0 0 atd
kernel: [15962] 0 15962 5656 388 94208 0 0 bash
kernel: [16317] 0 16317 1985850 854368 14987264 0 0 sn-miner-rust
kernel: [16359] 0 16359 10381 105 122880 0 0 top
kernel: Out of memory: Kill process 16317 (sn-miner-rust) score 855 or sacrifice child
kernel: Killed process 16317 (sn-miner-rust) total-vm:7943400kB, anon-rss:3417472kB, file-rss:0kB, shmem-rss:0kB
kernel: oom_reaper: reaped process 16317 (sn-miner-rust), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
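
To read the task table above, note that the total_vm and rss columns are page counts, while the "Killed process" line reports kB. A minimal sketch of the conversion, assuming a 4 KiB page size (an assumption, though it agrees with the figures in this log); the function names are made up for illustration:

```rust
/// Assumed page size in kB. 4 KiB is typical on x86_64 Linux and
/// is consistent with the numbers in the log above.
const PAGE_SIZE_KB: u64 = 4;

/// Converts a page count from the OOM task table (total_vm, rss) into kB.
fn pages_to_kb(pages: u64) -> u64 {
    pages * PAGE_SIZE_KB
}

/// Converts kB into GiB for easier reading.
fn kb_to_gib(kb: u64) -> f64 {
    kb as f64 / (1024.0 * 1024.0)
}
```

With these, the rss of 854368 pages for sn-miner-rust converts to 3417472 kB (about 3.26 GiB), which matches the anon-rss value on the "Killed process" line, and the total_vm of 1985850 pages converts to 7943400 kB.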

So I want to identify the cause of the OOM issue and fix it :)

@jing-git jing-git added task This is a task BDT bucky data transfer protocol labels Apr 20, 2023
@lurenpluto
Member

Regarding the SN OOM in the stress test, we can analyze it from the following perspectives:

1. Theoretical memory usage

Starting from SN's internal implementation, look at the cache design of the ping/call logic and work out the theoretical memory usage for a given amount of data, using key indicators such as:

  • Number of devices, e.g. 100,000 peers
  • The ping interval of a single device, e.g. 15s per device
  • Call frequency of a single device (calls to the same device and to different devices may need to be treated differently), e.g. one call request per second (different devices)

This gives the theoretical memory usage for a fixed scenario. From it we can infer the maximum number of devices the SN service can host for a given amount of server memory, which is a key indicator for the SN server.
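
The estimate described above could be sketched roughly as follows. Every name and every per-entry cost here is hypothetical and not taken from the SN code; the real values would have to be measured against the actual ping and call structures. The sketch also assumes that each cached call package stays resident for a fixed lifetime.

```rust
/// Workload parameters, following the key indicators listed above.
struct Workload {
    peers: u64,
    ping_interval_secs: u64,
    calls_per_peer_per_sec: u64,
}

/// Hypothetical per-entry costs; placeholders to be measured in the real server.
struct Costs {
    bytes_per_peer: u64,
    bytes_per_cached_call: u64,
    call_cache_ttl_secs: u64,
}

/// Ping packets the server must handle per second (integer division).
/// `ping_interval_secs` must be non-zero.
fn pings_per_sec(w: &Workload) -> u64 {
    w.peers / w.ping_interval_secs
}

/// Steady-state bytes attributable to one peer: its table entry plus
/// the call packages it has in the cache at any moment.
fn bytes_per_peer_total(w: &Workload, c: &Costs) -> u64 {
    c.bytes_per_peer
        + w.calls_per_peer_per_sec * c.call_cache_ttl_secs * c.bytes_per_cached_call
}

/// Theoretical memory usage for the whole workload.
fn estimated_bytes(w: &Workload, c: &Costs) -> u64 {
    w.peers * bytes_per_peer_total(w, c)
}

/// Upper limit of peers that fit in a memory budget.
/// The per-peer cost must be non-zero.
fn max_peers(memory_budget_bytes: u64, w: &Workload, c: &Costs) -> u64 {
    memory_budget_bytes / bytes_per_peer_total(w, c)
}
```

If the real per-peer cost is dominated by the call cache term, then the cache lifetime is the first parameter worth examining.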

2. Whether there is a memory leak

This can be checked in two ways:

  • Compare the theoretical memory usage with the actual memory usage
    If they do not match, there may be a logic problem: either the theoretical value is wrong or the implementation has a bug
  • Check whether memory usage falls back
    Using the key indicators in 1, check whether previously occupied memory is released after the number of devices/pings/calls drops. If it is not released, there may be a memory leak. For the release speed, check whether it matches the theoretical value (this mainly depends on the cache expiration times in the design)
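
A minimal sketch of how these two checks might be automated on Linux, by sampling the VmRSS field of /proc/self/status. The helper names and tolerances are made up for illustration. Note that an allocator may hold on to freed memory rather than return it to the OS, so an RSS that does not fall back is a hint to investigate, not proof of a leak.

```rust
/// Extracts the `VmRSS` value in kB from the text of `/proc/<pid>/status`.
fn parse_vm_rss_kb(status: &str) -> Option<u64> {
    status
        .lines()
        .find_map(|line| line.strip_prefix("VmRSS:"))
        .and_then(|rest| rest.split_whitespace().next())
        .and_then(|value| value.parse().ok())
}

/// Reads the resident set size of the current process (Linux only).
fn current_rss_kb() -> Option<u64> {
    let status = std::fs::read_to_string("/proc/self/status").ok()?;
    parse_vm_rss_kb(&status)
}

/// True when the measured usage is within `tolerance_pct` percent
/// of the theoretical value.
fn matches_theory(theoretical_kb: u64, actual_kb: u64, tolerance_pct: u64) -> bool {
    theoretical_kb.abs_diff(actual_kb) * 100 <= theoretical_kb * tolerance_pct
}

/// True when memory returned to within `tolerance_kb` of the idle
/// baseline after the load dropped.
fn fell_back(baseline_kb: u64, after_load_kb: u64, tolerance_kb: u64) -> bool {
    after_load_kb <= baseline_kb.saturating_add(tolerance_kb)
}
```

One way to use this is to record `current_rss_kb()` before the stress test, at peak load, and again after the cache lifetime has elapsed with no traffic, then feed the samples to the two predicates.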

@lurenpluto lurenpluto added Performance about performance issues SN SN Server labels Apr 20, 2023
@streetycat
Collaborator

The biggest difference between Call and Ping is that there is a cache for Call to resend packages to remote peer.

I think we should review the code.

@lurenpluto
Member

> The biggest difference between Call and Ping is that there is a cache for Call to resend packages to remote peer.
>
> I think we should review the code.

If the SN server caches a package for each peer's call, then in a stress test where many independent peers initiate a large number of calls, the SN server may build up a large backlog of call packages.

So we may need to add some statistics logging on the SN server side, such as printing the total number and total size of cached call packages, to assist the stress test.
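
As an illustration of the kind of statistics meant here, the sketch below keeps two atomic counters that the call cache would update on insert and removal. `CallCacheStats` and its methods are hypothetical names, not part of the existing SN code, and the summary string is only one possible log format.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Running totals for the call package cache (hypothetical type).
#[derive(Default)]
struct CallCacheStats {
    packages: AtomicU64,
    bytes: AtomicU64,
}

impl CallCacheStats {
    /// Call when a package of `package_len` bytes is added to the cache.
    fn on_insert(&self, package_len: usize) {
        self.packages.fetch_add(1, Ordering::Relaxed);
        self.bytes.fetch_add(package_len as u64, Ordering::Relaxed);
    }

    /// Call when a previously inserted package is dropped or expires.
    fn on_remove(&self, package_len: usize) {
        self.packages.fetch_sub(1, Ordering::Relaxed);
        self.bytes.fetch_sub(package_len as u64, Ordering::Relaxed);
    }

    /// One line suitable for a periodic statistics log.
    fn summary(&self) -> String {
        format!(
            "call cache: packages={} bytes={}",
            self.packages.load(Ordering::Relaxed),
            self.bytes.load(Ordering::Relaxed)
        )
    }
}
```

Printing `summary()` at a fixed interval during the stress test would show whether the backlog grows without bound or levels off once old packages are removed.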

@lurenpluto
Member

Regarding SN performance optimization, we can use this issue as an entry point to develop the SN server systematically, which will also make it easier for other people to understand the logic of our SN server.

We can start by analyzing the existing SN code, following these steps:

  • The key protocols of SN
  • The module composition and relationship of SN server
  • The data flow inside SN server
  • Theoretical analysis of SN server's memory and cpu usage
  • The first phase is based on memory usage analysis
  • Writing individual test cases to analyze and verify the key components that may have problems
  • Propose problems and suggested modifications
  • Discuss the solution and start implementation after approval

@jing-git
Collaborator Author

There are several sub-issues here, and I've opened some discussions to track them:
SN Server implementation
