I came across this issue while running batchspawner in a production environment. At scale, after spawning sessions, batchspawner uses the exec_prefix in conjunction with the batch_query_cmd to check job status, which means a new sudo session is instantiated on the login node for every status check. The problem shows up when the number of users on the server gets high (200+): the server gets overloaded with sudo sessions and can bog down on CPU or log excessively once the maximum number of sudo sessions is hit. The thing is, SLURM doesn't (always) require that the user who started a job be the one to check it, so the batch_query_cmd does not need the exec_prefix, since any user can check any other user's job status. I ended up forking the current release and modifying query_job_status by overriding it inside the SlurmSpawner class and removing the exec_prefix, which prevents a sudo session from being opened every 30 seconds for each user. This is not the ideal way to solve the problem, though. I have given a better proposal below, as well as the (hacky) alternative that I implemented while prod was slowly burning to the ground.
Proposed change
Add a JupyterHub configuration option that specifies the exec_prefix for each of the three main functions (submit_batch_script, query_job_status, cancel_batch_job), allowing admins to configure which prefix is used where, or to set it to "" to turn it off entirely.
This could be a dictionary with three keys, such as: {"submit_exec_prefix": "sudo -E -u {username}", "query_exec_prefix": "", "cancel_exec_prefix": "sudo -E -u {username}"}
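To make the idea concrete, here is a minimal sketch of how such a dictionary could be resolved into a command line. It is not batchspawner code: `build_command`, its parameters, and the fallback to a single default prefix are assumptions made for illustration, and the example commands are only placeholders.

```python
# Hypothetical helper, not part of batchspawner: picks the prefix for one
# operation from a per-operation dictionary, falling back to a global default.
DEFAULT_EXEC_PREFIX = "sudo -E -u {username}"


def build_command(operation, cmd, username, prefixes=None,
                  default_prefix=DEFAULT_EXEC_PREFIX):
    """Return `cmd` with the prefix configured for `operation` prepended.

    `operation` is one of "submit", "query" or "cancel" (assumed names that
    mirror the dictionary keys proposed above).
    """
    prefixes = prefixes or {}
    # A missing key falls back to the default; an explicit "" disables it.
    prefix = prefixes.get(f"{operation}_exec_prefix", default_prefix)
    prefix = prefix.format(username=username)
    # Skip the prefix entirely when it is empty, so no stray space is added.
    return " ".join(part for part in (prefix, cmd) if part)


if __name__ == "__main__":
    prefixes = {
        "submit_exec_prefix": "sudo -E -u {username}",
        "query_exec_prefix": "",
        "cancel_exec_prefix": "sudo -E -u {username}",
    }
    for op, cmd in [("submit", "sbatch --parsable"),
                    ("query", "squeue -h -j 42"),
                    ("cancel", "scancel 42")]:
        print(op, "->", build_command(op, cmd, "alice", prefixes))
```

With this shape, only the submit and cancel commands would go through sudo, while the periodic status query would run directly as the Hub user.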
Alternative options
Alternatively, override the base-class definition of query_job_status with a definition inside the SlurmSpawner class that does not use exec_prefix (or add an option that allows the prefix to be turned off for that function). This was my spur-of-the-moment solution earlier today.
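The shape of that workaround looks roughly like the sketch below. To keep it runnable without a cluster, it uses a stand-in base class rather than the real batchspawner one: `BaseSpawnerStandIn`, `NoPrefixQuerySpawner`, the recording `run_command`, and the simplified query command are all assumptions for illustration, and the real query_job_status does more than this (status parsing, error handling).

```python
import asyncio


class BaseSpawnerStandIn:
    """Hypothetical stand-in for the batchspawner base class (not the real one)."""

    exec_prefix = "sudo -E -u {username}"
    batch_query_cmd = "squeue -h -j {job_id}"  # simplified for illustration

    def __init__(self, username, job_id):
        self.username = username
        self.job_id = job_id
        self.commands_run = []

    async def run_command(self, cmd):
        # Record the command instead of executing it, so the sketch is safe to run.
        self.commands_run.append(cmd)
        return "RUNNING"

    async def query_job_status(self):
        # Base behaviour: every status poll is wrapped in the exec_prefix.
        cmd = " ".join((
            self.exec_prefix.format(username=self.username),
            self.batch_query_cmd.format(job_id=self.job_id),
        ))
        return await self.run_command(cmd)


class NoPrefixQuerySpawner(BaseSpawnerStandIn):
    """Overrides only the status query so that it skips the exec_prefix."""

    async def query_job_status(self):
        cmd = self.batch_query_cmd.format(job_id=self.job_id)
        return await self.run_command(cmd)


if __name__ == "__main__":
    spawner = NoPrefixQuerySpawner("alice", "42")
    asyncio.run(spawner.query_job_status())
    print(spawner.commands_run)
```

The downside of this approach is that the override has to be kept in sync with the upstream method by hand, which is why a configuration option seems preferable.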
Who would use this feature?
I believe this feature would be useful for those who need to customize which commands use the exec_prefix. I use SlurmSpawner specifically, but I'm sure this feature would be useful in the other resource manager spawners as well.
(Optional): Suggest a solution
I feel like my proposed change is the most practical solution, at least from what I can see.
Thanks for reading :)
Thank you for opening your first issue in this project! Engagement like this is essential for open source projects! 🤗
If you haven't done so already, check out Jupyter's Code of Conduct. Also, please try to follow the issue template, as it helps other community members to contribute more effectively.
You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! 👋