Commit 325aab9

Merge pull request #144 from rkdarst/dev
Integration: #143 (inc. #141, #139), #130, #86, #90
2 parents aec79ac + 7a28923

File tree

4 files changed: +136 -53 lines changed

.travis.yml

Lines changed: 1 addition & 0 deletions
@@ -22,6 +22,7 @@ before_install:
 install:
   - pip install --pre -r jupyterhub/dev-requirements.txt
   - pip install --pre -e jupyterhub
+  - pip install --pre -f travis-wheels/wheelhouse -r requirements.txt

 script:
   - travis_retry py.test --lf --cov batchspawner batchspawner/tests -v

README.md

Lines changed: 55 additions & 3 deletions
@@ -19,6 +19,7 @@ This package formerly included WrapSpawner and ProfilesSpawner, which provide me
    ```python
    c = get_config()
    c.JupyterHub.spawner_class = 'batchspawner.TorqueSpawner'
+   import batchspawner # Even though not used, needed to register batchspawner interface
    ```
 3. Depending on the spawner, additional configuration will likely be needed.

@@ -52,6 +53,7 @@ to run Jupyter notebooks on an academic supercomputer cluster.

 ```python
 # Select the Torque backend and increase the timeout since batch jobs may take time to start
+import batchspawner
 c.JupyterHub.spawner_class = 'batchspawner.TorqueSpawner'
 c.Spawner.http_timeout = 120

@@ -117,6 +119,7 @@ clusters, as well as an option to run a local notebook directly on the jupyterhu

 ```python
 # Same initial setup as the previous example
+import batchspawner
 c.JupyterHub.spawner_class = 'wrapspawner.ProfilesSpawner'
 c.Spawner.http_timeout = 120
 #------------------------------------------------------------------------------
@@ -152,29 +155,78 @@ clusters, as well as an option to run a local notebook directly on the jupyterhu
 ```


+## Debugging batchspawner
+
+Sometimes it can be hard to debug batchspawner, but it's not really hard
+once you know how the pieces interact. Check the following places for
+error messages:
+
+* Check the JupyterHub logs for errors.
+
+* Check the JupyterHub logs for the batch script that got submitted
+  and the command used to submit it. Are these correct? (Note that
+  there are submission environment variables too, which aren't
+  displayed.)
+
+* At this point, it's a matter of checking the batch system. Is the
+  job ever scheduled? Does it run? Does it succeed? Check the batch
+  system status and output of the job. The most common failure
+  patterns are a) the job never starting due to bad scheduler options,
+  b) the job waiting in the queue beyond the `start_timeout`, causing
+  JupyterHub to kill the job.
+
+* At this point the job starts. Does it fail immediately, or before
+  Jupyter starts? Check the scheduler output files (stdout/stderr of
+  the job), wherever they are stored. To debug the job script, you can
+  add debugging to the batch script, such as an `env` or `set -x`.
+
+* At this point Jupyter itself starts - check its error messages. Is
+  it starting with the right options? Can it communicate with the
+  hub? At this point there usually isn't anything
+  batchspawner-specific, with the one exception below. The error log
+  would be in the batch script output (same file as above). There may
+  also be clues in the JupyterHub logfile.
+
+Common problems:
+
+* Did you `import batchspawner` in the `jupyterhub_config.py` file?
+  This is needed in order to activate the batchspawner API in
+  JupyterHub.
+
+
 ## Changelog

-### dev (requires minimum JupyterHub 0.7.2 and Python 3.4)
+### dev (requires minimum JupyterHub 0.9 and Python 3.5)

 Added (user)

 * Add Jinja2 templating as an option for all scripts and commands. If `{{` or `{%` is used anywhere in the string, it is used as a Jinja2 template.
 * Add new option `exec_prefix`, which defaults to `sudo -E -u {username}`. This replaces the explicit `sudo` in every batch command - changes in local commands may be needed.
 * New option: `req_keepvars_extra`, which allows keeping extra variables in addition to what is defined by JupyterHub itself (addition of variables to keep instead of replacement). #99
 * Add `req_prologue` and `req_epilogue` options to scripts which are inserted before/after the main jupyterhub-singleuser command, which allow for generic setup/cleanup without overriding the entire script. #96
-* SlurmSpawner: add the `req_reservation` option. #
+* SlurmSpawner: add the `req_reservation` option. #91
+* Add basic support for JupyterHub progress updates, but this is not used much yet. #86

 Added (developer)

 * Add many more tests.
 * Add a new page `SPAWNERS.md` with information on specific spawners. Begin trying to collect a list of spawner-specific contacts. #97
+* Rename the `current_ip` and `current_port` commands to `ip` and `port`. No user impact. #139
+* Update to Python 3.5 `async` / `await` syntax to support JupyterHub progress updates. #90

 Changed

-* Update minimum requirements to JupyterHub 0.8.1 and Python 3.4.
+* PR #58 and #141 change the logic of port selection, so that it is selected *after* the singleuser server starts. This means that the port number has to be conveyed back to JupyterHub. This requires the following changes:
+  - `jupyterhub_config.py` *must* explicitly import `batchspawner`
+  - Add a new option `batchspawner_singleuser_cmd`, which is used as a wrapper in the single-user servers and conveys the remote port back to JupyterHub. This is now an integral part of the spawn process.
+  - If you have installed with `pip install -e`, you will have to re-install so that the new script `batchspawner-singleuser` is added to `$PATH`.
+* Update minimum requirements to JupyterHub 0.9 and Python 3.5. #143
 * Update the Slurm batch script. Now, the single-user notebook is run in a job step, with a wrapper of `srun`. This may need to be removed using `req_srun=''` if you don't want environment variables limited.
 * Pass the environment dictionary to the queue and cancel commands as well. This is mostly the user environment, but may be useful to these commands as well in some cases. #108, #111 If these environment variables were used for authentication as an admin, be aware that there are pre-existing security issues because they may be passed to the user via the batch submit command; see #82.

+
 Fixed

 * Improve debugging on failed submission by raising errors including error messages from the commands. #106
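The changelog's note that `jupyterhub_config.py` *must* explicitly import `batchspawner` can be illustrated with a minimal configuration sketch. This is not part of the commit; the timeout value and spawner choice are illustrative placeholders:

```python
# jupyterhub_config.py -- minimal sketch (illustrative values, not from this commit)
c = get_config()  # provided by JupyterHub when it loads this file

# The import is needed even though the name is otherwise unused: importing
# batchspawner registers its API handler with JupyterHub, which the
# batchspawner-singleuser wrapper uses to convey the remote port back.
import batchspawner  # noqa: F401

c.JupyterHub.spawner_class = 'batchspawner.TorqueSpawner'
c.Spawner.http_timeout = 120  # batch jobs may take a while to start
```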

batchspawner/batchspawner.py

Lines changed: 79 additions & 50 deletions
@@ -15,9 +15,12 @@
 * remote execution via submission of templated scripts
 * job names instead of PIDs
 """
+import asyncio
+from async_generator import async_generator, yield_, yield_from_
 import pwd
 import os
 import re
+import sys

 import xml.etree.ElementTree as ET
@@ -156,6 +159,12 @@ def _req_keepvars_default(self):
         "Must include {cmd} which will be replaced with the jupyterhub-singleuser command line."
         ).tag(config=True)

+    batchspawner_singleuser_cmd = Unicode('batchspawner-singleuser',
+        help="A wrapper which is capable of special batchspawner setup: currently sets the port on "
+             "the remote host. Does not need to be set under normal circumstances, unless the path "
+             "needs specification."
+        ).tag(config=True)
+
     # Raw output of job submission command unless overridden
     job_id = Unicode()
@@ -181,58 +190,64 @@ def parse_job_id(self, output):
         return output

     def cmd_formatted_for_batch(self):
-        return ' '.join(['batchspawner-singleuser'] + self.cmd + self.get_args())
+        """The command which is substituted inside of the batch script"""
+        return ' '.join([self.batchspawner_singleuser_cmd] + self.cmd + self.get_args())
+
+    async def run_command(self, cmd, input=None, env=None):
+        proc = await asyncio.create_subprocess_shell(cmd, env=env,
+                                                     stdin=asyncio.subprocess.PIPE,
+                                                     stdout=asyncio.subprocess.PIPE,
+                                                     stderr=asyncio.subprocess.PIPE)
+        inbytes = None

-    @gen.coroutine
-    def run_command(self, cmd, input=None, env=None):
-        proc = Subprocess(cmd, shell=True, env=env, stdin=Subprocess.STREAM, stdout=Subprocess.STREAM, stderr=Subprocess.STREAM)
-        inbytes = None
         if input:
-            inbytes = input.encode()
-        try:
-            yield proc.stdin.write(inbytes)
-        except StreamClosedError as exp:
-            # Apparently harmless
-            pass
-        proc.stdin.close()
-        out, eout = yield [proc.stdout.read_until_close(),
-                           proc.stderr.read_until_close()]
-        proc.stdout.close()
-        proc.stderr.close()
-        eout = eout.decode().strip()
+            inbytes = input.encode()
+
         try:
-            err = yield proc.wait_for_exit()
-        except CalledProcessError:
+            out, eout = await proc.communicate(input=inbytes)
+        except:
+            self.log.debug("Exception raised when trying to run command: %s" % cmd)
+            proc.kill()
+            self.log.debug("Running command failed, done kill")
+            out, eout = await proc.communicate()
+            out = out.decode().strip()
+            eout = eout.decode().strip()
             self.log.error("Subprocess returned exitcode %s" % proc.returncode)
             self.log.error('Stdout:')
             self.log.error(out)
             self.log.error('Stderr:')
             self.log.error(eout)
             raise RuntimeError('{} exit status {}: {}'.format(cmd, proc.returncode, eout))
-        if err != 0:
-            return err  # exit error?
         else:
-            out = out.decode().strip()
-            return out
+            eout = eout.decode().strip()
+            err = proc.returncode
+            if err != 0:
+                self.log.error("Subprocess returned exitcode %s" % err)
+                self.log.error(eout)
+                raise RuntimeError(eout)

-    @gen.coroutine
-    def _get_batch_script(self, **subvars):
+        out = out.decode().strip()
+        return out
+
+    async def _get_batch_script(self, **subvars):
         """Format batch script from vars"""
-        # Colud be overridden by subclasses, but mainly useful for testing
+        # Could be overridden by subclasses, but mainly useful for testing
         return format_template(self.batch_script, **subvars)

-    @gen.coroutine
-    def submit_batch_script(self):
+    async def submit_batch_script(self):
         subvars = self.get_req_subvars()
+        # `cmd` is submitted to the batch system
         cmd = ' '.join((format_template(self.exec_prefix, **subvars),
                         format_template(self.batch_submit_cmd, **subvars)))
+        # `subvars['cmd']` is what is run _inside_ the batch script,
+        # put into the template.
         subvars['cmd'] = self.cmd_formatted_for_batch()
         if hasattr(self, 'user_options'):
             subvars.update(self.user_options)
-        script = yield self._get_batch_script(**subvars)
+        script = await self._get_batch_script(**subvars)
         self.log.info('Spawner submitting job using ' + cmd)
         self.log.info('Spawner submitted script:\n' + script)
-        out = yield self.run_command(cmd, input=script, env=self.get_env())
+        out = await self.run_command(cmd, input=script, env=self.get_env())
         try:
             self.log.info('Job submitted. cmd: ' + cmd + ' output: ' + out)
             self.job_id = self.parse_job_id(out)
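The Tornado-to-asyncio rewrite of `run_command` above follows a standard pattern: `create_subprocess_shell` plus `communicate()`, raising on a nonzero exit status. A stripped-down, self-contained sketch of the same pattern (generic shell commands, no logging; this is an illustration, not batchspawner's actual method):

```python
import asyncio

async def run_command(cmd, input=None):
    """Run a shell command, optionally feeding it stdin; return stripped stdout.

    Mirrors the commit's pattern: create_subprocess_shell + communicate(),
    with a RuntimeError raised on nonzero exit status.
    """
    proc = await asyncio.create_subprocess_shell(
        cmd,
        stdin=asyncio.subprocess.PIPE,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE)
    inbytes = input.encode() if input else None
    out, eout = await proc.communicate(input=inbytes)
    if proc.returncode != 0:
        raise RuntimeError('{} exit status {}: {}'.format(
            cmd, proc.returncode, eout.decode().strip()))
    return out.decode().strip()

print(asyncio.run(run_command('echo hello')))  # -> hello
```

Note that `asyncio.run` requires Python 3.7+; the commit itself targets Python 3.5, where `loop.run_until_complete` would be used instead.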
@@ -247,8 +262,7 @@ def submit_batch_script(self):
         "and self.job_id as {job_id}."
         ).tag(config=True)

-    @gen.coroutine
-    def read_job_state(self):
+    async def read_job_state(self):
         if self.job_id is None or len(self.job_id) == 0:
             # job not running
             self.job_status = ''
@@ -259,7 +273,7 @@ def read_job_state(self):
                         format_template(self.batch_query_cmd, **subvars)))
         self.log.debug('Spawner querying job: ' + cmd)
         try:
-            out = yield self.run_command(cmd, env=self.get_env())
+            out = await self.run_command(cmd)
             self.job_status = out
         except Exception as e:
             self.log.error('Error querying job ' + self.job_id)
@@ -271,14 +285,13 @@
         help="Command to stop/cancel a previously submitted job. Formatted like batch_query_cmd."
         ).tag(config=True)

-    @gen.coroutine
-    def cancel_batch_job(self):
+    async def cancel_batch_job(self):
         subvars = self.get_req_subvars()
         subvars['job_id'] = self.job_id
         cmd = ' '.join((format_template(self.exec_prefix, **subvars),
                         format_template(self.batch_cancel_cmd, **subvars)))
         self.log.info('Cancelling job ' + self.job_id + ': ' + cmd)
-        yield self.run_command(cmd, env=self.get_env())
+        await self.run_command(cmd)

     def load_state(self, state):
         """load job_id from state"""
@@ -317,11 +330,10 @@ def state_gethost(self):
         "Return string, hostname or addr of running job, likely by parsing self.job_status"
         raise NotImplementedError("Subclass must provide implementation")

-    @gen.coroutine
-    def poll(self):
+    async def poll(self):
         """Poll the process"""
         if self.job_id is not None and len(self.job_id) > 0:
-            yield self.read_job_state()
+            await self.read_job_state()
             if self.state_isrunning() or self.state_ispending():
                 return None
             else:
@@ -337,16 +349,15 @@ def poll(self):
         help="Polling interval (seconds) to check job state during startup"
         ).tag(config=True)

-    @gen.coroutine
-    def start(self):
+    async def start(self):
         """Start the process"""
         self.ip = self.traits()['ip'].default_value
         self.port = self.traits()['port'].default_value

         if jupyterhub.version_info >= (0,8) and self.server:
             self.server.port = self.port

-        job = yield self.submit_batch_script()
+        job = await self.submit_batch_script()

         # We are called with a timeout, and if the timeout expires this function will
         # be interrupted at the next yield, and self.stop() will be called.
@@ -355,7 +366,7 @@ def start(self):
         if len(self.job_id) == 0:
             raise RuntimeError("Jupyter batch job submission failure (no jobid in output)")
         while True:
-            yield self.poll()
+            await self.poll()
             if self.state_isrunning():
                 break
             else:
@@ -367,11 +378,11 @@ def start(self):
                 raise RuntimeError('The Jupyter batch job has disappeared'
                                    ' while pending in the queue or died immediately'
                                    ' after starting.')
-            yield gen.sleep(self.startup_poll_interval)
+            await gen.sleep(self.startup_poll_interval)

         self.ip = self.state_gethost()
         while self.port == 0:
-            yield gen.sleep(self.startup_poll_interval)
+            await gen.sleep(self.startup_poll_interval)
             # Test framework: For testing, mock_port is set because we
             # don't actually run the single-user server yet.
             if hasattr(self, 'mock_port'):
@@ -388,27 +399,43 @@ def start(self):

         return self.ip, self.port

-    @gen.coroutine
-    def stop(self, now=False):
+    async def stop(self, now=False):
         """Stop the singleuser server job.

         Returns immediately after sending job cancellation command if now=True, otherwise
         tries to confirm that job is no longer running."""

         self.log.info("Stopping server job " + self.job_id)
-        yield self.cancel_batch_job()
+        await self.cancel_batch_job()
         if now:
             return
         for i in range(10):
-            yield self.poll()
+            await self.poll()
             if not self.state_isrunning():
                 return
-            yield gen.sleep(1.0)
+            await gen.sleep(1.0)
         if self.job_id:
             self.log.warn("Notebook server job {0} at {1}:{2} possibly failed to terminate".format(
                 self.job_id, self.ip, self.port)
             )

+    @async_generator
+    async def progress(self):
+        while True:
+            if self.state_ispending():
+                await yield_({
+                    "message": "Pending in queue...",
+                })
+            elif self.state_isrunning():
+                await yield_({
+                    "message": "Cluster job running... waiting to connect",
+                })
+                return
+            else:
+                await yield_({
+                    "message": "Unknown status...",
+                })
+            await gen.sleep(1)

 class BatchSpawnerRegexStates(BatchSpawnerBase):
     """Subclass of BatchSpawnerBase that uses config-supplied regular expressions
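The new `progress()` method uses the `async_generator` backport because Python 3.5 has no native async generators. On Python 3.6+, the equivalent native form looks roughly like the sketch below; `FakeSpawner` is an invented stub standing in for the real spawner's state checks:

```python
import asyncio

class FakeSpawner:
    """Invented stub: replays a fixed sequence of job states."""
    def __init__(self, states):
        self._states = iter(states)
        self._current = None

    def state_ispending(self):
        self._current = next(self._states)  # each poll advances the state
        return self._current == 'pending'

    def state_isrunning(self):
        return self._current == 'running'

    async def progress(self):
        # Native-async-generator equivalent of the commit's
        # @async_generator / yield_() version.
        while True:
            if self.state_ispending():
                yield {"message": "Pending in queue..."}
            elif self.state_isrunning():
                yield {"message": "Cluster job running... waiting to connect"}
                return
            else:
                yield {"message": "Unknown status..."}
            await asyncio.sleep(0)  # the real code sleeps 1 second

async def collect(spawner):
    # JupyterHub consumes progress() as an async iterator of event dicts.
    return [event async for event in spawner.progress()]

messages = asyncio.run(collect(FakeSpawner(['pending', 'running'])))
print([m["message"] for m in messages])
```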
@@ -612,6 +639,8 @@ class SlurmSpawner(UserEnvMixin,BatchSpawnerRegexStates):
     def parse_job_id(self, output):
         # make sure jobid is really a number
         try:
+            # use only last line to circumvent slurm bug
+            output = output.splitlines()[-1]
             id = output.split(';')[0]
             int(id)
         except Exception as e:
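The Slurm fix above (keeping only the last line of the `sbatch` output) is easy to demonstrate in isolation; the sample output string below is invented for illustration:

```python
def parse_job_id(output):
    """Extract a numeric job id from 'id;cluster'-style sbatch output.

    Only the last line is used, mirroring the commit's workaround for
    sbatch sometimes printing informational lines before the job id.
    """
    last_line = output.splitlines()[-1]
    job_id = last_line.split(';')[0]
    int(job_id)  # raises ValueError if the id is not numeric
    return job_id

# Hypothetical sbatch output with a leading informational line:
print(parse_job_id("sbatch: some informational message\n12345;cluster1"))  # -> 12345
```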

requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -1,2 +1,3 @@
+async_generator>=1.8
 jinja2
 jupyterhub>=0.5
