Commit 5cf415f

Merge branch 'develop' into main
Updating to version 1.0.4 from develop
2 parents c6e4cd4 + b741b2d commit 5cf415f

131 files changed, 11621 additions and 1297 deletions

.github/workflows/integration.yml

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ jobs:
     # available on 20.04
     runs-on: ubuntu-22.04
     steps:
-      - uses: actions/checkout@v2
+      - uses: actions/checkout@v3
       - name: Dependency Install
         run: ./ci/deps_install.sh
       - name: Slurm Setup and Install

.github/workflows/pylama.yml

Lines changed: 2 additions & 2 deletions
@@ -15,8 +15,8 @@ jobs:
     name: PyLama Lint
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@v2
+      - uses: actions/checkout@v3
       - name: Lint
         run: |
-          pip install pylama pylint 2>&1 >/dev/null
+          pip install pylama==8.4.1 pyflakes==3.0.1 pylint==2.15.9 pydocstyle==6.1.1 2>&1 >/dev/null
           pylama beeflow/

.github/workflows/unit-tests.yml

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ jobs:
     # available on 20.04
     runs-on: ubuntu-22.04
     steps:
-      - uses: actions/checkout@v2
+      - uses: actions/checkout@v3
       - name: Dependency Install
         run: ./ci/deps_install.sh
       - name: Slurm Setup and Install

.gitignore

Lines changed: 3 additions & 1 deletion
@@ -12,5 +12,7 @@ poetry.lock
 .vscode
 .env
 .venv
+docs/sphinx/_build
+src/beeflow/enhanced_client/node_modules
+**/dist
 _build
-dist

README.rst

Lines changed: 23 additions & 0 deletions
@@ -35,6 +35,29 @@ Contact
 For bugs and problems report, suggestions and other general questions regarding the BEE project, email questions to `[email protected] <[email protected]>`_.


+Contributors:
+==========================
+
+* Steven Anaya - `Boogie3D <https://github.com/Boogie3D>`_
+* Paul Bryant - `paulbry <https://github.com/paulbry>`_
+* Rusty Davis - `rstyd <https://github.com/rstyd>`_
+* Jieyang Chen - `JieyangChen7 <https://github.com/JieyangChen7>`_
+* Patricia Grubel - `pagrubel <https://github.com/pagrubel>`_
+* Qiang Guan - `guanxyz <https://github.com/guanxyz>`_
+* Ragini Gupta - `raginigupta6 <https://github.com/raginigupta6>`_
+* Andres Quan - `aquan9 <https://github.com/aquan9>`_
+* Quincy Wofford - `qwofford <https://github.com/qwofford>`_
+* Tim Randles - `trandles-lanl <https://github.com/trandles-lanl>`_
+* Jacob Tronge - `jtronge <https://github.com/jtronge>`_
+
+Concept and Design Contributors
+
+* James Ahrens
+* Allen McPherson
+* Li-Ta Lo
+* Louis Vernon
+
+
 Contributing
 ==========================

RELEASE.rst

Lines changed: 13 additions & 10 deletions
@@ -2,24 +2,27 @@ Publishing a new release
 ************************

 1. Change the version in pyproject.toml and verify docs build;
-   and get this change merged into develop.
-
+   and get this change merged into develop. (You may want to set the bypass as in step 2)
 2. On github site go to Settings; on the left under Code and Automation
    click on Branches; under Branch protection rules edit main;
    check Allow specified actors to bypass required pull requests; add yourself
-   and don'forget to save the setting
-3. Checkout main and merge develop into main. Make sure documentation will be
-   published upon push to main (.github/workflows/docs.yml) and push.
-4. Once merged, create a tag
-   with something like ``git tag -a 0.1.1 -m "BEE version 0.1.1"``. You can see
-   existing tags with ``git tag``. Finally do ``git push origin --tags`` to
-   push the new tag.
-5. Create release.
+   and don't forget to save the setting
+3 Make sure documentation will be published upon push to main.
+   See: .github/workflows/docs.yml
+4. Checkout develop and pull for latest version then
+   checkout main and merge develop into main. Verify documentation was published.
+   See actions and site below.
+5. Once merged, on github web interface create a release and tag based on main branch
+   that matches the version in pyproject.toml
 6. Follow step 2 but uncheck Allow specified actors to bypass and don't forget save
 7. Finally, on the main branch, first run a ``poetry build`` and then a
    ``poetry publish``. The second command will ask for a username and password
    for PyPI.

+Check the documentation at: `https://lanl.github.io/BEE/ <https://lanl.github.io/BEE/>`_
+Also upgrade the pip version in your python or anaconda environment and check the version:
+`` pip install --upgrade hpc-beeflow``
+
 **WARNING**: Once a version is pushed to PyPI, it cannot be undone. You can
 'delete' the version from the package settings, but you can no longer publish
 an update to that same version.

beeflow/cli.py

Lines changed: 69 additions & 23 deletions
@@ -24,6 +24,11 @@
 from beeflow.common import cli_connection


+bc.init()
+# Max number of times a component can be restarted
+MAX_RESTARTS = bc.get('DEFAULT', 'max_restarts')
+
+
 class ComponentManager:
     """Component manager class."""

@@ -40,6 +45,8 @@ def wrap(fn):
             self.components[name] = {
                 'fn': fn,
                 'deps': deps,
+                'restart_count': 0,
+                'failed': False,
             }

         return wrap
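
The ``restart_count`` and ``failed`` fields added above are the bookkeeping for the capped restart policy this commit introduces in ``poll()``. A minimal, self-contained sketch of that policy follows. ``RestartingManager``, ``crashing_component`` and the fixed ``MAX_RESTARTS = 2`` are hypothetical stand-ins; the commit itself reads the limit from the BEE configuration.

```python
import subprocess
import sys

MAX_RESTARTS = 2  # assumed value; the commit reads this from the BEE config


class RestartingManager:
    """Hypothetical, reduced stand-in for ComponentManager."""

    def __init__(self):
        self.components = {}
        self.procs = {}

    def add(self, name, fn):
        """Register a component and start it once."""
        self.components[name] = {'fn': fn, 'restart_count': 0, 'failed': False}
        self.procs[name] = fn()

    def poll(self):
        """Restart exited components until the restart cap is reached."""
        for name in list(self.procs):
            component = self.components[name]
            if component['failed']:
                continue  # already given up on this component
            if self.procs[name].poll() is None:
                continue  # still running
            if component['restart_count'] >= MAX_RESTARTS:
                component['failed'] = True
            else:
                self.procs[name] = component['fn']()
                component['restart_count'] += 1


def crashing_component():
    """Start a child process that exits immediately with an error."""
    return subprocess.Popen([sys.executable, '-c', 'raise SystemExit(1)'])


mgr = RestartingManager()
mgr.add('demo', crashing_component)
for _ in range(MAX_RESTARTS + 1):
    mgr.procs['demo'].wait()  # let the child exit so poll() sees a return code
    mgr.poll()
```

After the loop, the demo component has been restarted ``MAX_RESTARTS`` times and is then marked as failed, so later polls skip it.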
@@ -87,12 +94,24 @@ def run(self, base_components):
             self.procs[name] = component['fn']()

     def poll(self):
-        """Poll each process to check for errors."""
-        for name, proc in self.procs.items():
-            returncode = proc.poll()
+        """Poll each process to check for errors, restart failed processes."""
+        for name in self.procs:  # noqa no need to iterate with items() since self.procs may be set
+            component = self.components[name]
+            if component['failed']:
+                continue
+            returncode = self.procs[name].poll()
             if returncode is not None:
                 log = log_fname(name)
                 print(f'Component "{name}" failed, check log "{log}"')
+                if component['restart_count'] >= MAX_RESTARTS:
+                    print(f'Component "{name}" has been restarted {MAX_RESTARTS} '
+                          'times, not restarting again')
+                    component['failed'] = True
+                else:
+                    restart_count = component['restart_count']
+                    print(f'Attempting restart {restart_count} of "{name}"...')
+                    self.procs[name] = component['fn']()
+                    component['restart_count'] += 1

     def status(self):
         """Return the statuses for each process in a dict."""
@@ -140,6 +159,12 @@ def open_log(component):
     return open(log, 'a', encoding='utf-8')


+# Slurmrestd will be started only if we're running with Slurm and
+# slurm::use_commands is not True
+NEED_SLURMRESTD = (bc.get('DEFAULT', 'workload_scheduler') == 'Slurm'
+                   and not bc.get('slurm', 'use_commands'))
+
+
 @MGR.component('wf_manager', ('scheduler',))
 def start_wfm():
     """Start the WFM."""
@@ -149,7 +174,12 @@ def start_wfm():
         sock_path, stdout=fp, stderr=fp)


-@MGR.component('task_manager', ('slurmrestd',))
+TM_DEPS = []
+if NEED_SLURMRESTD:
+    TM_DEPS.append('slurmrestd')
+
+
+@MGR.component('task_manager', TM_DEPS)
 def start_task_manager():
     """Start the TM."""
     fp = open_log('task_manager')
@@ -168,21 +198,22 @@ def start_scheduler():


 # Workflow manager and task manager need to be opened with PIPE for their stdout/stderr
-@MGR.component('slurmrestd')
-def start_slurm_restd():
-    """Start BEESlurmRestD. Returns a Popen process object."""
-    bee_workdir = bc.get('DEFAULT', 'bee_workdir')
-    slurmrestd_log = '/'.join([bee_workdir, 'logs', 'restd.log'])
-    slurm_socket = bc.get('slurmrestd', 'slurm_socket')
-    slurm_args = bc.get('slurmrestd', 'slurm_args')
-    slurm_args = slurm_args if slurm_args is not None else ''
-    subprocess.run(['rm', '-f', slurm_socket], check=True)
-    # log.info("Attempting to open socket: {}".format(slurm_socket))
-    fp = open(slurmrestd_log, 'w', encoding='utf-8') # noqa
-    cmd = ['slurmrestd']
-    cmd.extend(slurm_args.split())
-    cmd.append(f'unix:{slurm_socket}')
-    return subprocess.Popen(cmd, stdout=fp, stderr=fp)
+if NEED_SLURMRESTD:
+    @MGR.component('slurmrestd')
+    def start_slurm_restd():
+        """Start BEESlurmRestD. Returns a Popen process object."""
+        bee_workdir = bc.get('DEFAULT', 'bee_workdir')
+        slurmrestd_log = '/'.join([bee_workdir, 'logs', 'restd.log'])
+        slurm_socket = bc.get('slurm', 'slurmrestd_socket')
+        openapi_version = bc.get('slurm', 'openapi_version')
+        slurm_args = f'-s openapi/{openapi_version}'
+        subprocess.run(['rm', '-f', slurm_socket], check=True)
+        # log.info("Attempting to open socket: {}".format(slurm_socket))
+        fp = open(slurmrestd_log, 'w', encoding='utf-8') # noqa
+        cmd = ['slurmrestd']
+        cmd.extend(slurm_args.split())
+        cmd.append(f'unix:{slurm_socket}')
+        return subprocess.Popen(cmd, stdout=fp, stderr=fp)


 def handle_terminate(signum, stack):  # noqa
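
The hunks above make the ``slurmrestd`` component, and the task manager's dependency on it, conditional on ``NEED_SLURMRESTD``. The pattern can be sketched in isolation as follows. ``Registry`` and ``build`` are hypothetical stand-ins for the real component manager, and the component functions return strings instead of starting processes.

```python
class Registry:
    """Hypothetical, reduced stand-in for the component manager."""

    def __init__(self):
        self.components = {}

    def component(self, name, deps=()):
        """Return a decorator that records fn under name with its deps."""
        def wrap(fn):
            self.components[name] = {'fn': fn, 'deps': tuple(deps)}
            return fn
        return wrap


def build(need_slurmrestd):
    """Register components, including slurmrestd only when it is needed."""
    mgr = Registry()
    tm_deps = []
    if need_slurmrestd:
        tm_deps.append('slurmrestd')

        @mgr.component('slurmrestd')
        def start_slurm_restd():
            return 'slurmrestd started'

    @mgr.component('task_manager', tm_deps)
    def start_task_manager():
        return 'task manager started'

    return mgr


with_restd = build(need_slurmrestd=True)
without_restd = build(need_slurmrestd=False)
```

Building the dependency list before the decorator runs keeps the graph consistent: the task manager never names a dependency that was not registered.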
@@ -192,7 +223,7 @@ def handle_terminate(signum, stack): # noqa
     sys.exit(1)


-MIN_CHARLIECLOUD_VERSION = (0, 27)
+MIN_CHARLIECLOUD_VERSION = (0, 32)


 def version_str(version):
@@ -288,13 +319,21 @@ def start(foreground: bool = typer.Option(False, '--foreground', '-F',
     beeflow_log = log_fname('beeflow')
     check_dependencies()
     sock_path = bc.get('DEFAULT', 'beeflow_socket')
+    if bc.get('DEFAULT', 'workload_scheduler') == 'Slurm' and not NEED_SLURMRESTD:
+        warn('Not using slurmrestd. Command-line interface will be used.')
     # Note: there is a possible race condition here, however unlikely
     if os.path.exists(sock_path):
         # Try to contact for a status
-        resp = cli_connection.send(sock_path, {'type': 'status'})
+        try:
+            resp = cli_connection.send(sock_path, {'type': 'status'})
+        except (ConnectionResetError, ConnectionRefusedError):
+            resp = None
         if resp is None:
             # Must be dead, so remove the socket path
-            os.remove(sock_path)
+            try:
+                os.remove(sock_path)
+            except FileNotFoundError:
+                pass
         else:
             # It's already running, so print an error and exit
             warn(f'Beeflow appears to be running. Check the beeflow log: "{beeflow_log}"')
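
The hunk above treats a refused or reset connection as "daemon is dead" and tolerates the socket file disappearing before removal. A minimal sketch of the same idea using only the standard library follows. It assumes a plain Unix stream socket and probes it with ``connect()`` rather than BEE's ``cli_connection.send``; ``socket_is_live`` and ``clear_stale_socket`` are hypothetical helper names.

```python
import os
import socket
import tempfile


def socket_is_live(sock_path):
    """Return True if something accepts connections on the Unix socket."""
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
            client.connect(sock_path)
        return True
    except (ConnectionResetError, ConnectionRefusedError, FileNotFoundError):
        return False


def clear_stale_socket(sock_path):
    """Remove sock_path if nothing is listening; return True if it was stale."""
    if not os.path.exists(sock_path):
        return False
    if socket_is_live(sock_path):
        return False
    try:
        os.remove(sock_path)
    except FileNotFoundError:
        pass  # another process removed it between the check and the remove
    return True


# Leave a socket file behind with no listener, then clean it up.
workdir = tempfile.mkdtemp()
stale_path = os.path.join(workdir, 'stale.sock')
server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(stale_path)
server.close()  # the file stays on disk, but nobody is listening
was_stale = clear_stale_socket(stale_path)
```

As the comment in the diff notes, a race between the existence check and the removal remains possible; catching ``FileNotFoundError`` only makes that race harmless for the removal itself.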
@@ -352,6 +391,14 @@ def stop():
     print(f'Beeflow has stopped. Check the log at "{beeflow_log}".')


+@app.command()
+def restart(foreground: bool = typer.Option(False, '--foreground', '-F',
+                                            help='run in the foreground')):
+    """Attempt to stop and restart the beeflow daemon."""
+    stop()
+    start(foreground)
+
+
 @app.callback(invoke_without_command=True)
 def version_callback(version: bool = False):
     """Beeflow."""
@@ -364,7 +411,6 @@ def version_callback(version: bool = False):


 def main():
     """Start the beeflow app."""
-    bc.init()
     app()

