Commit

Moved GPU driver update to past
heatherkellyucl committed Nov 18, 2024
1 parent 6d3605a commit d6aeccc
Showing 1 changed file with 4 additions and 4 deletions.
8 changes: 4 additions & 4 deletions mkdocs-project-dir/docs/Planned_Outages.md
@@ -16,15 +16,15 @@ After an outage, the first day or two back should be considered 'at risk'; that

Date | Service | Status | Reason
--------------------|---------|--------|--------
-4-8 November 2024 | Myriad, Young | Planned | New drivers being deployed on the GPU nodes in a rolling fashion. No outage of all nodes at once.

## Previous Outages

Date | Service | Status | Reason
--------------------|---------|--------|--------
-8 October 2024 | Kathleen | Planned | There will be a brief ACFS outage while we switch the network gateway from one util node to another. If all goes well it will recover quickly, but reading and writing to the ACFS will hang while it is in progress. We will do this on 8 Oct maintenance day.
-7 October 2024 | Young, Michael | Planned | Migration to new filesystem. Jobs will be drained and logins prevented for 9am. Expected to be back up the following day. Once back up, all jobs will be in `hqw` status and users will need to migrate their data from `/old_lustre/home/username` and `/old_lustre/scratch/username` which will be read-only. Users with accounts on both systems will need to log into each to see their old data, but will be copying it to the same shared home and Scratch. SSH keys will be copied to the new filesystem so login is possible. The two `/old_lustre` will remain for three months + 7 days and be removed on Tues 14 Jan 2025. [Logins enabled 8 Oct 13:30]
-2-4 October 2024 | Young, Michael | Planned | Young login02 and Michael login11 will each be out of service for a day during this period for testing updates before filesystem migration. No interruption to jobs or logins to the general addresses `young.rc.ucl.ac.uk` and `michael.rc.ucl.ac.uk`.
+4-8 November 2024 | Myriad, Young | Completed | New drivers being deployed on the GPU nodes in a rolling fashion. No outage of all nodes at once.
+8 October 2024 | Kathleen | Completed | There will be a brief ACFS outage while we switch the network gateway from one util node to another. If all goes well it will recover quickly, but reading and writing to the ACFS will hang while it is in progress. We will do this on 8 Oct maintenance day.
+7 October 2024 | Young, Michael | Completed | Migration to new filesystem. Jobs will be drained and logins prevented for 9am. Expected to be back up the following day. Once back up, all jobs will be in `hqw` status and users will need to migrate their data from `/old_lustre/home/username` and `/old_lustre/scratch/username` which will be read-only. Users with accounts on both systems will need to log into each to see their old data, but will be copying it to the same shared home and Scratch. SSH keys will be copied to the new filesystem so login is possible. The two `/old_lustre` will remain for three months + 7 days and be removed on Tues 14 Jan 2025. [Logins enabled 8 Oct 13:30]
+2-4 October 2024 | Young, Michael | Completed | Young login02 and Michael login11 will each be out of service for a day during this period for testing updates before filesystem migration. No interruption to jobs or logins to the general addresses `young.rc.ucl.ac.uk` and `michael.rc.ucl.ac.uk`.
10 September 2024 | Kathleen | Completed | Migration to new Lustre filesystem and mounting of ACFS (ARC Cluster File System) as the new backed-up location. Jobs will be drained for 8am and logins will be prevented from 9am. Expected to be back up the following day. Once back up, all jobs will be in `hqw` status and users will need to migrate their data from `/old_lustre/home/username` and `/old_lustre/scratch/username` which will be read-only. Home will no longer be backed up. `/old_lustre` will remain for three months and be removed on 11 Dec 2024.
11 June 2024 | Young | Completed | Drain of one rack for 8am to physically install new hardware. The rest of the cluster will be operating as usual. The new hardware won't be available for use for some time - will still need testing and configuring.
10 June 2024 | Michael | Completed | Full shutdown of Michael for network config alterations in preparation for new filesystem. Jobs will be drained for 8am. No logins and file access during outage. Expected to be back up later that day.