
Commit 204ea6e

Merge pull request #58 from NAG-DevOps/manual-release7.3
Manual release 7.3 updates
2 parents c15f027 + af95fb9 commit 204ea6e

29 files changed: +4230 -2615 lines

doc/appendix/faq.tex

Lines changed: 227 additions & 0 deletions
@@ -0,0 +1,227 @@
% -----------------------------------------------------------------------------
% B Frequently Asked Questions
% -----------------------------------------------------------------------------
\section{Frequently Asked Questions}
\label{sect:faqs}

% B.1 Where do I learn about Linux?
% -------------------------------------------------------------
\subsection{Where do I learn about Linux?}
\label{sect:faqs-linux}

All Speed users are expected to have a basic understanding of Linux and its commonly used commands.
Here are some recommended resources:

\paragraph*{Software Carpentry}:
Software Carpentry provides free resources for learning software skills, including a workshop on the Unix shell.
Visit \href{https://software-carpentry.org/lessons/}{Software Carpentry Lessons} to learn more.

\paragraph*{Udemy}:
There are numerous Udemy courses, including free ones, that can help you learn Linux.
Active Concordia faculty, staff, and students have access to Udemy courses.
A recommended starting point for beginners is the course ``Linux Mastery: Master the Linux Command Line in 11.5 Hours''.
Visit \href{https://www.concordia.ca/it/services/udemy.html}{Concordia's Udemy page} to learn how Concordians can access Udemy.

% B.2 How to use the bash shell on Speed?
% -------------------------------------------------------------
\subsection{How to use the bash shell on Speed?}
\label{sect:faqs-bash}

This section provides instructions on how to use the bash shell on the Speed cluster.

\subsubsection{How do I set bash as my login shell?}
Your login shell is set once for all GCS servers. To make bash your default login shell,
create a ticket with the Service Desk (or email \texttt{help at concordia.ca}) to
request that bash become the default login shell for your ENCS user account on all GCS servers.

\subsubsection{How do I move into a bash shell on Speed?}
To move into the bash shell, type \textbf{bash} at the command prompt:
\begin{verbatim}
[speed-submit] [/home/a/a_user] > bash
bash-4.4$ echo $0
bash
\end{verbatim}
\noindent\textbf{Note} how the command prompt changes from
``\verb![speed-submit] [/home/a/a_user] >!'' to ``\verb!bash-4.4$!'' after entering the bash shell.

\subsubsection{How do I use the bash shell in an interactive session on Speed?}
Below are examples of how to use \tool{bash} as the shell in your interactive job sessions
with both the \tool{salloc} and \tool{srun} commands.
\begin{itemize}
\item \texttt{salloc -ppt --mem=100G -N 1 -n 10 /encs/bin/bash}
\item \texttt{srun --mem=50G -n 5 --pty /encs/bin/bash}
\end{itemize}
\noindent\textbf{Note:} Make sure the interactive job requests the resources it needs (memory, cores, etc.).

\subsubsection{How do I run scripts written in bash on \tool{Speed}?}
To execute bash scripts on Speed:
\begin{enumerate}
\item Ensure that the shebang of your bash job script is \verb+#!/encs/bin/bash+.
\item Use the \tool{sbatch} command to submit your job script to the scheduler.
\end{enumerate}
\noindent Check the Speed GitHub repository for a \href{https://github.com/NAG-DevOps/speed-hpc/blob/master/src/bash.sh}{sample bash job script}.

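\noindent Putting the two steps together, a minimal bash job script might look like the following sketch
(the job name and resource values are placeholders, not recommendations; the maintained sample
on the Speed GitHub is authoritative):
\begin{verbatim}
#!/encs/bin/bash
#SBATCH --job-name=bash-demo   ## placeholder job name
#SBATCH --mem=1G               ## adjust to your job's needs
#SBATCH -n 1                   ## number of tasks (cores)

echo "Hello from $HOSTNAME"
\end{verbatim}
\noindent It would be submitted with \texttt{sbatch bash-demo.sh}.
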
% B.3 How to resolve ``Disk quota exceeded'' errors?
% -------------------------------------------------------------
\subsection{How to resolve ``Disk quota exceeded'' errors?}
\label{sect:quota-exceeded}

\subsubsection{Probable Cause}
The ``\texttt{Disk quota exceeded}'' error occurs when your application has
run out of disk space to write to. On \tool{Speed}, this error can be returned when:
\begin{enumerate}
\item Your NFS-provided home directory is full and cannot be written to.
You can verify this using the \tool{quota} and \tool{bigfiles} commands.
\item The \texttt{/tmp} directory on the Speed node where your application is running is full and cannot be written to.
\end{enumerate}

\subsubsection{Possible Solutions}
\begin{enumerate}
\item Use the \option{--chdir} job script option to set the job working directory.
This is the directory where the job will write output files.

\item Although local disk space is recommended for I/O-intensive operations, the
\texttt{/tmp} directory on \tool{Speed} nodes is limited to 1~TB, so it may be necessary
to store temporary data elsewhere. Review the documentation for each module
used in your script to determine how to set working directories.
The basic steps are:
\begin{itemize}
\item
Determine how to set working directories for each module used in your job script.
\item
Create a working directory in \tool{speed-scratch} for output files:
\begin{verbatim}
mkdir -m 750 /speed-scratch/$USER/output
\end{verbatim}
\item
Create a subdirectory for recovery files:
\begin{verbatim}
mkdir -m 750 /speed-scratch/$USER/recovery
\end{verbatim}
\item
Update the job script to write output to the directories created in your \tool{speed-scratch} directory,
e.g., \verb!/speed-scratch/$USER/output!.
\end{itemize}
\end{enumerate}
\noindent In the above example, \verb!$USER! is an environment variable containing your ENCS username.

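\noindent For instance, assuming the \texttt{output} directory created above, a hypothetical job script
header using \option{--chdir} might be:
\begin{verbatim}
#!/encs/bin/bash
#SBATCH --job-name=quota-safe-job
#SBATCH --chdir=/speed-scratch/<your-username>/output
\end{verbatim}
\noindent Note that SLURM does not expand environment variables such as \verb!$USER! inside
\texttt{\#SBATCH} directives, so the path must be spelled out literally.
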
\subsubsection{Example of setting working directories for \tool{COMSOL}}
\begin{itemize}
\item Create directories for recovery, temporary, and configuration files:
\begin{verbatim}
mkdir -m 750 -p /speed-scratch/$USER/comsol/{recovery,tmp,config}
\end{verbatim}
\item Add the following command switches to the COMSOL command to use the directories created above:
\begin{verbatim}
-recoverydir /speed-scratch/$USER/comsol/recovery
-tmpdir /speed-scratch/$USER/comsol/tmp
-configuration /speed-scratch/$USER/comsol/config
\end{verbatim}
\end{itemize}
\noindent In the above example, \verb!$USER! is an environment variable containing your ENCS username.

\subsubsection{Example of setting working directories for \tool{Python} modules}
By default, when loading a Python module, the \texttt{/tmp} directory is used as the temporary location for file downloads.
The size of the \texttt{/tmp} directory on \verb!speed-submit! is too small for packages such as PyTorch.
To add a Python module:
\begin{itemize}
\item Create your own \texttt{tmp} directory in your \verb!speed-scratch! directory:
\begin{verbatim}
mkdir /speed-scratch/$USER/tmp
\end{verbatim}
\item Use the temporary directory you created:
\begin{verbatim}
setenv TMPDIR /speed-scratch/$USER/tmp
\end{verbatim}
\item Attempt the installation of PyTorch.
\end{itemize}
\noindent In the above example, \verb!$USER! is an environment variable containing your ENCS username.

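\noindent The \texttt{setenv} command above is \tool{tcsh} syntax (the default login shell).
If your login shell is \tool{bash}, the equivalent is:
\begin{verbatim}
export TMPDIR=/speed-scratch/$USER/tmp
\end{verbatim}
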
% B.4 How do I check my job's status?
% -------------------------------------------------------------
\subsection{How do I check my job's status?}
\label{sect:faq-job-status}

When a job with a job ID of, e.g., 1234 is running or has terminated, you can track its status using the following commands:
\begin{itemize}
\item Use the \tool{sacct} command to view the accounting status of the job, including after it has terminated
and \tool{slurmctld} has purged it from its tracking state into the accounting database:
\begin{verbatim}
sacct -j 1234
\end{verbatim}
\item Use the \tool{squeue} command to see if the job is sitting in the queue or running:
\begin{verbatim}
squeue -j 1234
\end{verbatim}
\item Use the \tool{sstat} command to view resource-usage statistics of the job's steps while the job is running:
\begin{verbatim}
sstat -j 1234
\end{verbatim}
\end{itemize}

% B.5 Why is my job pending when nodes are empty?
% -------------------------------------------------------------
\subsection{Why is my job pending when nodes are empty?}

\subsubsection{Disabled nodes}
It is possible that one or more of the Speed nodes are disabled for maintenance.
To verify whether Speed nodes are disabled, check if they are in a draining or drained state:

\small
\begin{verbatim}
[serguei@speed-submit src] % sinfo --long --Node
Thu Oct 19 21:25:12 2023
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
speed-01 1 pa idle 32 2:16:1 257458 0 1 gpu16 none
speed-03 1 pa idle 32 2:16:1 257458 0 1 gpu32 none
speed-05 1 pg idle 32 2:16:1 515490 0 1 gpu16 none
speed-07 1 ps* mixed 32 2:16:1 515490 0 1 cpu32 none
speed-08 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE
speed-09 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE
speed-10 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE
speed-11 1 ps* idle 32 2:16:1 515490 0 1 cpu32 none
speed-12 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE
speed-15 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE
speed-16 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE
speed-17 1 pg drained 32 2:16:1 515490 0 1 gpu16 UGE
speed-19 1 ps* idle 32 2:16:1 515490 0 1 cpu32 none
speed-20 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE
speed-21 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE
speed-22 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE
speed-23 1 ps* idle 32 2:16:1 515490 0 1 cpu32 none
speed-24 1 ps* idle 32 2:16:1 515490 0 1 cpu32 none
speed-25 1 pg idle 32 2:16:1 257458 0 1 gpu32 none
speed-25 1 pa idle 32 2:16:1 257458 0 1 gpu32 none
speed-27 1 pg idle 32 2:16:1 257458 0 1 gpu32 none
speed-27 1 pa idle 32 2:16:1 257458 0 1 gpu32 none
speed-29 1 ps* idle 32 2:16:1 515490 0 1 cpu32 none
speed-30 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE
speed-31 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE
speed-32 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE
speed-33 1 ps* idle 32 2:16:1 515490 0 1 cpu32 none
speed-34 1 ps* idle 32 2:16:1 515490 0 1 cpu32 none
speed-35 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE
speed-36 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE
speed-37 1 pt idle 256 2:64:2 980275 0 1 gpu20,mi none
speed-38 1 pt idle 256 2:64:2 980275 0 1 gpu20,mi none
speed-39 1 pt idle 256 2:64:2 980275 0 1 gpu20,mi none
speed-40 1 pt idle 256 2:64:2 980275 0 1 gpu20,mi none
speed-41 1 pt idle 256 2:64:2 980275 0 1 gpu20,mi none
speed-42 1 pt idle 256 2:64:2 980275 0 1 gpu20,mi none
speed-43 1 pt idle 256 2:64:2 980275 0 1 gpu20,mi none
\end{verbatim}
\normalsize

\noindent Note which nodes are in the \textbf{drained} state.
The reason for the drained state can be found in the \textbf{REASON} column.
Your job will run once an occupied node becomes available, or once the maintenance is completed
and the disabled nodes return to the \textbf{idle} state.

\subsubsection{Error in job submit request}
It is possible that your job is pending because it requested resources that are not available within Speed.
To verify why job ID 1234 is not running, execute:
\begin{verbatim}
sacct -j 1234
\end{verbatim}

\noindent A summary of the reasons can be obtained via the \tool{squeue} command.
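\noindent For example, the pending reason for a specific job can be printed with a custom
\tool{squeue} output format (\texttt{\%r} is the reason field; the field widths here are illustrative):
\begin{verbatim}
squeue -j 1234 -o "%.18i %.9P %.8T %r"
\end{verbatim}
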

doc/appendix/history.tex

Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@
% -----------------------------------------------------------------------------
% A History
% -----------------------------------------------------------------------------
\section{History}
\label{sect:history}

% A.1 Acknowledgments
% -------------------------------------------------------------
\subsection{Acknowledgments}
\label{sect:acks}

\begin{itemize}
\item
The first 6 to 6.5 versions of this manual, the early UGE job script samples, Singularity testing, and user support
were produced by Dr.~Scott Bunnell during his time at Concordia as a part of the NAG/HPC group.
We thank him for his contributions.
\item
The HTML version with devcontainer support was contributed by Anh H Nguyen.
\item
Dr.~Tariq Daradkeh was our IT Instructional Specialist from August 2022 to September 2023,
working on the scheduler, scheduling research, end-user support, and integration of
examples, such as YOLOv3 in \xs{sect:openiss-yolov3}, and other tasks. We have a continued
collaboration on HPC/scheduling research (see~\cite{job-failure-prediction-compsysarch2024}).
\end{itemize}

% A.2 Migration from UGE to SLURM
% -------------------------------------------------------------
\subsection{Migration from UGE to SLURM}
\label{appdx:uge-to-slurm}

For long-term users who started off with Grid Engine, here are some resources
to help with the transition and the mapping of the job submission process.

\begin{itemize}
\item
Queues are called ``partitions'' in SLURM. Our mapping from the GE queues to SLURM partitions is as follows:
\begin{verbatim}
GE  =>  SLURM
s.q     ps
g.q     pg
a.q     pa
\end{verbatim}
We also have a new partition \texttt{pt} that covers SPEED2 nodes, which previously did not exist.

\item
Command and command-option mappings can be found in \xf{fig:rosetta-mappings} from:\\
\url{https://slurm.schedmd.com/rosetta.pdf}\\
\url{https://slurm.schedmd.com/pdfs/summary.pdf}\\
Other helpful resources from similar organizations that either have used SLURM for a while or also transitioned to it:\\
\url{https://docs.alliancecan.ca/wiki/Running_jobs}\\
\url{https://www.depts.ttu.edu/hpcc/userguides/general_guides/Conversion_Table_1.pdf}\\
\url{https://docs.mpcdf.mpg.de/doc/computing/clusters/aux/migration-from-sge-to-slurm}

\begin{figure}[htpb]
\includegraphics[width=\columnwidth]{images/rosetta-mapping}
\caption{Rosetta Mappings of Scheduler Commands from SchedMD}
\label{fig:rosetta-mappings}
\end{figure}

\item
\textbf{NOTE:} If you used UGE in the past, you probably still have UGE environment set-up
lines in your shell startup files; \textbf{they should now be removed}, as they have no use in SLURM and
will start giving ``command not found'' errors on login once the software is removed:

csh/\tool{tcsh}: sample \file{.tcshrc} file:
\begin{verbatim}
# Speed environment set up
if ($HOSTNAME == speed-submit.encs.concordia.ca) then
    source /local/pkg/uge-8.6.3/root/default/common/settings.csh
endif
\end{verbatim}

Bourne shell/\tool{bash}: sample \file{.bashrc} file:
\begin{verbatim}
# Speed environment set up
if [ $HOSTNAME = "speed-submit.encs.concordia.ca" ]; then
    . /local/pkg/uge-8.6.3/root/default/common/settings.sh
    printenv ORGANIZATION | grep -qw ENCS || . /encs/Share/bash/profile
fi
\end{verbatim}

\textbf{IMPORTANT NOTE:} You will need to either log out and back in, or execute a new shell,
for the environment changes in the updated \file{.tcshrc} or \file{.bashrc} file to take effect.
\end{itemize}

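As a quick cheat sheet complementing the Rosetta Stone mappings above, the most common
GE-to-SLURM command equivalents are:
\begin{verbatim}
GE              =>  SLURM
qsub script.sh      sbatch script.sh
qstat               squeue
qstat -j 1234       scontrol show job 1234
qdel 1234           scancel 1234
qhost               sinfo -N
\end{verbatim}
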
% A.3 Phases
% -------------------------------------------------------------
\subsection{Phases}
\label{sect:phases}

A brief summary of the Speed evolution phases follows.

\subsubsection{Phase 5}
Phase 5 saw the incorporation of the Salus, Magic, and Nebular
subclusters (see \xf{fig:speed-architecture-full}).

\subsubsection{Phase 4}
Phase 4 added 7 SuperMicro servers, each with 4x A100 80GB GPUs,
dubbed ``SPEED2''. We also moved from Grid Engine to SLURM.

\subsubsection{Phase 3}
Phase 3 added 4 vidpro nodes from Dr.~Amer, totalling 6x P6 and 6x V100
GPUs.

\subsubsection{Phase 2}
Phase 2 saw 6x NVIDIA Tesla P6 GPUs and 8x more compute nodes added.
The P6s replaced 4x of the FirePro S7150 GPUs.

\subsubsection{Phase 1}
Phase 1 of Speed was of the following configuration:
\begin{itemize}
\item
Sixteen 32-core nodes, each with 512~GB of memory and approximately 1~TB of volatile-scratch disk space.
\item
Five AMD FirePro S7150 GPUs, each with 8~GB of memory (compatible with the DirectX, OpenGL, OpenCL, and Vulkan APIs).
\end{itemize}
