|
| 1 | +% ----------------------------------------------------------------------------- |
| 2 | +% B Frequently Asked Questions |
| 3 | +% ----------------------------------------------------------------------------- |
| 4 | +\section{Frequently Asked Questions} |
| 5 | +\label{sect:faqs} |
| 6 | + |
| 7 | +% B.1 Where do I learn about Linux? |
| 8 | +% ------------------------------------------------------------- |
| 9 | +\subsection{Where do I learn about Linux?} |
| 10 | +\label{sect:faqs-linux} |
| 11 | + |
| 12 | +All Speed users are expected to have a basic understanding of Linux and its commonly used commands. |
| 13 | +Here are some recommended resources: |
| 14 | + |
| 15 | +\paragraph*{Software Carpentry}: |
| 16 | +Software Carpentry provides free resources to learn software, including a workshop on the Unix shell. |
| 17 | +Visit \href{https://software-carpentry.org/lessons/}{Software Carpentry Lessons} to learn more. |
| 18 | + |
| 19 | +\paragraph*{Udemy}: |
| 20 | +There are numerous Udemy courses, including free ones, that will help you learn Linux. |
| 21 | +Active Concordia faculty, staff and students have access to Udemy courses. |
| 22 | +A recommended starting point for beginners is the course ``Linux Mastery: Master the Linux Command Line in 11.5 Hours''. |
| 23 | +Visit \href{https://www.concordia.ca/it/services/udemy.html}{Concordia's Udemy page} to learn how Concordians can access Udemy. |
| 24 | + |
| 25 | +% B.2 How to use bash shell on Speed? |
| 26 | +% ------------------------------------------------------------- |
| 27 | +\subsection{How to use bash shell on Speed?} |
| 28 | +\label{sect:faqs-bash} |
| 29 | + |
| 30 | +This section explains how to use the bash shell on the Speed cluster. |
| 31 | + |
| 32 | +\subsubsection{How do I set bash as my login shell?} |
| 33 | +To set your default login shell to bash on Speed, your login shell on all GCS servers must be changed to bash. |
| 34 | +To make this change, create a ticket with the Service Desk (or email \texttt{help at concordia.ca}) to |
| 35 | +request that bash become your default login shell for your ENCS user account on all GCS servers. |
| 36 | + |
| 37 | +\subsubsection{How do I move into a bash shell on Speed?} |
| 38 | +To move to the bash shell, type \textbf{bash} at the command prompt: |
| 39 | +\begin{verbatim} |
| 40 | + [speed-submit] [/home/a/a_user] > bash |
| 41 | + bash-4.4$ echo $0 |
| 42 | + bash |
| 43 | +\end{verbatim} |
| 44 | +\noindent\textbf{Note} how the command prompt changes from |
| 45 | +``\verb![speed-submit] [/home/a/a_user] >!'' to ``\verb!bash-4.4$!'' after entering the bash shell. |
| 46 | + |
| 47 | +\subsubsection{How do I use the bash shell in an interactive session on Speed?} |
| 48 | +Below are examples of how to use \tool{bash} as a shell in your interactive job sessions |
| 49 | +with both the \tool{salloc} and \tool{srun} commands. |
| 50 | +\begin{itemize} |
| 51 | + \item \texttt{salloc -ppt --mem=100G -N 1 -n 10 /encs/bin/bash} |
| 52 | + \item \texttt{srun --mem=50G -n 5 --pty /encs/bin/bash} |
| 53 | +\end{itemize} |
| 54 | +\noindent\textbf{Note:} Make sure the interactive job requests memory, cores, etc. |
| 55 | + |
| 56 | +\subsubsection{How do I run scripts written in bash on \tool{Speed}?} |
| 57 | +To execute bash scripts on Speed: |
| 58 | +\begin{enumerate} |
| 59 | + \item Ensure that the shebang of your bash job script is \verb+#!/encs/bin/bash+ |
| 60 | + \item Use the \tool{sbatch} command to submit your job script to the scheduler. |
| 61 | +\end{enumerate} |
| 62 | +\noindent Check Speed GitHub for a \href{https://github.com/NAG-DevOps/speed-hpc/blob/master/src/bash.sh}{sample bash job script}. |
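\noindent The two steps above can be illustrated with a minimal sketch of a bash job script. The job name, the resource values, the helper name \texttt{report\_job}, and the \texttt{none} fallback are hypothetical choices made for illustration; the sample script linked above remains the reference.

```shell
#!/encs/bin/bash
#SBATCH --job-name=bash-demo   # hypothetical job name
#SBATCH --mem=1G               # illustrative memory request
#SBATCH -n 1                   # illustrative core count

# Print the Slurm job ID (or "none" when run outside the scheduler)
# together with the name of the node executing the script.
report_job() {
    echo "job ${SLURM_JOB_ID:-none} on $(uname -n)"
}

report_job
```

\noindent Assuming the script is saved as \texttt{bash-demo.sh} (a hypothetical file name), it would be submitted with \texttt{sbatch bash-demo.sh}.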
| 63 | + |
| 64 | +% B.3 How to resolve “Disk quota exceeded” errors? |
| 65 | +% ------------------------------------------------------------- |
| 66 | +\subsection{How to resolve ``Disk quota exceeded'' errors?} |
| 67 | +\label{sect:quota-exceeded} |
| 68 | + |
| 69 | +\subsubsection{Probable Cause} |
| 70 | +The ``\texttt{Disk quota exceeded}'' error occurs when your application has |
| 71 | +run out of disk space to write to. On \tool{Speed}, this error can be returned when: |
| 72 | +\begin{enumerate} |
| 73 | + \item The NFS-provided home is full and cannot be written to. |
| 74 | + You can verify this using the \tool{quota} and \tool{bigfiles} commands. |
| 75 | + \item The ``\texttt{/tmp}'' directory on the speed node where your application is running is full and cannot be written to. |
| 76 | +\end{enumerate} |
| 77 | + |
| 78 | +\subsubsection{Possible Solutions} |
| 79 | +\begin{enumerate} |
| 80 | + \item Use the \option{--chdir} job script option to set the job working directory. |
| 81 | + This is the directory where the job will write output files. |
| 82 | + |
| 83 | + \item Although local disk space is recommended for IO-intensive operations, the |
| 84 | + `\texttt{/tmp}' directory on \tool{Speed} nodes is limited to 1TB, so it may be necessary |
| 85 | + to store temporary data elsewhere. Review the documentation for each module |
| 86 | + used in your script to determine how to set working directories. |
| 87 | + The basic steps are: |
| 88 | + \begin{itemize} |
| 89 | + \item |
| 90 | + Determine how to set working directories for each module used in your job script. |
| 91 | + \item |
| 92 | + Create a working directory in \tool{speed-scratch} for output files: |
| 93 | + \begin{verbatim} |
| 94 | + mkdir -m 750 /speed-scratch/$USER/output |
| 95 | + \end{verbatim} |
| 96 | + \item |
| 97 | + Create a directory for recovery files: |
| 98 | + \begin{verbatim} |
| 99 | + mkdir -m 750 /speed-scratch/$USER/recovery |
| 100 | + \end{verbatim} |
| 101 | + \item |
| 102 | + Update the job script to write output to the directories created in your \tool{speed-scratch} directory, |
| 103 | + e.g., \verb!/speed-scratch/$USER/output!. |
| 104 | + \end{itemize} |
| 105 | +\end{enumerate} |
| 106 | +\noindent In the above example, \verb!$USER! is an environment variable containing your ENCS username. |
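\noindent For bash job scripts, the directory-creation steps above could be wrapped in a small helper. This is only a sketch: the function name \texttt{prepare\_scratch} and its single argument are assumptions made for illustration, not part of Speed's tooling.

```shell
# Create the output and recovery directories under a base directory
# and print the path of the output directory.
# Note: with -p, the 750 mode applies to the final directory only.
prepare_scratch() {
    base="${1:?usage: prepare_scratch BASE_DIR}"
    for sub in output recovery; do
        mkdir -p -m 750 "$base/$sub" || return 1
    done
    echo "$base/output"
}
```

\noindent Inside a job script it might be called as \verb!OUTDIR=$(prepare_scratch "/speed-scratch/$USER")!, after which the job writes its results under \verb!$OUTDIR!.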
| 107 | + |
| 108 | +\subsubsection{Example of setting working directories for \tool{COMSOL}} |
| 109 | +\begin{itemize} |
| 110 | + \item Create directories for recovery, temporary, and configuration files. |
| 111 | + \begin{verbatim} |
| 112 | + mkdir -m 750 -p /speed-scratch/$USER/comsol/{recovery,tmp,config} |
| 113 | + \end{verbatim} |
| 114 | + \item Add the following command switches to the COMSOL command to use the directories created above: |
| 115 | + \begin{verbatim} |
| 116 | + -recoverydir /speed-scratch/$USER/comsol/recovery |
| 117 | + -tmpdir /speed-scratch/$USER/comsol/tmp |
| 118 | + -configuration /speed-scratch/$USER/comsol/config |
| 119 | + \end{verbatim} |
| 120 | +\end{itemize} |
| 121 | +\noindent In the above example, \verb!$USER! is an environment variable containing your ENCS username. |
| 122 | + |
| 123 | +\subsubsection{Example of setting working directories for \tool{Python Modules}} |
| 124 | +By default, when a Python module is added, the \texttt{/tmp} directory is used as the temporary location for file downloads. |
| 125 | +The size of the \texttt{/tmp} directory on \verb!speed-submit! is too small for PyTorch. |
| 126 | +To add a Python module: |
| 127 | +\begin{itemize} |
| 128 | + \item Create your own tmp directory in your \verb!speed-scratch! directory: |
| 129 | + \begin{verbatim} |
| 130 | + mkdir /speed-scratch/$USER/tmp |
| 131 | + \end{verbatim} |
| 132 | + \item Use the temporary directory you created: |
| 133 | + \begin{verbatim} |
| 134 | + setenv TMPDIR /speed-scratch/$USER/tmp |
| 135 | + \end{verbatim} |
| 136 | + \item Attempt the installation of PyTorch. |
| 137 | +\end{itemize} |
| 138 | +\noindent In the above example, \verb!$USER! is an environment variable containing your ENCS username. |
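\noindent The \texttt{setenv} command above is csh/tcsh syntax. In a bash shell the same effect would be obtained with \texttt{export}. The sketch below assumes a hypothetical helper named \texttt{use\_scratch\_tmp} that creates the directory and points \texttt{TMPDIR} at it.

```shell
# Create a private temporary directory and make it the default
# temporary location for programs that honour TMPDIR.
use_scratch_tmp() {
    tmp_root="${1:?usage: use_scratch_tmp DIR}"
    mkdir -p "$tmp_root" || return 1
    export TMPDIR="$tmp_root"
}
```

\noindent For example, \verb!use_scratch_tmp "/speed-scratch/$USER/tmp"! would be run before attempting the installation.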
| 139 | + |
| 140 | +% B.4 How do I check my job's status? |
| 141 | +% ------------------------------------------------------------- |
| 142 | +\subsection{How do I check my job's status?} |
| 143 | +\label{sect:faq-job-status} |
| 144 | + |
| 145 | +When a job with a job ID of 1234 is running or has terminated, you can check its status using the following commands: |
| 146 | +\begin{itemize} |
| 147 | + \item Use the ``sacct'' command to view the status of a job: |
| 148 | + \begin{verbatim} |
| 149 | + sacct -j 1234 |
| 150 | + \end{verbatim} |
| 151 | + \item Use the ``squeue'' command to see if the job is sitting in the queue: |
| 152 | + \begin{verbatim} |
| 153 | + squeue -j 1234 |
| 154 | + \end{verbatim} |
| 155 | + \item Use the ``sstat'' command to view resource usage statistics for the job while it is running |
| 156 | + (after the job has terminated and the \tool{slurmctld} has purged it from its tracking state into the database, use ``sacct'' as shown above): |
| 157 | + \begin{verbatim} |
| 158 | + sstat -j 1234 |
| 159 | + \end{verbatim} |
| 160 | +\end{itemize} |
| 161 | + |
| 162 | +% B.5 Why is my job pending when nodes are empty? |
| 163 | +% ------------------------------------------------------------- |
| 164 | +\subsection{Why is my job pending when nodes are empty?} |
| 165 | + |
| 166 | +\subsubsection{Disabled nodes} |
| 167 | +It is possible that one or more of the Speed nodes are disabled for maintenance. |
| 168 | +To verify if Speed nodes are disabled, check if they are in a draining or drained state: |
| 169 | + |
| 170 | +\small |
| 171 | +\begin{verbatim} |
| 172 | +[serguei@speed-submit src] % sinfo --long --Node |
| 173 | +Thu Oct 19 21:25:12 2023 |
| 174 | +NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON |
| 175 | +speed-01 1 pa idle 32 2:16:1 257458 0 1 gpu16 none |
| 176 | +speed-03 1 pa idle 32 2:16:1 257458 0 1 gpu32 none |
| 177 | +speed-05 1 pg idle 32 2:16:1 515490 0 1 gpu16 none |
| 178 | +speed-07 1 ps* mixed 32 2:16:1 515490 0 1 cpu32 none |
| 179 | +speed-08 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE |
| 180 | +speed-09 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE |
| 181 | +speed-10 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE |
| 182 | +speed-11 1 ps* idle 32 2:16:1 515490 0 1 cpu32 none |
| 183 | +speed-12 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE |
| 184 | +speed-15 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE |
| 185 | +speed-16 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE |
| 186 | +speed-17 1 pg drained 32 2:16:1 515490 0 1 gpu16 UGE |
| 187 | +speed-19 1 ps* idle 32 2:16:1 515490 0 1 cpu32 none |
| 188 | +speed-20 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE |
| 189 | +speed-21 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE |
| 190 | +speed-22 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE |
| 191 | +speed-23 1 ps* idle 32 2:16:1 515490 0 1 cpu32 none |
| 192 | +speed-24 1 ps* idle 32 2:16:1 515490 0 1 cpu32 none |
| 193 | +speed-25 1 pg idle 32 2:16:1 257458 0 1 gpu32 none |
| 194 | +speed-25 1 pa idle 32 2:16:1 257458 0 1 gpu32 none |
| 195 | +speed-27 1 pg idle 32 2:16:1 257458 0 1 gpu32 none |
| 196 | +speed-27 1 pa idle 32 2:16:1 257458 0 1 gpu32 none |
| 197 | +speed-29 1 ps* idle 32 2:16:1 515490 0 1 cpu32 none |
| 198 | +speed-30 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE |
| 199 | +speed-31 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE |
| 200 | +speed-32 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE |
| 201 | +speed-33 1 ps* idle 32 2:16:1 515490 0 1 cpu32 none |
| 202 | +speed-34 1 ps* idle 32 2:16:1 515490 0 1 cpu32 none |
| 203 | +speed-35 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE |
| 204 | +speed-36 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE |
| 205 | +speed-37 1 pt idle 256 2:64:2 980275 0 1 gpu20,mi none |
| 206 | +speed-38 1 pt idle 256 2:64:2 980275 0 1 gpu20,mi none |
| 207 | +speed-39 1 pt idle 256 2:64:2 980275 0 1 gpu20,mi none |
| 208 | +speed-40 1 pt idle 256 2:64:2 980275 0 1 gpu20,mi none |
| 209 | +speed-41 1 pt idle 256 2:64:2 980275 0 1 gpu20,mi none |
| 210 | +speed-42 1 pt idle 256 2:64:2 980275 0 1 gpu20,mi none |
| 211 | +speed-43 1 pt idle 256 2:64:2 980275 0 1 gpu20,mi none |
| 212 | +\end{verbatim} |
| 213 | +\normalsize |
| 214 | + |
| 215 | +\noindent Note which nodes are in the state of \textbf{drained}. |
| 216 | +The reason for the drained state can be found in the \textbf{reason} column. |
| 217 | +Your job will run once an occupied node becomes available or the maintenance is completed, |
| 218 | +and the disabled nodes have a state of \textbf{idle}. |
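\noindent To pick the disabled nodes out of a long listing, the output can be filtered. The following is a sketch that assumes the column layout shown above (a date line, a header line, the state in the fourth column, and a single-word reason in the last column); the function name \texttt{list\_drained} is hypothetical.

```shell
# Read `sinfo --long --Node` output on standard input and print the
# name and reason of every node in a draining or drained state.
list_drained() {
    awk 'NR > 2 && $4 ~ /^drain/ { print $1, $NF }'
}
```

\noindent On the cluster it would be used as \texttt{sinfo --long --Node | list\_drained}. A multi-word reason would be shortened to its last word by this filter.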
| 219 | + |
| 220 | +\subsubsection{Error in job submit request} |
| 221 | +It is possible that your job is pending because it requested resources that are not available within Speed. |
| 222 | +To verify why job ID 1234 is not running, execute: |
| 223 | +\begin{verbatim} |
| 224 | + sacct -j 1234 |
| 225 | +\end{verbatim} |
| 226 | + |
| 227 | +\noindent A summary of the reasons can be obtained via the \tool{squeue} command. |