
Commit 204ea6e

Merge pull request #58 from NAG-DevOps/manual-release7.3
Manual release 7.3 updates
2 parents c15f027 + af95fb9 commit 204ea6e

29 files changed: +4230 -2615 lines

doc/appendix/faq.tex

Lines changed: 227 additions & 0 deletions
@@ -0,0 +1,227 @@
% -----------------------------------------------------------------------------
% B Frequently Asked Questions
% -----------------------------------------------------------------------------
\section{Frequently Asked Questions}
\label{sect:faqs}

% B.1 Where do I learn about Linux?
% -------------------------------------------------------------
\subsection{Where do I learn about Linux?}
\label{sect:faqs-linux}

All Speed users are expected to have a basic understanding of Linux and its commonly used commands.
Here are some recommended resources:

\paragraph*{Software Carpentry}:
Software Carpentry provides free resources for learning software skills, including a workshop on the Unix shell.
Visit \href{https://software-carpentry.org/lessons/}{Software Carpentry Lessons} to learn more.

\paragraph*{Udemy}:
There are numerous Udemy courses, including free ones, that can help you learn Linux.
Active Concordia faculty, staff, and students have access to Udemy courses.
A recommended starting point for beginners is the course ``Linux Mastery: Master the Linux Command Line in 11.5 Hours''.
Visit \href{https://www.concordia.ca/it/services/udemy.html}{Concordia's Udemy page} to learn how Concordians can access Udemy.

% B.2 How to use the bash shell on Speed?
% -------------------------------------------------------------
\subsection{How to use the bash shell on Speed?}
\label{sect:faqs-bash}

This section provides instructions on how to use the bash shell on the Speed cluster.

\subsubsection{How do I set bash as my login shell?}
Your login shell is set once for all GCS servers. To make bash your default login shell,
create a ticket with the Service Desk (or email \texttt{help at concordia.ca}) to
request that bash become the default login shell for your ENCS user account on all GCS servers.

\subsubsection{How do I move into a bash shell on Speed?}
To move into the bash shell, type \textbf{bash} at the command prompt:
\begin{verbatim}
[speed-submit] [/home/a/a_user] > bash
bash-4.4$ echo $0
bash
\end{verbatim}
\noindent\textbf{Note} how the command prompt changes from
``\verb![speed-submit] [/home/a/a_user] >!'' to ``\verb!bash-4.4$!'' after entering the bash shell.

\subsubsection{How do I use the bash shell in an interactive session on Speed?}
Below are examples of how to use \tool{bash} as the shell in your interactive job sessions
with both the \tool{salloc} and \tool{srun} commands.
\begin{itemize}
\item \texttt{salloc -ppt --mem=100G -N 1 -n 10 /encs/bin/bash}
\item \texttt{srun --mem=50G -n 5 --pty /encs/bin/bash}
\end{itemize}
\noindent\textbf{Note:} Make sure the interactive job requests the resources it needs (memory, cores, etc.).

\subsubsection{How do I run scripts written in bash on \tool{Speed}?}
To execute bash scripts on Speed:
\begin{enumerate}
\item Ensure that the shebang of your bash job script is \verb+#!/encs/bin/bash+.
\item Use the \tool{sbatch} command to submit your job script to the scheduler.
\end{enumerate}
\noindent Check the Speed GitHub repository for a \href{https://github.com/NAG-DevOps/speed-hpc/blob/master/src/bash.sh}{sample bash job script}.

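\noindent Putting the two steps together, a minimal bash job script might look like the following sketch
(the job name and resource values are placeholders, not recommendations; the maintained sample
on the Speed GitHub is authoritative):
\begin{verbatim}
#!/encs/bin/bash
#SBATCH --job-name=bash-demo   ## placeholder job name
#SBATCH --mem=1G               ## adjust to your job's needs
#SBATCH -n 1                   ## number of tasks (cores)

echo "Hello from $HOSTNAME"
\end{verbatim}
\noindent It would be submitted with \texttt{sbatch bash-demo.sh}.
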
% B.3 How to resolve ``Disk quota exceeded'' errors?
% -------------------------------------------------------------
\subsection{How to resolve ``Disk quota exceeded'' errors?}
\label{sect:quota-exceeded}

\subsubsection{Probable Cause}
The ``\texttt{Disk quota exceeded}'' error occurs when your application has
run out of disk space to write to. On \tool{Speed}, this error can be returned when:
\begin{enumerate}
\item Your NFS-provided home directory is full and cannot be written to.
You can verify this using the \tool{quota} and \tool{bigfiles} commands.
\item The \texttt{/tmp} directory on the Speed node where your application is running is full and cannot be written to.
\end{enumerate}

\subsubsection{Possible Solutions}
\begin{enumerate}
\item Use the \option{--chdir} job script option to set the job working directory.
This is the directory where the job will write output files.

\item Although local disk space is recommended for I/O-intensive operations, the
\texttt{/tmp} directory on \tool{Speed} nodes is limited to 1~TB, so it may be necessary
to store temporary data elsewhere. Review the documentation for each module
used in your script to determine how to set working directories.
The basic steps are:
\begin{itemize}
\item
Determine how to set working directories for each module used in your job script.
\item
Create a working directory in \tool{speed-scratch} for output files:
\begin{verbatim}
mkdir -m 750 /speed-scratch/$USER/output
\end{verbatim}
\item
Create a subdirectory for recovery files:
\begin{verbatim}
mkdir -m 750 /speed-scratch/$USER/recovery
\end{verbatim}
\item
Update the job script to write output to the directories created in your \tool{speed-scratch} directory,
e.g., \verb!/speed-scratch/$USER/output!.
\end{itemize}
\end{enumerate}
\noindent In the above example, \verb!$USER! is an environment variable containing your ENCS username.

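\noindent For instance, assuming the \texttt{output} directory created above, a hypothetical job script
header using \option{--chdir} might be:
\begin{verbatim}
#!/encs/bin/bash
#SBATCH --job-name=quota-safe-job
#SBATCH --chdir=/speed-scratch/<your-username>/output
\end{verbatim}
\noindent Note that SLURM does not expand environment variables such as \verb!$USER! inside
\texttt{\#SBATCH} directives, so the path must be spelled out literally.
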
\subsubsection{Example of setting working directories for \tool{COMSOL}}
\begin{itemize}
\item Create directories for recovery, temporary, and configuration files:
\begin{verbatim}
mkdir -m 750 -p /speed-scratch/$USER/comsol/{recovery,tmp,config}
\end{verbatim}
\item Add the following command switches to the COMSOL command to use the directories created above:
\begin{verbatim}
-recoverydir /speed-scratch/$USER/comsol/recovery
-tmpdir /speed-scratch/$USER/comsol/tmp
-configuration /speed-scratch/$USER/comsol/config
\end{verbatim}
\end{itemize}
\noindent In the above example, \verb!$USER! is an environment variable containing your ENCS username.

\subsubsection{Example of setting working directories for \tool{Python} modules}
By default, when loading a Python module, the \texttt{/tmp} directory is used as the temporary location for file downloads.
The size of the \texttt{/tmp} directory on \verb!speed-submit! is too small for packages such as PyTorch.
To add a Python module:
\begin{itemize}
\item Create your own \texttt{tmp} directory in your \verb!speed-scratch! directory:
\begin{verbatim}
mkdir /speed-scratch/$USER/tmp
\end{verbatim}
\item Use the temporary directory you created:
\begin{verbatim}
setenv TMPDIR /speed-scratch/$USER/tmp
\end{verbatim}
\item Attempt the installation of PyTorch.
\end{itemize}
\noindent In the above example, \verb!$USER! is an environment variable containing your ENCS username.

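\noindent The \texttt{setenv} command above is \tool{tcsh} syntax (the default login shell).
If your login shell is \tool{bash}, the equivalent is:
\begin{verbatim}
export TMPDIR=/speed-scratch/$USER/tmp
\end{verbatim}
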
% B.4 How do I check my job's status?
% -------------------------------------------------------------
\subsection{How do I check my job's status?}
\label{sect:faq-job-status}

When a job with a job ID of, e.g., 1234 is running or has terminated, you can track its status using the following commands:
\begin{itemize}
\item Use the \tool{sacct} command to view the accounting status of the job, including after it has terminated
and \tool{slurmctld} has purged it from its tracking state into the accounting database:
\begin{verbatim}
sacct -j 1234
\end{verbatim}
\item Use the \tool{squeue} command to see if the job is sitting in the queue or running:
\begin{verbatim}
squeue -j 1234
\end{verbatim}
\item Use the \tool{sstat} command to view resource-usage statistics of the job's steps while the job is running:
\begin{verbatim}
sstat -j 1234
\end{verbatim}
\end{itemize}

% B.5 Why is my job pending when nodes are empty?
% -------------------------------------------------------------
\subsection{Why is my job pending when nodes are empty?}

\subsubsection{Disabled nodes}
It is possible that one or more of the Speed nodes are disabled for maintenance.
To verify whether Speed nodes are disabled, check if they are in a draining or drained state:

\small
\begin{verbatim}
[serguei@speed-submit src] % sinfo --long --Node
Thu Oct 19 21:25:12 2023
NODELIST NODES PARTITION STATE CPUS S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
speed-01 1 pa idle 32 2:16:1 257458 0 1 gpu16 none
speed-03 1 pa idle 32 2:16:1 257458 0 1 gpu32 none
speed-05 1 pg idle 32 2:16:1 515490 0 1 gpu16 none
speed-07 1 ps* mixed 32 2:16:1 515490 0 1 cpu32 none
speed-08 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE
speed-09 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE
speed-10 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE
speed-11 1 ps* idle 32 2:16:1 515490 0 1 cpu32 none
speed-12 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE
speed-15 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE
speed-16 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE
speed-17 1 pg drained 32 2:16:1 515490 0 1 gpu16 UGE
speed-19 1 ps* idle 32 2:16:1 515490 0 1 cpu32 none
speed-20 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE
speed-21 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE
speed-22 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE
speed-23 1 ps* idle 32 2:16:1 515490 0 1 cpu32 none
speed-24 1 ps* idle 32 2:16:1 515490 0 1 cpu32 none
speed-25 1 pg idle 32 2:16:1 257458 0 1 gpu32 none
speed-25 1 pa idle 32 2:16:1 257458 0 1 gpu32 none
speed-27 1 pg idle 32 2:16:1 257458 0 1 gpu32 none
speed-27 1 pa idle 32 2:16:1 257458 0 1 gpu32 none
speed-29 1 ps* idle 32 2:16:1 515490 0 1 cpu32 none
speed-30 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE
speed-31 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE
speed-32 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE
speed-33 1 ps* idle 32 2:16:1 515490 0 1 cpu32 none
speed-34 1 ps* idle 32 2:16:1 515490 0 1 cpu32 none
speed-35 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE
speed-36 1 ps* drained 32 2:16:1 515490 0 1 cpu32 UGE
speed-37 1 pt idle 256 2:64:2 980275 0 1 gpu20,mi none
speed-38 1 pt idle 256 2:64:2 980275 0 1 gpu20,mi none
speed-39 1 pt idle 256 2:64:2 980275 0 1 gpu20,mi none
speed-40 1 pt idle 256 2:64:2 980275 0 1 gpu20,mi none
speed-41 1 pt idle 256 2:64:2 980275 0 1 gpu20,mi none
speed-42 1 pt idle 256 2:64:2 980275 0 1 gpu20,mi none
speed-43 1 pt idle 256 2:64:2 980275 0 1 gpu20,mi none
\end{verbatim}
\normalsize

\noindent Note which nodes are in the \textbf{drained} state.
The reason for the drained state can be found in the \textbf{REASON} column.
Your job will run once an occupied node becomes available, or once the maintenance is completed
and the disabled nodes return to the \textbf{idle} state.

\subsubsection{Error in job submit request}
It is possible that your job is pending because it requested resources that are not available within Speed.
To verify why job ID 1234 is not running, execute:
\begin{verbatim}
sacct -j 1234
\end{verbatim}

\noindent A summary of the reasons can be obtained via the \tool{squeue} command.
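\noindent For example, the pending reason for a specific job can be printed with a custom
\tool{squeue} output format (\texttt{\%r} is the reason field; the field widths here are illustrative):
\begin{verbatim}
squeue -j 1234 -o "%.18i %.9P %.8T %r"
\end{verbatim}
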

doc/appendix/history.tex

Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@
% -----------------------------------------------------------------------------
% A History
% -----------------------------------------------------------------------------
\section{History}
\label{sect:history}

% A.1 Acknowledgments
% -------------------------------------------------------------
\subsection{Acknowledgments}
\label{sect:acks}

\begin{itemize}
\item
The first 6 to 6.5 versions of this manual, the early UGE job script samples, Singularity testing, and user support
were produced by Dr.~Scott Bunnell during his time at Concordia as a part of the NAG/HPC group.
We thank him for his contributions.
\item
The HTML version with devcontainer support was contributed by Anh H Nguyen.
\item
Dr.~Tariq Daradkeh was our IT Instructional Specialist from August 2022 to September 2023,
working on the scheduler, scheduling research, end-user support, and integration of
examples, such as YOLOv3 in \xs{sect:openiss-yolov3}, and other tasks. We have a continued
collaboration on HPC/scheduling research (see~\cite{job-failure-prediction-compsysarch2024}).
\end{itemize}

% A.2 Migration from UGE to SLURM
% -------------------------------------------------------------
\subsection{Migration from UGE to SLURM}
\label{appdx:uge-to-slurm}

For long-term users who started off with Grid Engine, here are some resources
to help with the transition and the mapping of the job submission process.

\begin{itemize}
\item
Queues are called ``partitions'' in SLURM. Our mapping from the GE queues to SLURM partitions is as follows:
\begin{verbatim}
GE  =>  SLURM
s.q     ps
g.q     pg
a.q     pa
\end{verbatim}
We also have a new partition \texttt{pt} that covers SPEED2 nodes, which previously did not exist.

\item
Command and command-option mappings can be found in \xf{fig:rosetta-mappings} from:\\
\url{https://slurm.schedmd.com/rosetta.pdf}\\
\url{https://slurm.schedmd.com/pdfs/summary.pdf}\\
Other helpful resources from similar organizations that either have used SLURM for a while or also transitioned to it:\\
\url{https://docs.alliancecan.ca/wiki/Running_jobs}\\
\url{https://www.depts.ttu.edu/hpcc/userguides/general_guides/Conversion_Table_1.pdf}\\
\url{https://docs.mpcdf.mpg.de/doc/computing/clusters/aux/migration-from-sge-to-slurm}

\begin{figure}[htpb]
\includegraphics[width=\columnwidth]{images/rosetta-mapping}
\caption{Rosetta Mappings of Scheduler Commands from SchedMD}
\label{fig:rosetta-mappings}
\end{figure}

\item
\textbf{NOTE:} If you used UGE in the past, you probably still have UGE environment set-up
lines in your shell startup files; \textbf{they should now be removed}, as they have no use in SLURM and
will start giving ``command not found'' errors on login once the software is removed:

csh/\tool{tcsh}: sample \file{.tcshrc} file:
\begin{verbatim}
# Speed environment set up
if ($HOSTNAME == speed-submit.encs.concordia.ca) then
    source /local/pkg/uge-8.6.3/root/default/common/settings.csh
endif
\end{verbatim}

Bourne shell/\tool{bash}: sample \file{.bashrc} file:
\begin{verbatim}
# Speed environment set up
if [ $HOSTNAME = "speed-submit.encs.concordia.ca" ]; then
    . /local/pkg/uge-8.6.3/root/default/common/settings.sh
    printenv ORGANIZATION | grep -qw ENCS || . /encs/Share/bash/profile
fi
\end{verbatim}

\textbf{IMPORTANT NOTE:} You will need to either log out and back in, or execute a new shell,
for the environment changes in the updated \file{.tcshrc} or \file{.bashrc} file to take effect.
\end{itemize}

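As a quick cheat sheet complementing the Rosetta Stone mappings above, the most common
GE-to-SLURM command equivalents are:
\begin{verbatim}
GE              =>  SLURM
qsub script.sh      sbatch script.sh
qstat               squeue
qstat -j 1234       scontrol show job 1234
qdel 1234           scancel 1234
qhost               sinfo -N
\end{verbatim}
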
% A.3 Phases
% -------------------------------------------------------------
\subsection{Phases}
\label{sect:phases}

A brief summary of the Speed evolution phases follows.

\subsubsection{Phase 5}
Phase 5 saw the incorporation of the Salus, Magic, and Nebular
subclusters (see \xf{fig:speed-architecture-full}).

\subsubsection{Phase 4}
Phase 4 added 7 SuperMicro servers, each with 4x A100 80GB GPUs,
dubbed ``SPEED2''. We also moved from Grid Engine to SLURM.

\subsubsection{Phase 3}
Phase 3 added 4 vidpro nodes from Dr.~Amer, totalling 6x P6 and 6x V100
GPUs.

\subsubsection{Phase 2}
Phase 2 saw 6x NVIDIA Tesla P6 GPUs and 8x more compute nodes added.
The P6s replaced 4x of the FirePro S7150 GPUs.

\subsubsection{Phase 1}
Phase 1 of Speed was of the following configuration:
\begin{itemize}
\item
Sixteen 32-core nodes, each with 512~GB of memory and approximately 1~TB of volatile-scratch disk space.
\item
Five AMD FirePro S7150 GPUs, each with 8~GB of memory (compatible with the DirectX, OpenGL, OpenCL, and Vulkan APIs).
\end{itemize}
