This guide explains how to use the SCITAS Kuma GPU cluster (Spec) with:
- Python & Micromamba (a faster alternative to Conda)
- VS Code Remote Window
- PyTorch + CUDA
- Connect to the EPFL Network
- Join the HPC-LAPD Group
- SSH Access to Kuma
- File System on Kuma
- Install Micromamba & Python
- Set Up VS Code and run PyTorch (no GPU yet)
- Set Up Passwordless SSH
- Interactive GPU Access in VS Code
- Running Jobs with GPU on Kuma
Ensure you are on the EPFL network or connected via VPN to access Kuma.
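Before trying to log in, you can verify that Kuma is reachable from your current network (Linux/macOS, assuming `nc` is installed):

```bash
# A "succeeded" (or "open") message means Kuma's SSH port is reachable via the EPFL network/VPN
nc -zv kuma.hpc.epfl.ch 22
```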
- Visit EPFL Groups.
- Search for `hpc-lapd`.
- If you are not a member, contact the administrators.
- Open PowerShell (Windows) or Terminal (Linux/macOS).
- Run:

  ```bash
  ssh <username>@kuma.hpc.epfl.ch
  ```

- Example for Leo Jih-Liang Hsieh: `ssh [email protected]`
- Enter your password (characters will not be displayed as you type).
- You are now on the Kuma frontend (`kuma1` or `kuma2`), but GPU access is not yet available.
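To confirm which frontend you landed on, a plain `hostname` is enough:

```bash
hostname  # prints kuma1 or kuma2
```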
Kuma has different storage locations:
- Home directory (`/home/<username>`): limited to 100 GB per user.
- Data is deleted after 2 years of inactivity or 6 months after leaving EPFL.
- Useful commands:

  ```bash
  pwd                   # Show current directory
  cd /home/<username>   # Change to home directory (or `cd ~`)
  ls                    # List files/folders (for details: `ls -l`, `ll`, `ls -la`, `ll -a`)
  du -sh <file/folder>  # Check disk usage
  ```
- Scratch (`/scratch/<username>`): 435 TB of shared high-speed storage.
- Suitable for computation-heavy tasks.
- Files older than 30 days are automatically deleted.
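Since scratch is purged automatically, you may want to check which of your files are approaching the 30-day limit (standard `find`; the path follows the `/scratch/<username>` convention used throughout this guide):

```bash
# List files not modified in the last 30 days (candidates for automatic deletion)
find /scratch/<username> -type f -mtime +30
```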
Micromamba is a fast alternative to Conda. Follow these steps to install it in your scratch directory:
- Install Micromamba:

  ```bash
  "${SHELL}" <(curl -L micro.mamba.pm/install.sh)
  ```

  When prompted, answer:
  - Binary folder: `/scratch/<username>/.local/bin`
  - Shell initialization: `y`
  - Configure conda-forge: `y`
  - Installation prefix: `/scratch/<username>/micromamba`
- Activate Micromamba:

  ```bash
  source ~/.bashrc
  micromamba activate
  ```

  - You should see `(base)` in your terminal prompt.
- Create a Python environment:

  ```bash
  micromamba create -n my_env python=3.13
  micromamba activate my_env
  which python  # Note this path for VS Code (Example: /scratch/jlhsieh/micromamba/envs/my_env/bin/python)
  ```
- Install packages:
  - SCITAS Lmod: a tool for managing scientific software.
  - uv: extremely fast. Leo prefers `uv pip install` over `pip install`, `conda install`, and `micromamba install`.
  - PyTorch: use the PyTorch build that matches the CUDA version on Kuma.

  ```bash
  pip install uv                      # An extremely fast Python package installer
  module load gcc/13.2.0 cuda/12.4.1  # SCITAS Lmod tool
  module list                         # Check which tools have been loaded
  nvcc --version                      # Check that CUDA is available
  uv pip install torch torchvision torchaudio  # Check the PyTorch website to match the CUDA version if needed
  uv pip install ipykernel            # For Jupyter
  ```
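  A quick way to confirm the installed wheel matches the loaded CUDA toolkit (standard PyTorch attributes; nothing Kuma-specific):

  ```bash
  # Should print the PyTorch version and the CUDA version it was built against (e.g. 2.6.0+cu124 12.4)
  python -c "import torch; print(torch.__version__, torch.version.cuda)"
  ```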
- Open VS Code on your local computer.
- Click the blue Open a Remote Window button (bottom left corner).
- Click Connect to Host > Add New SSH Host.
- Enter: `ssh <username>@kuma.hpc.epfl.ch`
- Choose the current user's SSH configuration file. (Example: `/home/leohsieh/.ssh/config` or `C:\Users\leohsieh\.ssh\config`)
- Rename the `Host` in the SSH config file.
- Connect VS Code to the Remote Host.
Connect.VS.code.to.Remote.Host.mp4
- Open the `/scratch/<username>` folder in VS Code.
- Create and test a Python script (`test.py`) to check PyTorch.
```python
# %% test.py
import sys

import torch


def check_gpu() -> None:
    print(f"Python version: {sys.version}")
    print(f"PyTorch version: {torch.__version__}")
    if torch.cuda.is_available():
        print("CUDA is available. Here are the details of the CUDA devices:")
        for i in range(torch.cuda.device_count()):
            props = torch.cuda.get_device_properties(i)
            print(f"Device {i}: {torch.cuda.get_device_name(i)}")
            print(f"  CUDA Capability: {props.major}.{props.minor}, Multiprocessors: {props.multi_processor_count}")
            print("  Memory")
            print(f"    {props.total_memory / (1024 ** 3):.2f} GB: Total Memory")
            print(f"    {torch.cuda.memory_reserved(i) / (1024 ** 3):.2f} GB: PyTorch current Reserved Memory")
            print(f"    {torch.cuda.memory_allocated(i) / (1024 ** 3):.2f} GB: PyTorch current Allocated Memory")
            print(f"    {torch.cuda.max_memory_reserved(i) / (1024 ** 3):.2f} GB: PyTorch max ever Reserved Memory")
            print(f"    {torch.cuda.max_memory_allocated(i) / (1024 ** 3):.2f} GB: PyTorch max ever Allocated Memory")
    else:
        print("CUDA is NOT available")


check_gpu()
```
Create.test.py.mp4
- If your Micromamba Python interpreter is not detected, enter the Python path manually. (Example: `/scratch/jlhsieh/micromamba/envs/my_env/bin/python`)
- Try running `test.py`. You should see `PyTorch version: 2.6.0+cu124`.
- (No GPU access yet, so you'll see `CUDA is NOT available`.)
Select.Python.interpreter.mp4
Passwordless SSH is required because we will use ProxyJump to reach the GPU node later.
- Open PowerShell (Windows) or Terminal (Linux/macOS).
- Generate an SSH key pair on your local computer (one common invocation is sketched below):
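  A standard `ssh-keygen` call that produces the `my_ssh_key` file used in the following steps (the `ed25519` key type is an assumption; any type works):

  ```bash
  # Creates ~/.ssh/my_ssh_key (private key) and ~/.ssh/my_ssh_key.pub (public key)
  ssh-keygen -t ed25519 -f ~/.ssh/my_ssh_key
  ```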
- Copy the public key to Kuma:
  - Linux/macOS:

    ```bash
    ssh-copy-id -i ${HOME}/.ssh/my_ssh_key.pub <username>@kuma.hpc.epfl.ch
    ```

  - Windows PowerShell:

    ```powershell
    type $HOME\.ssh\my_ssh_key.pub | ssh <username>@kuma.hpc.epfl.ch "mkdir -p .ssh && tee -a .ssh/authorized_keys"
    ```
- Modify the SSH config file as:

  ```
  Host my-kuma-frontend
      HostName kuma.hpc.epfl.ch
      User jlhsieh
      IdentityFile ~/.ssh/my_ssh_key
      IdentitiesOnly yes

  Host my-kuma-node
      HostName kh???
      User jlhsieh
      IdentityFile ~/.ssh/my_ssh_key
      IdentitiesOnly yes
      ProxyJump my-kuma-frontend
  ```

  (`kh???` is a placeholder; fill in the GPU node name once a node is allocated.)
- Test the connection:

  ```bash
  ssh my-kuma-frontend
  ```

  - If successful, no password is required.
Use interactive sessions for testing/debugging:
- Open PowerShell (Windows) or Terminal (Linux/macOS) and connect to `my-kuma-frontend`:

  ```bash
  ssh my-kuma-frontend
  micromamba activate my_env
  module load gcc/13.2.0 cuda/12.4.1
  ```
- Request an interactive GPU node:

  ```bash
  Sinteract -p h100 -q debug -m 4G -g gpu:1 -t 0-00:10:00
  ```

  - `-p h100` (use an H100 GPU) or `-p l40s` (use an L40S GPU)
  - `-q debug` (free, but 1 hour max), `-q normal`, or `-q long` (see the QOS details and Kuma Pricing & QOS)
  - `-m 4G` (4 GB of RAM)
  - `-g gpu:1` (1 GPU) or `-g gpu:2` (2 GPUs)
  - `-t 0-00:30:00` (time duration in `D-HH:MM:SS` format)
  - For more details: `Sinteract --help` or link
- Modify the SSH config accordingly (replace `kh???` with the allocated node name; one way to find it is shown below), connect to the GPU node in VS Code, and start using the GPU.

https://github.com/user-attachments/assets/30150cce-c39e-49b3-9236-986e57b2fcc7

- The GPU node will close when the time is up or when the Terminal/PowerShell is closed.
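To find the allocated node name for the `my-kuma-node` entry, the `Squeue` wrapper used in the batch-job section works here too (`kh012` is a made-up example):

```bash
Squeue  # run on the Kuma frontend; the NODELIST column shows your node, e.g. kh012
```

Set `HostName kh012` under `Host my-kuma-node` in your local SSH config, then connect VS Code to `my-kuma-node`.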
For long-running jobs, submit batch scripts instead of using interactive mode. This way, you don't need to keep your Terminal/PowerShell open.
- Connect to the Kuma frontend node in VS Code (a GPU node is not needed).
- Create `myjob.run` in `/scratch/<username>`:

  ```bash
  #!/bin/bash
  #SBATCH --partition h100
  #SBATCH --qos debug
  #SBATCH --mem 4G
  #SBATCH --gpus 1
  #SBATCH --time 0-00:01:00

  echo "==== Start job ==================================="
  module load gcc/13.2.0 cuda/12.4.1
  micromamba run -n my_env python /scratch/jlhsieh/test.py
  echo "sleep 10 seconds"
  sleep 10
  micromamba run -n my_env python /scratch/jlhsieh/test.py
  echo "==== End job ====================================="
  ```
- For more details: link
- Open a terminal in VS Code and submit the job:

  ```bash
  sbatch myjob.run
  ```

- Check job status (see below for following the job's output):

  ```bash
  Squeue
  ```
Submit.job.mp4
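By default, Slurm writes the job's output to `slurm-<jobid>.out` in the submission directory (this guide's `myjob.run` does not override that default), so you can follow it live:

```bash
tail -f slurm-<jobid>.out  # Ctrl+C stops watching; the job itself keeps running
```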
- You can override job parameters when submitting:

  ```bash
  sbatch --partition=l40s --qos normal --mem=8G --gpus=2 --time=0-00:05:00 myjob.run  # overrides the #SBATCH parameters in myjob.run
  ```
- Now your job runs independently, even after you disconnect from Kuma.
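If you need to stop a job early (standard Slurm commands, with the job ID taken from `Squeue`):

```bash
scancel <jobid>    # Cancel one job
scancel -u $USER   # Cancel all of your jobs
```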