SCITAS Kuma User Guide

This guide explains how to use the SCITAS Kuma GPU cluster (Spec) with:

  • Python & Micromamba (a faster alternative to Conda)
  • VS Code Remote Window
  • PyTorch + CUDA

Table of Contents

  1. Connect to the EPFL Network
  2. Join the HPC-LAPD Group
  3. SSH Access to Kuma
  4. File System on Kuma
  5. Install Micromamba & Python
  6. Set Up VS Code and run PyTorch (no GPU yet)
  7. Set Up Passwordless SSH
  8. Interactive GPU Access in VS Code
  9. Running Jobs with GPU on Kuma

1. Connect to the EPFL Network

Ensure you are on the EPFL network or connected via VPN to access Kuma.

2. Join the HPC-LAPD Group

  1. Visit EPFL Groups
  2. Search for hpc-lapd.
  3. If you are not a member, contact the administrators.

3. SSH Access to Kuma

  1. Open PowerShell (Windows) or Terminal (Linux/macOS).
  2. Run: ssh <username>@kuma.hpc.epfl.ch
  3. Enter your password (characters will not be displayed as you type).
  4. You are now in the Kuma frontend (kuma1 or kuma2), but GPU access is not yet available.
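To double-check which frontend you were assigned, you can run hostname after logging in (a minimal check):

    hostname  # Prints the frontend you landed on, e.g. kuma1 or kuma2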

4. File System on Kuma

Kuma has different storage locations:

Home Directory (/home/<username>)

  • Limited to 100 GB per user.
  • Data is deleted after 2 years of inactivity or 6 months after leaving EPFL.
  • Useful commands:
    pwd                  # Show current directory
    cd /home/<username>  # Change to home directory (or `cd ~`)
    ls                   # List files/folders (use `ls -l`, `ll`, `ls -la`, or `ll -a` for details)
    du -sh <file/folder> # Check disk usage

Scratch Directory (/scratch/<username>)

  • 435 TB of shared high-speed storage.
  • Suitable for computation-heavy tasks.
  • Files older than 30 days are automatically deleted.
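Because of the 30-day purge, it is worth checking occasionally which files are close to expiring. A minimal sketch using standard find and du (the 25-day threshold is just an example):

    find /scratch/$USER -type f -mtime +25  # List files not modified in the last 25 days
    du -sh /scratch/$USER                   # Show total scratch usage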

5. Install Micromamba & Python

Micromamba is a fast alternative to Conda. Follow these steps to install it in your scratch directory:

  1. Install Micromamba:

    "${SHELL}" <(curl -L micro.mamba.pm/install.sh)
    • Binary folder: /scratch/<username>/.local/bin
    • Shell initialization: y
    • Configure conda-forge: y
    • Installation prefix: /scratch/<username>/micromamba
  2. Activate Micromamba:

    source ~/.bashrc
    micromamba activate
    • You should see (base) in your terminal.
  3. Create a Python environment:

    micromamba create -n my_env python=3.13
    micromamba activate my_env
    which python  # Note this path for VS Code (Example: `/scratch/jlhsieh/micromamba/envs/my_env/bin/python`)
  4. Install packages:

    • SCITAS Lmod: a tool for managing scientific software modules.
    • uv: an extremely fast Python package installer. Leo prefers uv pip install over pip install, conda install, or micromamba install.
    • PyTorch: use the PyTorch build that matches the CUDA version on Kuma.
    pip install uv  # An extremely fast Python package installer
    module load gcc/13.2.0 cuda/12.4.1  # SCITAS Lmod tool
    module list  # Check which modules have been loaded
    nvcc --version  # Check that the CUDA compiler is available
    uv pip install torch torchvision torchaudio  # Check the PyTorch website to match the CUDA version if needed
    uv pip install ipykernel  # For Jupyter
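As a quick sanity check of the install (a minimal sketch; the exact version strings depend on what was installed), verify that PyTorch imports and reports its CUDA build:

    python -c "import torch; print(torch.__version__, torch.version.cuda)"
    # On the frontend this should print something like `2.6.0+cu124 12.4`;
    # the GPU itself is only usable on a GPU node (see sections 8 and 9).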

6. Set Up VS Code and run PyTorch (no GPU yet)

  1. Open VS Code on your local computer.
  2. Click the blue Open a Remote Window button (bottom left corner).
  3. Click Connect to Host > Add New SSH Host.
  4. Enter: ssh <username>@kuma.hpc.epfl.ch.
  5. Choose the current user SSH configuration file.
    (Example: /home/leohsieh/.ssh/config or C:\Users\leohsieh\.ssh\config)
  6. Rename the Host in the SSH config file.
  7. Connect VS Code to the Remote Host.
(Demo video: Connect.VS.code.to.Remote.Host.mp4)
  8. Open the /scratch/<username> folder in VS Code.
  9. Create and test a Python script (test.py) to check PyTorch.
    # %% test.py
    import sys

    import torch

    def check_gpu() -> None:
        print(f"Python version: {sys.version}")
        print(f"PyTorch version: {torch.__version__}")
        if torch.cuda.is_available():
            print("CUDA is available. Here are the details of the CUDA devices:")
            for i in range(torch.cuda.device_count()):
                props = torch.cuda.get_device_properties(i)
                print(f"Device {i}: {torch.cuda.get_device_name(i)}")
                print(f"  CUDA Capability: {props.major}.{props.minor},  Multiprocessors: {props.multi_processor_count}")
                print("  Memory")
                print(f"    {props.total_memory / (1024 ** 3):.2f} GB: Total Memory")
                print(f"    {torch.cuda.memory_reserved(i) / (1024 ** 3):.2f} GB: PyTorch current Reserved Memory")
                print(f"    {torch.cuda.memory_allocated(i) / (1024 ** 3):.2f} GB: PyTorch current Allocated Memory")
                print(f"    {torch.cuda.max_memory_reserved(i) / (1024 ** 3):.2f} GB: PyTorch max ever Reserved Memory")
                print(f"    {torch.cuda.max_memory_allocated(i) / (1024 ** 3):.2f} GB: PyTorch max ever Allocated Memory")
        else:
            print("CUDA is NOT available")

    check_gpu()
(Demo video: Create.test.py.mp4)
  10. If your Micromamba Python interpreter is not detected, enter the Python path manually.
    Example: /scratch/jlhsieh/micromamba/envs/my_env/bin/python
  11. Try running test.py. You should see PyTorch version: 2.6.0+cu124.
    • (No GPU access yet, so you'll see CUDA is NOT available.)
(Demo video: Select.Python.interpreter.mp4)
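If you also want to run Jupyter notebooks in VS Code, the ipykernel package installed in step 5 lets you register the environment as a kernel (a minimal sketch; the kernel name is just an example):

    micromamba activate my_env
    python -m ipykernel install --user --name my_env  # Register the environment as a Jupyter kernel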

7. Set Up Passwordless SSH

This is required because we will use ProxyJump to reach a GPU node later.

  1. Open PowerShell (Windows) or Terminal (Linux/macOS).

  2. Generate an SSH key pair in your local computer:

    • Both Linux/macOS and Windows PowerShell
      ssh-keygen -t ed25519 -f ${HOME}/.ssh/my_ssh_key
  3. Copy the public key to Kuma:

    • Linux/macOS
      ssh-copy-id -i ${HOME}/.ssh/my_ssh_key.pub <username>@kuma.hpc.epfl.ch
    • Windows PowerShell
      type $HOME\.ssh\my_ssh_key.pub | ssh <username>@kuma.hpc.epfl.ch "mkdir -p .ssh && tee -a .ssh/authorized_keys"
  4. Modify the SSH config file as follows:

    Host my-kuma-frontend
      HostName kuma.hpc.epfl.ch
      User jlhsieh
      IdentityFile ~/.ssh/my_ssh_key
      IdentitiesOnly yes
    
    Host my-kuma-node
      HostName kh???
      User jlhsieh
      IdentityFile ~/.ssh/my_ssh_key
      IdentitiesOnly yes
      ProxyJump my-kuma-frontend
    
  5. Test the connection without password:

    ssh my-kuma-frontend
    • If successful, no password is required.
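To confirm that the key is actually being used (a minimal check; BatchMode makes ssh fail instead of falling back to a password prompt):

    ssh -o BatchMode=yes my-kuma-frontend hostname  # Succeeds only if passwordless login works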

8. Interactive GPU Access in VS Code

Use interactive sessions for testing/debugging:

  1. Open PowerShell (Windows) or Terminal (Linux/macOS) and connect to my-kuma-frontend:

    ssh my-kuma-frontend
    micromamba activate my_env
    module load gcc/13.2.0 cuda/12.4.1
  2. Request an interactive GPU node:

    Sinteract -p h100 -q debug -m 4G -g gpu:1 -t 0-00:10:00
    • -p h100 (Use H100 GPU) or -p l40s (Use L40S GPU)
    • -q debug (free but 1 hour max) or -q normal or -q long (see QOS details and Kuma Pricing & QOS)
    • -m 4G (4 GB RAM)
    • -g gpu:1 (1 GPU) or -g gpu:2 (2 GPUs)
    • -t 0-00:10:00 (time limit; format: D-HH:MM:SS)
    • For more details: Sinteract --help or link
  3. A Kuma GPU node will be assigned (Example: kh029).

  4. Modify the SSH config accordingly, connect to the GPU node in VS Code, and start using the GPU.
    (Demo video: https://github.com/user-attachments/assets/30150cce-c39e-49b3-9236-986e57b2fcc7)

    • Your GPU session ends when the time limit is up or when the Terminal/PowerShell running Sinteract is closed.
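Instead of editing the config by hand for every new session, you can update the kh??? placeholder with a one-liner (a sketch, assuming the my-kuma-node entry from section 7 and an assigned node kh029; uses GNU sed, so on macOS replace -i with -i ''):

    sed -i 's/^\([[:space:]]*HostName \)kh.*/\1kh029/' ~/.ssh/config  # Point my-kuma-node at the assigned node
    ssh my-kuma-node hostname                                         # Should print kh029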

9. Running Jobs with GPU on Kuma

For long-running jobs, submit batch scripts instead of using interactive mode. This way, you don't need to keep your Terminal/PowerShell open.

  1. Connect to the Kuma frontend node in VS Code (GPU node not needed).

  2. Create myjob.run in /scratch/<username>:

    #!/bin/bash
    #SBATCH --partition h100
    #SBATCH --qos debug
    #SBATCH --mem 4G
    #SBATCH --gpus 1
    #SBATCH --time 0-00:01:00
    
    echo "==== Start job ==================================="
    module load gcc/13.2.0 cuda/12.4.1
    micromamba run -n my_env python /scratch/jlhsieh/test.py
    echo "sleep 10 seconds"
    sleep 10
    micromamba run -n my_env python /scratch/jlhsieh/test.py
    echo "==== End job====================================="
    • For more details: link
  3. Open a terminal in VS Code and submit the job:

    sbatch myjob.run
  4. Check job status:

    Squeue
    (Demo video: Submit.job.mp4)
  5. You can override job parameters when submitting:

    sbatch --partition=l40s --qos normal --mem=8G --gpus=2 --time=0-00:05:00 myjob.run  # overrides the corresponding #SBATCH parameters in `myjob.run`
  6. Now your job runs independently, even after you disconnect from Kuma.
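By default, Slurm writes each job's output to slurm-<jobid>.out in the directory you submitted from, and sbatch prints the job ID on submission. A minimal sketch for following and cancelling a job (the job ID 12345 is just an example):

    sbatch myjob.run         # Prints e.g. `Submitted batch job 12345`
    Squeue                   # Check the job state (PD = pending, R = running)
    tail -f slurm-12345.out  # Follow the job's output as it runs
    scancel 12345            # Cancel the job if needed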
