Distributed Workloads

Artifacts for installing the Distributed Workloads stack as part of ODH.

Examples

  • Fine-Tune LLMs with Ray and DeepSpeed on OpenShift AI
  • Fine-Tune Stable Diffusion with DreamBooth and Ray Train
  • Hyperparameters Optimization with Ray Tune on OpenShift AI

Integration Tests

Prerequisites

  • Admin access to an OpenShift cluster (CRC is fine)

  • OpenDataHub or RHOAI installed, with all Distributed Workloads components enabled

  • Go 1.21 installed
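
To sanity-check these prerequisites from your shell, you can run the following (a minimal sketch; oc is the OpenShift CLI):

    go version                                # should report go1.21 or newer
    oc cluster-info                           # confirms access to the OpenShift cluster
    oc auth can-i '*' '*' --all-namespaces    # rough check for admin rights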

Common environment variables

  • CODEFLARE_TEST_OUTPUT_DIR - Output directory for test logs

  • CODEFLARE_TEST_TIMEOUT_SHORT - Timeout duration for short tasks

  • CODEFLARE_TEST_TIMEOUT_MEDIUM - Timeout duration for medium tasks

  • CODEFLARE_TEST_TIMEOUT_LONG - Timeout duration for long tasks

  • CODEFLARE_TEST_RAY_IMAGE (Optional) - Ray image used for the RayCluster configuration

  • MINIO_CLI_IMAGE (Optional) - MinIO CLI image used for uploading/downloading data to/from the S3 bucket

    NOTE: quay.io/modh/ray:2.35.0-py311-cu121 is the default image used for creating a RayCluster resource. If you have a custom Ray image that better suits your needs, specify it in the CODEFLARE_TEST_RAY_IMAGE environment variable.
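
For example, a typical environment before running any suite might look like this (values are illustrative, assuming the timeouts are Go-style duration strings):

    export CODEFLARE_TEST_OUTPUT_DIR=/tmp/distributed-workloads-logs
    export CODEFLARE_TEST_TIMEOUT_SHORT=3m
    export CODEFLARE_TEST_TIMEOUT_MEDIUM=10m
    export CODEFLARE_TEST_TIMEOUT_LONG=30m
    export CODEFLARE_TEST_RAY_IMAGE=quay.io/modh/ray:2.35.0-py311-cu121   # optional override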

Environment variables for fms-hf-tuning test suite

  • FMS_HF_TUNING_IMAGE - Image tag used in the PyTorchJob CR for model training

Environment variables for fms-hf-tuning GPU test suite

  • TEST_NAMESPACE_NAME (Optional) - Existing namespace in which the Training operator GPU tests will be executed
  • HF_TOKEN - Hugging Face token used to pull models with limited (gated) access
  • GPTQ_MODEL_PVC_NAME - Name of the PersistentVolumeClaim containing the downloaded GPTQ models
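
A sketch of the GPU suite environment (the image tag, namespace, and PVC name are placeholders):

    export FMS_HF_TUNING_IMAGE=quay.io/example/fms-hf-tuning:latest   # hypothetical image tag
    export TEST_NAMESPACE_NAME=kfto-gpu-tests                         # optional; namespace must already exist
    export HF_TOKEN=hf_xxxxxxxx                                       # your Hugging Face token
    export GPTQ_MODEL_PVC_NAME=gptq-models                            # PVC with the downloaded GPTQ models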

To upload the trained model to S3-compatible storage, set the environment variables below:

  • AWS_DEFAULT_ENDPOINT - Storage bucket endpoint to upload the trained model to; if set, the test uploads the model into the S3 bucket
  • AWS_ACCESS_KEY_ID - Storage bucket access key
  • AWS_SECRET_ACCESS_KEY - Storage bucket secret key
  • AWS_STORAGE_BUCKET - Storage bucket name
  • AWS_STORAGE_BUCKET_MODEL_PATH (Optional) - Path in the storage bucket where the trained model will be stored
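
For instance, to have the test upload the model to a MinIO deployment (endpoint, credentials, and paths are illustrative):

    export AWS_DEFAULT_ENDPOINT=https://minio.example.com:9000   # setting this enables the upload
    export AWS_ACCESS_KEY_ID=minioadmin
    export AWS_SECRET_ACCESS_KEY=minioadmin
    export AWS_STORAGE_BUCKET=models
    export AWS_STORAGE_BUCKET_MODEL_PATH=fms-hf-tuning/run-1     # optional subpath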

Environment variables for ODH integration test suite

  • ODH_NAMESPACE - Namespace where the ODH components are installed
  • NOTEBOOK_USER_NAME - Username of the user running the Workbench
  • NOTEBOOK_USER_TOKEN - Login token of the user running the Workbench
  • NOTEBOOK_IMAGE - Image used for running the Workbench
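
A minimal sketch, assuming the tests run as the user currently logged in with oc (namespace and image are placeholders):

    export ODH_NAMESPACE=opendatahub
    export NOTEBOOK_USER_NAME=$(oc whoami)                    # current user
    export NOTEBOOK_USER_TOKEN=$(oc whoami -t)                # current session token
    export NOTEBOOK_IMAGE=quay.io/example/workbench:latest    # hypothetical Workbench image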

To download the MNIST datasets used by the training script from S3-compatible storage, set the environment variables below:

  • AWS_DEFAULT_ENDPOINT - Storage bucket endpoint from which to download the MNIST datasets
  • AWS_ACCESS_KEY_ID - Storage bucket access key
  • AWS_SECRET_ACCESS_KEY - Storage bucket secret key
  • AWS_STORAGE_BUCKET - Storage bucket name
  • AWS_STORAGE_BUCKET_MNIST_DIR - Storage bucket directory from which to download the MNIST datasets
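
These reuse the same endpoint and credential variables as the upload case above; only the bucket layout differs (values are illustrative):

    export AWS_DEFAULT_ENDPOINT=https://minio.example.com:9000
    export AWS_STORAGE_BUCKET=datasets
    export AWS_STORAGE_BUCKET_MNIST_DIR=mnist    # directory inside the bucket holding the MNIST files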

Running Tests

Execute the tests like standard Go unit tests, for example:

go test -timeout 60m ./tests/kfto/
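
Standard go test flags apply, so you can also run a single test verbosely (the test name here is a placeholder):

    go test -v -timeout 60m -run '^TestName$' ./tests/kfto/   # replace TestName with an actual test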
