Releases: run-house/kubetorch

v0.5.0

18 Feb 00:00
b39d0d7

New Features

Workload CRD (#2195, #2210, #2219, #2222, #2236, #2238, #2251)

Kubetorch now stores workload information in a Kubernetes Workload CRD. This contains metadata for Kubetorch services, such as labels, paths, and images, and can be viewed and managed through standard Kubernetes tools.

To use the latest release, upgrade your kubetorch installation and ensure the CRD is installed.

BYO Manifest for arbitrary K8s resource types (#2133, #2211, #2237)

In release 0.4.0, we introduced bring-your-own (BYO) manifests for deploying custom Kubernetes manifests while retaining Kubetorch-compatible capabilities. In this release, we expand support to arbitrary Kubernetes resource types, in addition to the previously supported Kubetorch resource types. To apply this to a new resource type, specify the pod_template_path parameter when creating the Compute object:

compute = kt.Compute.from_manifest(
    manifest=my_custom_manifest,
    pod_template_path="spec.workload.template",
)
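To illustrate the parameter, here is a minimal, hypothetical sketch (not Kubetorch's actual implementation) of how a dotted path such as spec.workload.template can be resolved against a manifest dict to locate the pod template:

```python
def resolve_path(manifest: dict, dotted_path: str):
    """Walk a nested dict following a dotted key path (e.g. 'spec.workload.template')."""
    node = manifest
    for key in dotted_path.split("."):
        node = node[key]  # raises KeyError if the path does not exist
    return node

# Hypothetical custom resource with its pod template nested under spec.workload
my_custom_manifest = {
    "apiVersion": "example.io/v1",
    "kind": "MyWorkload",
    "spec": {"workload": {"template": {"spec": {"containers": []}}}},
}

template = resolve_path(my_custom_manifest, "spec.workload.template")
```

The path tells Kubetorch where, within your custom resource, the pod template lives.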

Kt Apply (#2085, #2257)

kt apply is a new CLI command to deploy existing Kubernetes manifests through Kubetorch. It automatically injects the Kubetorch server into the manifest and supports optional Dockerfile-based image setup.

# Apply a deployment manifest
kt apply deployment.yaml

# Apply with Dockerfile for image setup
kt apply deployment.yaml --dockerfile Dockerfile

# Apply with HTTP proxying enabled
kt apply fastapi-deployment.yaml --port 8000 --health-check /health

Code synchronization control (#2228, #2243)

Introduces new parameters sync_dir, remote_dir, and remote_import_path in module initialization for finer-grained control over code syncing and remote imports. Use sync_dir to specify a local directory to sync (or set it to False to skip module syncing entirely), and use remote_dir and remote_import_path to point to code that already exists on the container (mutually exclusive with sync_dir).

# Sync a specific directory
remote_fn = kt.fn(my_function, sync_dir="./src").to(compute)

# Use code already on container (e.g., from image.copy())
image = kt.Image().copy("./src")
remote_fn = kt.fn(
    my_function,
    sync_dir=False,
    remote_dir="src",
    remote_import_path="mymodule"
).to(compute)
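As a local analogy for the assumed semantics of remote_dir and remote_import_path (illustrative pure Python, not Kubetorch code): remote_dir is the directory on the container where the code lives, and remote_import_path is the module name imported from it.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Simulate code that already exists on the container under a "src" directory
remote_dir = Path(tempfile.mkdtemp()) / "src"
remote_dir.mkdir()
(remote_dir / "mymodule.py").write_text("def my_function(x):\n    return x * 2\n")

# remote_import_path names the module importable from that directory
sys.path.insert(0, str(remote_dir))
mod = importlib.import_module("mymodule")
result = mod.my_function(21)  # 42
```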

Improvements

  • Kill processes better (#2135)
  • Scale data store (#2185)
  • Update image to include dockerfile contents upon setup and .to (#2127, #2194)
  • kt describe to include the ingress if configured (#2191)
  • Allow setting kt config values to None (#2190)
  • Retry connection when hitting a RemoteProtocolError (#2204)
  • Split data store helm chart resources (#2216)
  • Add configuration for controller uvicorn worker count (#2217)
  • Update app http health check params (#2225)
  • Hide pod names by default for kt list (#2218)
  • Add release namespace for data store deployments (#2248)
  • Split up module pointers (#2227)
  • Add support for serialization="none" (#2231)
  • Reduce poll interval for faster service readiness detection (#2232)
  • Eagerly load callable at subprocess startup (#2234)
  • Add callable_name property for modules (#2252)
  • Add more helpful logging for rsync errors (#2255)
  • Fix a few user facing type check errors (#2256)

Deprecations

  • Deprecate image rsync in favor of copy (#2127)

BC-Breaking

  • Require rsync 3.2.0+ instead of falling back to manual directory creation
  • Refactor teardown method (#2262)
    • --force/-f flag no longer deletes without confirmation. --force indicates force-deleting the resource, but the user will still be prompted for confirmation unless the --yes/-y flag is provided.
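The new flag semantics can be sketched as follows (hypothetical logic for illustration, not the CLI's source):

```python
def should_teardown(force: bool, yes: bool, prompt) -> bool:
    """--yes skips the confirmation prompt; --force only controls *how* the
    resource is deleted, not whether confirmation is required."""
    if yes:
        return True
    return prompt()  # the user is still asked, even when force=True

# --force alone still prompts; --force --yes proceeds immediately
assert should_teardown(force=True, yes=False, prompt=lambda: False) is False
assert should_teardown(force=True, yes=True, prompt=lambda: False) is True
```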

Bug Fixes

  • Fix controller connection scaling (#2184)
  • Fix noisy websocket error logging during shutdown (#2193)
  • Fix rerun errors by appending launch_id (#2223)
  • Add markers to support decorating modules (#2235)
  • Fix for single-file rsync and dockerfile absolute rsync check (#2241, #2245)
  • Fix EADDRINUSE errno check for macOS compatibility (#2265)

v0.4.1

21 Jan 19:05
08af535

Improvements

  • Global helm flags (#2168)
  • Remove service name from data store (#2173)
  • Set logging config level based on env var (#2172)
  • Propagate user annotations defined in kt.Compute (#2171)

Bug Fixes

  • Fix kt logs --tail (#2166)
  • Fix process group formation for data syncing (#2176)
  • List secrets in deployed namespaces only (#2177)
  • List volumes in deployed namespaces only (#2179)

v0.4.0

19 Jan 18:48
5ec2420

Kubetorch Controller (#1947)

This release introduces the Kubetorch Controller, a new cluster-side component that eliminates the need for local kubeconfig files and simplifies authentication — the Python client now communicates with your cluster through a centralized controller endpoint.

  • Simplified setup: No more managing kubeconfig files or local Kubernetes client dependencies
  • Unified API access: All resource types (Deployments, Knative Services, RayClusters, Kubeflow Training Jobs, or other arbitrary CRDs) are managed through a single endpoint

Standard Workflow (#2041, #2076, #2079, #2081)

The familiar Kubetorch experience — declare your compute requirements and let Kubetorch handle everything:

import kubetorch as kt

# Define compute requirements
compute = kt.Compute(cpus="2", memory="4Gi", gpus=1)

# Deploy your function
remote_fn = kt.fn(my_func).to(compute)
result = remote_fn(1, 2)

Kubetorch automatically:

  • Generates the appropriate Kubernetes manifest (Deployment, Knative Service, etc.)
  • Deploys and manages the workload
  • Creates service routing
  • Handles code syncing and remote execution

Bring Your Own Manifest (#1914, #1916, #1935, #1960)

Full support for deploying custom Kubernetes manifests while still leveraging Kubetorch's module system, code syncing, and remote execution capabilities.

Use Cases

(1) Provide your own K8s manifest and let Kubetorch manage the deployment

import kubetorch as kt

# Your custom deployment manifest
my_manifest = {"apiVersion": "apps/v1", "kind": "Deployment", ...}

# KT applies the manifest and creates the routing service
compute = kt.Compute.from_manifest(
    manifest=my_manifest,
    selector={"app": "my-workers"}
)
remote_fn = kt.fn(my_func).to(compute)
result = remote_fn(1, 2)

(2) Apply the manifest separately

Point Kubetorch at resources deployed via kubectl (or another tool):

# User already deployed pods with label app=workers via kubectl
# KT only registers the pool and creates routing

compute = kt.Compute(selector={"app": "workers", "team": "ml"})
remote_fn = kt.fn(my_func).to(compute)
result = remote_fn(1, 2)

(3) Provide a service endpoint

Use your own ingress, service mesh, or load balancer:

compute = kt.Compute.from_manifest(
    manifest=my_manifest,
    selector={"app": "my-workers"},
    endpoint=kt.Endpoint(url="http://my-svc.my-namespace.svc.cluster.local:8080")
)

(4) Custom endpoint selector for routing

Route to a subset of your pods (e.g., only worker pods, not master):

compute = kt.Compute.from_manifest(
    manifest=pytorch_job_manifest,
    selector={"job-name": "my-job"}, 
    endpoint=kt.Endpoint(selector={"job-name": "my-job", "replica-type": "worker"})  # Route: workers only
)

Highlights

  • Compute.from_manifest(): Create a Compute object from any existing Kubernetes manifest (Deployment, Knative Service, RayCluster, or Kubeflow Training Jobs)
  • Kubeflow v1 Training Jobs: Native support for PyTorchJob, TFJob, MXJob, and XGBoostJob with automatic distributed execution
  • Property overrides: Modify CPU, memory, replicas, image, env vars, and other settings on imported manifests using standard Compute properties
  • Custom service managers: Extensible architecture for adding support for additional Kubernetes resource types
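To make the property-override idea concrete, here is a hedged sketch of what applying Compute-style overrides to an imported Deployment manifest might look like under the hood (the field paths are standard Kubernetes; the merge logic is an assumption, not Kubetorch's code):

```python
import copy

def apply_overrides(manifest, cpus=None, memory=None, replicas=None):
    """Apply Compute-style overrides onto a Deployment manifest's first container."""
    out = copy.deepcopy(manifest)  # leave the imported manifest untouched
    if replicas is not None:
        out["spec"]["replicas"] = replicas
    container = out["spec"]["template"]["spec"]["containers"][0]
    requests = container.setdefault("resources", {}).setdefault("requests", {})
    if cpus is not None:
        requests["cpu"] = cpus
    if memory is not None:
        requests["memory"] = memory
    return out

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "spec": {
        "replicas": 1,
        "template": {"spec": {"containers": [{"name": "worker", "image": "my-image"}]}},
    },
}

updated = apply_overrides(deployment, cpus="2", memory="4Gi", replicas=3)
```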

Improvements

Performance & Reliability

  • Add retry logic for transient rsync errors (#2061)
  • Increase loki stream limits (#2063)
  • Improve service teardown reliability (#2057)
  • PDB websocket cleanup (#2056)
  • Remove default httpx client timeouts for long running connections (#2122)
  • Update TTL support to scrape pod metrics via Prometheus (#2149)

Distributed Training

  • Auto-enable distributed execution for training jobs with more than one replica (#2111)
  • Cloud agnostic DCGM config (#2114)

Architecture & Refactoring

  • Move all module calls into subprocesses (#2069, #2070)
  • Consolidate launch time updates to service manager (#2046)
  • Remove env vars from manifest and reverse websocket connection from launched pods to Kubetorch controller (#2081)
  • Simplify http server launched on Kubetorch workload pods (#2108)
  • Replace requests with httpx for improved HTTP client functionality (#2107)
  • Remove unused launch params (#2130)
  • Remove unused controller client APIs (#2138)
  • Remove helm limits (#2146)

Bug Fixes

  • Always start Ray on head node where relevant for BYO manifests (#2046)
  • Properly stream logs for kt run app (#2103, #2121)
  • Fix kt app liveness check and logging config (#2105)
  • Fix noisy log streaming and event loop blocking (#2094)
  • Persist serialization type when reloading from an existing manifest (#2110)
  • Fix sys.path for scripts in subdirectories to enable sibling package imports (#2140)
  • Fix log streaming duplication (#2146)
  • Fix serialization check (#2148)

v0.3.0

30 Dec 16:44
5ece2cc

This release introduces the Kubetorch Data Store, a unified cluster data transfer system for seamless data movement between your local machine and Kubernetes pods.

Kubetorch Data Store

The data store provides a unified kt.put() and kt.get() API for both filesystem and GPU data. It solves two critical gaps in Kubernetes for machine learning:

  1. Fast deployment: Sync code and data to your cluster instantly via rsync - no container rebuilds necessary
  2. In-cluster data sharing: Peer-to-peer data transfer between pods with automatic caching and discovery - the "object store" functionality that Ray users miss

(#1994, #1989, #1988, #1987, #1985, #1982, #1981, #1979, #1933, #1932, #1929, #1893, #1997)

Highlights

  • Unified API: Single kt.put()/kt.get() interface handles both filesystem data (files/directories via rsync) and GPU data (CUDA tensors via NCCL broadcast)
  • No kubeconfig required: The Python client communicates through a centralized controller endpoint
  • Peer-to-peer optimization: Intelligent routing that tries pod-to-pod transfers first before falling back to the central store
  • GPU tensor & state dict transfers: First-class support for CUDA tensor broadcasting via NCCL, including efficient packed transfers for model state dicts
  • Broadcast coordination: BroadcastWindow enables coordinated multi-pod transfers with configurable quorum, fanout, and tree-based propagation

Example Usage

import kubetorch as kt

# Filesystem data
kt.put("my-service/weights", src="./model_weights/")
kt.get("my-service/weights", dest="./local_copy/")

# GPU tensors (NCCL broadcast)
kt.put("checkpoint", data=model.state_dict(), broadcast=kt.BroadcastWindow(world_size=2))
kt.get("checkpoint", dest=dest_state_dict, broadcast=kt.BroadcastWindow(world_size=2))

# List and manage keys
kt.ls("my-service/")
kt.rm("my-service/old-checkpoint")

See the docs for more info.

Improvements

  • Remove queue and scheduler from Compute configuration options (#1968)
  • Added kt teardown support for training jobs (#1986)
  • Updates to metrics streaming output (#1984)
  • Remove OTEL as a Helm dependency (#2016, #2022)
  • Allow custom annotations for Kubetorch service account configuration (#2009)

Bug Fixes

  • Use correct container name when querying logs for Kubetorch services (#1972)
  • Prevent events and logs from printing on same line (#2008)
  • Async lifecycle management and cleanup (#2028)
  • Start Ray on head node if no distributed config provided with BYO manifest (#2046)
  • Handle image pull errors when checking for knative service readiness (#2050)
  • Control over autoscaler pod eviction behavior for distributed jobs (#2052)

v0.2.9

09 Dec 00:37
452213e

Improvements

  • Introduce global LoggingConfig for easier control of logging behavior with Kubetorch services (#1959)
  • Simplify compute spec requirements in factory and constructor (#1963)

Bug Fixes

  • Prevent log recursion errors (#1957)
  • Metrics streaming for async service calls (#1958)
  • Specify limits in pod template when requesting GPUs (#1965)

v0.2.8

05 Dec 13:56
a2d81b9

Improvements

  • Support for loading all pod IPs for distributed workloads (#1937)
  • Add app and deployment id labels for easier querying of all Kubetorch deployments (#1940)
  • Improve pdb debugging (#1950)
  • Remove resource limits for workloads (#1949)
  • Set allowed serialization methods using local environment variable (#1951)

Bug Fixes

  • Fix pdb support and other query params for callables (#1939)
  • Setting Kubetorch volume mount paths on creation & reload (#1944)
  • Fix to_async() when get_if_exists is set to True (#1945)

v0.2.7

26 Nov 23:00
85402c8

Improvements

  • Added kt logs CLI command to stream and follow Kubetorch deployment logs (including distributed deployments) (#1928)
  • Add affinity and tolerations for rsync and nginx proxy deployments (#1923)
  • Filter out metric service calls from kubetorch pods (#1918)

Bug Fixes

  • Always SSH into head node for Ray deployments (#1931)
  • Metrics streaming for async calls (#1924)

v0.2.6

20 Nov 20:43
67adf58

Improvements

  • Expanded python version support (#1907)
  • Support helm installation in any namespace for better isolation (#1913)
  • Suppress metrics log checks in kubetorch pod logs (#1918)
  • Relax cluster readiness checks (#1919)

Bug Fixes

  • Template label parsing (#1908)
  • Deploy headless service for distributed use cases only (#1920)

v0.2.5

13 Nov 22:08
1b2d670

New Features

Notebook Integration

  • Added kt notebook CLI command to launch a JupyterLab instance connected directly to your Kubetorch services (#1890)
  • You can now send Kubetorch functions defined inside local Jupyter notebooks to run on your cluster — no extra setup needed (#1892)

Improvements

  • Simplified kubetorch[client] dependencies (#1896)
  • Faster kt list CLI loading (#1905)

Bug Fixes

  • Module and submodule reimporting on the cluster (#1902)

v0.2.4

12 Nov 09:52
8f86a29

New Features

Metrics Streaming

Kubetorch now supports real-time metrics streaming during service execution.
While your service runs, you can watch live resource usage directly in your terminal, including:

  • CPU utilization (per service or pod)
  • Memory consumption (MiB)
  • GPU metrics (DCGM-based utilization and memory usage, where relevant)

This feature makes it easier to monitor performance, detect bottlenecks, and verify resource scaling in real time.

Related PRs: #1856, #1867, #1881, #1887

Note: To disable metrics collection, set metrics.enabled to false in the values.yaml of the Helm chart.

Improvements

  • Helm chart cleanup of deprecated kubetorch config values (#1865)
  • Convert cluster scoped RBAC to namespace scoped (#1864, #1861)
  • Logging: updating callable name for clarity (#1876)

Bug Fixes

  • Fix dockerfile sync when running kt app (#1863)
  • kt config set/unset to only update specific config keys (#1882)
  • Reload cached submodules when reimporting kubetorch module on the server (#1883)