Releases: run-house/kubetorch
v0.5.0
New Features
Workload CRD (#2195, #2210, #2219, #2222, #2236, #2238, #2251)
Kubetorch now stores workload information in a Kubernetes Workload CRD. The CRD holds metadata for Kubetorch services, such as labels, paths, and images, and can be viewed and managed through standard Kubernetes tools.
To use the latest release, upgrade your kubetorch installation and ensure the CRD is installed.
BYO Manifest for arbitrary K8s resource types (#2133, #2211, #2237)
In release 0.4.0, we introduced bring-your-own (BYO) manifests for deploying custom Kubernetes manifests while retaining Kubetorch-compatible capabilities. This release extends support from the previously supported Kubetorch resource types to arbitrary Kubernetes resource types. To apply this to a new resource type, specify the `pod_template_path` parameter when creating the Compute object:
```python
compute = kt.Compute.from_manifest(
    manifest=my_custom_manifest,
    pod_template_path="spec.workload.template",
)
```

Kt Apply (#2085, #2257)
`kt apply` is a new CLI command for deploying existing Kubernetes manifests through Kubetorch. It automatically injects the Kubetorch server into the manifest at startup and supports optional Dockerfile-based image setup.
```bash
# Apply a deployment manifest
kt apply deployment.yaml

# Apply with Dockerfile for image setup
kt apply deployment.yaml --dockerfile Dockerfile

# Apply with HTTP proxying enabled
kt apply fastapi-deployment.yaml --port 8000 --health-check /health
```

Code synchronization control (#2228, #2243)
Introduces new parameters `sync_dir`, `remote_dir`, and `remote_import_path` in module initialization for finer-grained control over code syncing and remote imports. Use `sync_dir` to specify a local directory to sync (or set it to `False` to skip module syncing entirely), and use `remote_dir` and `remote_import_path` to point to code that already exists on the container (mutually exclusive with `sync_dir`).
```python
# Sync a specific directory
remote_fn = kt.fn(my_function, sync_dir="./src").to(compute)

# Use code already on container (e.g., from image.copy())
image = kt.Image().copy("./src")
remote_fn = kt.fn(
    my_function,
    sync_dir=False,
    remote_dir="src",
    remote_import_path="mymodule"
).to(compute)
```

Improvements
- Kill processes better (#2135)
- Scale data store (#2185)
- Update image to include Dockerfile contents upon setup and `.to` (#2127, #2194)
- Update `kt describe` to include the ingress if configured (#2191)
- Allow setting kt config values to `None` (#2190)
- Retry connection when hitting a RemoteProtocolError (#2204)
- Split data store helm chart resources (#2216)
- Add configuration for controller uvicorn worker count (#2217)
- Update app http health check params (#2225)
- Hide pod names by default for kt list (#2218)
- Add release namespace for data store deployments (#2248)
- Split up module pointers (#2227)
- Add support for `serialization="none"` (#2231)
- Reduce poll interval for faster service readiness detection (#2232)
- Eagerly load callable at subprocess startup (#2234)
- Add callable_name property for modules (#2252)
- Add more helpful logging for rsync errors (#2255)
- Fix a few user facing type check errors (#2256)
Deprecations
- Deprecate image rsync in favor of copy (#2127)
BC-Breaking
- Require rsync 3.2.0+ instead of falling back to manual directory creation
- Refactor teardown method (#2262)
- The `--force`/`-f` flag no longer deletes without confirmation. `--force` indicates force deleting the resource, but the user will still be prompted for confirmation unless the `--yes`/`-y` flag is provided.
Bug Fixes
- Fix controller connection scaling (#2184)
- Fix noisy websocket error logging during shutdown (#2193)
- Fix rerun errors by appending launch_id (#2223)
- Add markers to support decorating modules (#2235)
- Fix for single-file rsync and dockerfile absolute rsync check (#2241, #2245)
- Fix EADDRINUSE errno check for macOS compatibility (#2265)
v0.4.1
v0.4.0
Kubetorch Controller (#1947)
This release introduces the Kubetorch Controller, a new cluster-side component that eliminates the need for local kubeconfig files and simplifies authentication — the Python client now communicates with your cluster through a centralized controller endpoint.
- Simplified setup: No more managing kubeconfig files or local Kubernetes client dependencies
- Unified API access: All resource types (Deployments, Knative Services, RayClusters, Kubeflow Training Jobs, or other arbitrary CRDs) are managed through a single endpoint
Standard Workflow (#2041, #2076, #2079, #2081)
The familiar Kubetorch experience — declare your compute requirements and let Kubetorch handle everything:
```python
import kubetorch as kt

# Define compute requirements
compute = kt.Compute(cpus="2", memory="4Gi", gpus=1)

# Deploy your function
remote_fn = kt.fn(my_func).to(compute)
result = remote_fn(1, 2)
```

Kubetorch automatically:
- Generates the appropriate Kubernetes manifest (Deployment, Knative Service, etc.)
- Deploys and manages the workload
- Creates service routing
- Handles code syncing and remote execution
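As a rough mental model for the first step (manifest generation), the compute spec maps onto container resource requests. Below is a simplified standalone sketch of that mapping, not Kubetorch's actual generator; `to_resource_requests` is a hypothetical helper introduced here for illustration:

```python
def to_resource_requests(cpus=None, memory=None, gpus=None):
    """Map a high-level compute spec onto Kubernetes container resource requests.

    Hypothetical helper for illustration only, not part of the Kubetorch API.
    """
    requests = {}
    if cpus is not None:
        requests["cpu"] = str(cpus)
    if memory is not None:
        requests["memory"] = memory
    if gpus is not None:
        # NVIDIA GPUs are requested via the nvidia.com/gpu extended resource
        requests["nvidia.com/gpu"] = str(gpus)
    return requests

# Mirrors the compute spec above: kt.Compute(cpus="2", memory="4Gi", gpus=1)
print(to_resource_requests(cpus="2", memory="4Gi", gpus=1))
```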
Bring Your Own Manifest (#1914, #1916, #1935, #1960)
Full support for deploying custom Kubernetes manifests while still leveraging Kubetorch's module system, code syncing, and remote execution capabilities.
Use Cases
(1) Provide your own K8s manifest and let Kubetorch manage the deployment
```python
import kubetorch as kt

# Your custom deployment manifest
my_manifest = {"apiVersion": "apps/v1", "kind": "Deployment", ...}

# KT applies the manifest and creates the routing service
compute = kt.Compute.from_manifest(
    manifest=my_manifest,
    selector={"app": "my-workers"}
)
remote_fn = kt.fn(my_func).to(compute)
result = remote_fn(1, 2)
```

(2) Apply the manifest separately
Point Kubetorch at resources deployed via kubectl (or another tool):
```python
# User already deployed pods with label app=workers via kubectl
# KT only registers the pool and creates routing
compute = kt.Compute(selector={"app": "workers", "team": "ml"})
remote_fn = kt.fn(my_func).to(compute)
result = remote_fn(1, 2)
```

(3) Provide a service endpoint
Use your own ingress, service mesh, or load balancer:
```python
compute = kt.Compute.from_manifest(
    manifest=my_manifest,
    selector={"app": "my-workers"},
    endpoint=kt.Endpoint(url="http://my-svc.my-namespace.svc.cluster.local:8080")
)
```

(4) Custom endpoint selector for routing
Route to a subset of your pods (e.g., only worker pods, not master):
```python
compute = kt.Compute.from_manifest(
    manifest=pytorch_job_manifest,
    selector={"job-name": "my-job"},
    endpoint=kt.Endpoint(selector={"job-name": "my-job", "replica-type": "worker"})  # Route: workers only
)
```

Highlights
- `Compute.from_manifest()`: Create a Compute object from any existing Kubernetes manifest (Deployment, Knative Service, RayCluster, or Kubeflow Training Jobs)
- Kubeflow v1 Training Jobs: Native support for PyTorchJob, TFJob, MXJob, and XGBoostJob with automatic distributed execution
- Property overrides: Modify CPU, memory, replicas, image, env vars, and other settings on imported manifests using standard Compute properties
- Custom service managers: Extensible architecture for adding support for additional Kubernetes resource types
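To make the property-override idea concrete, here is a standalone sketch of patching common fields on a Deployment-style manifest's pod template. This is illustrative only; `apply_overrides` is a hypothetical helper, not the Kubetorch implementation, which exposes these as standard Compute properties:

```python
def apply_overrides(manifest, cpus=None, memory=None, image=None, env=None):
    """Patch common fields on the first container of a Deployment-style manifest.

    Hypothetical helper for illustration only, not part of the Kubetorch API.
    """
    container = manifest["spec"]["template"]["spec"]["containers"][0]
    requests = container.setdefault("resources", {}).setdefault("requests", {})
    if cpus is not None:
        requests["cpu"] = cpus
    if memory is not None:
        requests["memory"] = memory
    if image is not None:
        container["image"] = image
    if env is not None:
        container["env"] = [{"name": k, "value": v} for k, v in env.items()]
    return manifest

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "spec": {"template": {"spec": {"containers": [{"name": "worker", "image": "old:tag"}]}}},
}
patched = apply_overrides(deployment, cpus="2", memory="4Gi", image="repo/worker:1.0")
```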
Improvements
Performance & Reliability
- Add retry logic for transient rsync errors (#2061)
- Increase loki stream limits (#2063)
- Improve service teardown reliability (#2057)
- PDB websocket cleanup (#2056)
- Remove default httpx client timeouts for long running connections (#2122)
- Update TTL support to scrape pod metrics via Prometheus (#2149)
Distributed Training
- Auto-enable distributed execution for training jobs with more than one replica (#2111)
- Cloud agnostic DCGM config (#2114)
Architecture & Refactoring
- Move all module calls into subprocesses (#2069, #2070)
- Consolidate launch time updates to service manager (#2046)
- Remove env vars from manifest and reverse websocket connection from launched pods to Kubetorch controller (#2081)
- Simplify http server launched on Kubetorch workload pods (#2108)
- Replace requests with httpx for improved HTTP client functionality (#2107)
- Remove unused launch params (#2130)
- Remove unused controller client APIs (#2138)
- Remove helm limits (#2146)
Bug Fixes
- Always start Ray on head node where relevant for BYO manifests (#2046)
- Properly stream logs for `kt run` app (#2103, #2121)
- Fix `kt app` liveness check and logging config (#2105)
- Fix noisy log streaming and event loop blocking (#2094)
- Persist serialization type when reloading from an existing manifest (#2110)
- Fix `sys.path` for scripts in subdirectories to enable sibling package imports (#2140)
- Fix log streaming duplication (#2146)
- Fix serialization check (#2148)
v0.3.0
This release introduces the Kubetorch Data Store, a unified cluster data transfer system for seamless data movement between your local machine and Kubernetes pods.
Kubetorch Data Store
The data store provides a unified kt.put() and kt.get() API for both filesystem and GPU data. It solves two critical gaps in Kubernetes for machine learning:
- Fast deployment: Sync code and data to your cluster instantly via rsync - no container rebuilds necessary
- In-cluster data sharing: Peer-to-peer data transfer between pods with automatic caching and discovery - the "object store" functionality that Ray users miss
(#1994, #1989, #1988, #1987, #1985, #1982, #1981, #1979, #1933, #1932, #1929, #1893, #1997)
Highlights
- Unified API: A single `kt.put()`/`kt.get()` interface handles both filesystem data (files/directories via rsync) and GPU data (CUDA tensors via NCCL broadcast)
- No kubeconfig required: The Python client communicates through a centralized controller endpoint
- Peer-to-peer optimization: Intelligent routing that tries pod-to-pod transfers first before falling back to the central store
- GPU tensor & state dict transfers: First-class support for CUDA tensor broadcasting via NCCL, including efficient packed transfers for model state dicts
- Broadcast coordination: `BroadcastWindow` enables coordinated multi-pod transfers with configurable quorum, fanout, and tree-based propagation
Example Usage
```python
import kubetorch as kt

# Filesystem data
kt.put("my-service/weights", src="./model_weights/")
kt.get("my-service/weights", dest="./local_copy/")

# GPU tensors (NCCL broadcast)
kt.put("checkpoint", data=model.state_dict(), broadcast=kt.BroadcastWindow(world_size=2))
kt.get("checkpoint", dest=dest_state_dict, broadcast=kt.BroadcastWindow(world_size=2))

# List and manage keys
kt.ls("my-service/")
kt.rm("my-service/old-checkpoint")
```

See the docs for more info.
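For intuition on the tree-based propagation mentioned above: in each round, every pod that already holds the data can send it to a fixed number of peers, so coverage grows geometrically and n pods are reached in O(log n) rounds. A standalone back-of-the-envelope sketch, not the `BroadcastWindow` implementation:

```python
def broadcast_rounds(world_size, fanout):
    """Rounds needed for tree-based propagation: each holder sends to
    `fanout` peers per round, so the holder count grows (1 + fanout)x per round.

    Illustrative model only, not Kubetorch's actual broadcast logic.
    """
    holders, rounds = 1, 0
    while holders < world_size:
        holders += holders * fanout
        rounds += 1
    return rounds

# e.g. 16 pods with fanout 3: holders go 1 -> 4 -> 16, i.e. 2 rounds
print(broadcast_rounds(16, 3))
```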
Improvements
- Remove queue and scheduler from Compute configuration options (#1968)
- Added `kt teardown` support for training jobs (#1986)
- Updates to metrics streaming output (#1984)
- Remove OTEL as a Helm dependency (#2016, #2022)
- Allow custom annotations for Kubetorch service account configuration (#2009)
Bug Fixes
- Use correct container name when querying logs for Kubetorch services (#1972)
- Prevent events and logs from printing on same line (#2008)
- Async lifecycle management and cleanup (#2028)
- Start Ray on head node if no distributed config provided with BYO manifest (#2046)
- Handle image pull errors when checking for knative service readiness (#2050)
- Control over autoscaler pod eviction behavior for distributed jobs (#2052)
v0.2.9
v0.2.8
Improvements
- Support for loading all pod IPs for distributed workloads (#1937)
- Add app and deployment id labels for easier querying of all Kubetorch deployments (#1940)
- Improve pdb debugging (#1950)
- Remove resource limits for workloads (#1949)
- Set allowed serialization methods using local environment variable (#1951)
v0.2.7
v0.2.6
v0.2.5
New Features
Notebook Integration
- Added `kt notebook` CLI command to launch a JupyterLab instance connected directly to your Kubetorch services (#1890)
- You can now send Kubetorch functions defined inside local Jupyter notebooks to run on your cluster — no extra setup needed (#1892)
Bug Fixes
- Module and submodule reimporting on the cluster (#1902)
v0.2.4
New Features
Metrics Streaming
Kubetorch now supports real-time metrics streaming during service execution.
While your service runs, you can watch live resource usage directly in your terminal, including:
- CPU utilization (per service or pod)
- Memory consumption (MiB)
- GPU metrics (DCGM-based utilization and memory usage, where relevant)
This feature makes it easier to monitor performance, detect bottlenecks, and verify resource scaling in real time.
Related PRs: #1856, #1867, #1881, #1887
Note: To disable metrics collection, set `metrics.enabled` to `false` in the `values.yaml` of the Helm chart.
Improvements
- Helm chart cleanup of deprecated kubetorch config values (#1865)
- Convert cluster scoped RBAC to namespace scoped (#1864, #1861)
- Logging: updating callable name for clarity (#1876)