Releases: run-house/kubetorch
v0.5.0
New Features
Workload CRD (#2195, #2210, #2219, #2222, #2236, #2238, #2251)
Kubetorch now stores workload information in a Kubernetes Workload CRD. The CRD holds metadata for Kubetorch services, such as labels, paths, and images, and can be viewed and managed through standard Kubernetes tools.
To use the latest release, upgrade your kubetorch installation and ensure the CRD is installed.
BYO Manifest for arbitrary K8s resource types (#2133, #2211, #2237)
In release 0.4.0, we introduced bring-your-own (BYO) manifests for deploying custom Kubernetes manifests while retaining Kubetorch-compatible capabilities. This release extends support from the previously supported Kubetorch resource types to arbitrary Kubernetes resource types. To apply this to a new resource type, specify the `pod_template_path` parameter when creating the Compute object:
```python
compute = kt.Compute.from_manifest(
    manifest=my_custom_manifest,
    pod_template_path="spec.workload.template",
)
```

Kt Apply (#2085, #2257)
`kt apply` is a new CLI command for deploying existing Kubernetes manifests through Kubetorch. It automatically injects the Kubetorch server into the manifest at startup and supports optional Dockerfile-based image setup.
```bash
# Apply a deployment manifest
kt apply deployment.yaml

# Apply with Dockerfile for image setup
kt apply deployment.yaml --dockerfile Dockerfile

# Apply with HTTP proxying enabled
kt apply fastapi-deployment.yaml --port 8000 --health-check /health
```

Code synchronization control (#2228, #2243)
Introduces new parameters `sync_dir`, `remote_dir`, and `remote_import_path` in module initialization for finer-grained control over code syncing and remote imports. Use `sync_dir` to specify a local directory to sync (or set it to `False` to skip module syncing entirely), and use `remote_dir` and `remote_import_path` to point to code that already exists on the container (mutually exclusive with `sync_dir`).
```python
# Sync a specific directory
remote_fn = kt.fn(my_function, sync_dir="./src").to(compute)

# Use code already on container (e.g., from image.copy())
image = kt.Image().copy("./src")
remote_fn = kt.fn(
    my_function,
    sync_dir=False,
    remote_dir="src",
    remote_import_path="mymodule"
).to(compute)
```

Improvements
- Kill processes better (#2135)
- Scale data store (#2185)
- Update image to include Dockerfile contents upon setup and `.to` (#2127, #2194)
- Update `kt describe` to include the ingress if configured (#2191)
- Allow setting kt config values to `None` (#2190)
- Retry connection when hitting a RemoteProtocolError (#2204)
- Split data store helm chart resources (#2216)
- Add configuration for controller uvicorn worker count (#2217)
- Update app http health check params (#2225)
- Hide pod names by default for kt list (#2218)
- Add release namespace for data store deployments (#2248)
- Split up module pointers (#2227)
- Add support for `serialization="none"` (#2231)
- Reduce poll interval for faster service readiness detection (#2232)
- Eagerly load callable at subprocess startup (#2234)
- Add callable_name property for modules (#2252)
- Add more helpful logging for rsync errors (#2255)
- Fix a few user facing type check errors (#2256)
Deprecations
- Deprecate image rsync in favor of copy (#2127)
BC-Breaking
- Require rsync 3.2.0+ instead of falling back to manual directory creation
- Refactor teardown method (#2262)
- The `--force`/`-f` flag no longer deletes without confirmation. `--force` indicates force deleting the resource, but the user will still be prompted for confirmation unless the `--yes`/`-y` flag is provided.
Bug Fixes
- Fix controller connection scaling (#2184)
- Fix noisy websocket error logging during shutdown (#2193)
- Fix rerun errors by appending launch_id (#2223)
- Add markers to support decorating modules (#2235)
- Fix for single-file rsync and dockerfile absolute rsync check (#2241, #2245)
- Fix EADDRINUSE errno check for macOS compatibility (#2265)
v0.4.1
v0.4.0
Kubetorch Controller (#1947)
This release introduces the Kubetorch Controller, a new cluster-side component that eliminates the need for local kubeconfig files and simplifies authentication — the Python client now communicates with your cluster through a centralized controller endpoint.
- Simplified setup: No more managing kubeconfig files or local Kubernetes client dependencies
- Unified API access: All resource types (Deployments, Knative Services, RayClusters, Kubeflow Training Jobs, or other arbitrary CRDs) are managed through a single endpoint
Standard Workflow (#2041, #2076, #2079, #2081)
The familiar Kubetorch experience — declare your compute requirements and let Kubetorch handle everything:
```python
import kubetorch as kt

# Define compute requirements
compute = kt.Compute(cpus="2", memory="4Gi", gpus=1)

# Deploy your function
remote_fn = kt.fn(my_func).to(compute)
result = remote_fn(1, 2)
```

Kubetorch automatically:
- Generates the appropriate Kubernetes manifest (Deployment, Knative Service, etc.)
- Deploys and manages the workload
- Creates service routing
- Handles code syncing and remote execution
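As a rough mental model for the first step (manifest generation), the compute spec maps onto container resource requests. Below is a simplified standalone sketch of that mapping, not Kubetorch's actual generator; `to_resource_requests` is a hypothetical helper introduced here for illustration:

```python
def to_resource_requests(cpus=None, memory=None, gpus=None):
    """Map a high-level compute spec onto Kubernetes container resource requests.

    Hypothetical helper for illustration only, not part of the Kubetorch API.
    """
    requests = {}
    if cpus is not None:
        requests["cpu"] = str(cpus)
    if memory is not None:
        requests["memory"] = memory
    if gpus is not None:
        # NVIDIA GPUs are requested via the nvidia.com/gpu extended resource
        requests["nvidia.com/gpu"] = str(gpus)
    return requests

# Mirrors the compute spec above: kt.Compute(cpus="2", memory="4Gi", gpus=1)
print(to_resource_requests(cpus="2", memory="4Gi", gpus=1))
```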
Bring Your Own Manifest (#1914, #1916, #1935, #1960)
Full support for deploying custom Kubernetes manifests while still leveraging Kubetorch's module system, code syncing, and remote execution capabilities.
Use Cases
(1) Provide your own K8s manifest and let Kubetorch manage the deployment
```python
import kubetorch as kt

# Your custom deployment manifest
my_manifest = {"apiVersion": "apps/v1", "kind": "Deployment", ...}

# KT applies the manifest and creates the routing service
compute = kt.Compute.from_manifest(
    manifest=my_manifest,
    selector={"app": "my-workers"}
)
remote_fn = kt.fn(my_func).to(compute)
result = remote_fn(1, 2)
```

(2) Apply the manifest separately
Point Kubetorch at resources deployed via kubectl (or another tool):
```python
# User already deployed pods with label app=workers via kubectl
# KT only registers the pool and creates routing
compute = kt.Compute(selector={"app": "workers", "team": "ml"})
remote_fn = kt.fn(my_func).to(compute)
result = remote_fn(1, 2)
```

(3) Provide a service endpoint
Use your own ingress, service mesh, or load balancer:
```python
compute = kt.Compute.from_manifest(
    manifest=my_manifest,
    selector={"app": "my-workers"},
    endpoint=kt.Endpoint(url="http://my-svc.my-namespace.svc.cluster.local:8080")
)
```

(4) Custom endpoint selector for routing
Route to a subset of your pods (e.g., only worker pods, not master):
```python
compute = kt.Compute.from_manifest(
    manifest=pytorch_job_manifest,
    selector={"job-name": "my-job"},
    endpoint=kt.Endpoint(selector={"job-name": "my-job", "replica-type": "worker"})  # Route: workers only
)
```

Highlights
- `Compute.from_manifest()`: Create a Compute object from any existing Kubernetes manifest (Deployment, Knative Service, RayCluster, or Kubeflow Training Jobs)
- Kubeflow v1 Training Jobs: Native support for PyTorchJob, TFJob, MXJob, and XGBoostJob with automatic distributed execution
- Property overrides: Modify CPU, memory, replicas, image, env vars, and other settings on imported manifests using standard Compute properties
- Custom service managers: Extensible architecture for adding support for additional Kubernetes resource types
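To make the property-override idea concrete, here is a standalone sketch of patching common fields on a Deployment-style manifest's pod template. This is illustrative only; `apply_overrides` is a hypothetical helper, not the Kubetorch implementation, which exposes these as standard Compute properties:

```python
def apply_overrides(manifest, cpus=None, memory=None, image=None, env=None):
    """Patch common fields on the first container of a Deployment-style manifest.

    Hypothetical helper for illustration only, not part of the Kubetorch API.
    """
    container = manifest["spec"]["template"]["spec"]["containers"][0]
    requests = container.setdefault("resources", {}).setdefault("requests", {})
    if cpus is not None:
        requests["cpu"] = cpus
    if memory is not None:
        requests["memory"] = memory
    if image is not None:
        container["image"] = image
    if env is not None:
        container["env"] = [{"name": k, "value": v} for k, v in env.items()]
    return manifest

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "spec": {"template": {"spec": {"containers": [{"name": "worker", "image": "old:tag"}]}}},
}
patched = apply_overrides(deployment, cpus="2", memory="4Gi", image="repo/worker:1.0")
```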
Improvements
Performance & Reliability
- Add retry logic for transient rsync errors (#2061)
- Increase loki stream limits (#2063)
- Improve service teardown reliability (#2057)
- PDB websocket cleanup (#2056)
- Remove default httpx client timeouts for long running connections (#2122)
- Update TTL support to scrape pod metrics via Prometheus (#2149)
Distributed Training
- Auto-enable distributed execution for training jobs with more than one replica (#2111)
- Cloud agnostic DCGM config (#2114)
Architecture & Refactoring
- Move all module calls into subprocesses (#2069, #2070)
- Consolidate launch time updates to service manager (#2046)
- Remove env vars from manifest and reverse websocket connection from launched pods to Kubetorch controller (#2081)
- Simplify http server launched on Kubetorch workload pods (#2108)
- Replace requests with httpx for improved HTTP client functionality (#2107)
- Remove unused launch params (#2130)
- Remove unused controller client APIs (#2138)
- Remove helm limits (#2146)
Bug Fixes
- Always start Ray on head node where relevant for BYO manifests (#2046)
- Properly stream logs for `kt run` app (#2103, #2121)
- Fix `kt app` liveness check and logging config (#2105)
- Fix noisy log streaming and event loop blocking (#2094)
- Persist serialization type when reloading from an existing manifest (#2110)
- Fix `sys.path` for scripts in subdirectories to enable sibling package imports (#2140)
- Fix log streaming duplication (#2146)
- Fix serialization check (#2148)
v0.3.0
This release introduces the Kubetorch Data Store, a unified cluster data transfer system for seamless data movement between your local machine and Kubernetes pods.
Kubetorch Data Store
The data store provides a unified kt.put() and kt.get() API for both filesystem and GPU data. It solves two critical gaps in Kubernetes for machine learning:
- Fast deployment: Sync code and data to your cluster instantly via rsync - no container rebuilds necessary
- In-cluster data sharing: Peer-to-peer data transfer between pods with automatic caching and discovery - the "object store" functionality that Ray users miss
(#1994, #1989, #1988, #1987, #1985, #1982, #1981, #1979, #1933, #1932, #1929, #1893, #1997)
Highlights
- Unified API: A single `kt.put()`/`kt.get()` interface handles both filesystem data (files/directories via rsync) and GPU data (CUDA tensors via NCCL broadcast)
- No kubeconfig required: The Python client communicates through a centralized controller endpoint
- Peer-to-peer optimization: Intelligent routing that tries pod-to-pod transfers first before falling back to the central store
- GPU tensor & state dict transfers: First-class support for CUDA tensor broadcasting via NCCL, including efficient packed transfers for model state dicts
- Broadcast coordination: `BroadcastWindow` enables coordinated multi-pod transfers with configurable quorum, fanout, and tree-based propagation
Example Usage
```python
import kubetorch as kt

# Filesystem data
kt.put("my-service/weights", src="./model_weights/")
kt.get("my-service/weights", dest="./local_copy/")

# GPU tensors (NCCL broadcast)
kt.put("checkpoint", data=model.state_dict(), broadcast=kt.BroadcastWindow(world_size=2))
kt.get("checkpoint", dest=dest_state_dict, broadcast=kt.BroadcastWindow(world_size=2))

# List and manage keys
kt.ls("my-service/")
kt.rm("my-service/old-checkpoint")
```

See the docs for more info.
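For intuition on the tree-based propagation mentioned above: in each round, every pod that already holds the data can send it to a fixed number of peers, so coverage grows geometrically and n pods are reached in O(log n) rounds. A standalone back-of-the-envelope sketch, not the `BroadcastWindow` implementation:

```python
def broadcast_rounds(world_size, fanout):
    """Rounds needed for tree-based propagation: each holder sends to
    `fanout` peers per round, so the holder count grows (1 + fanout)x per round.

    Illustrative model only, not Kubetorch's actual broadcast logic.
    """
    holders, rounds = 1, 0
    while holders < world_size:
        holders += holders * fanout
        rounds += 1
    return rounds

# e.g. 16 pods with fanout 3: holders go 1 -> 4 -> 16, i.e. 2 rounds
print(broadcast_rounds(16, 3))
```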
Improvements
- Remove queue and scheduler from Compute configuration options (#1968)
- Added `kt teardown` support for training jobs (#1986)
- Updates to metrics streaming output (#1984)
- Remove OTEL as a Helm dependency (#2016, #2022)
- Allow custom annotations for Kubetorch service account configuration (#2009)
Bug Fixes
- Use correct container name when querying logs for Kubetorch services (#1972)
- Prevent events and logs from printing on same line (#2008)
- Async lifecycle management and cleanup (#2028)
- Start Ray on head node if no distributed config provided with BYO manifest (#2046)
- Handle image pull errors when checking for knative service readiness (#2050)
- Control over autoscaler pod eviction behavior for distributed jobs (#2052)
v0.2.9
v0.2.8
Improvements
- Support for loading all pod IPs for distributed workloads (#1937)
- Add app and deployment id labels for easier querying of all Kubetorch deployments (#1940)
- Improve pdb debugging (#1950)
- Remove resource limits for workloads (#1949)
- Set allowed serialization methods using local environment variable (#1951)
v0.2.7
v0.2.6
v0.2.5
New Features
Notebook Integration
- Added `kt notebook` CLI command to launch a JupyterLab instance connected directly to your Kubetorch services (#1890)
- You can now send Kubetorch functions defined inside local Jupyter notebooks to run on your cluster — no extra setup needed (#1892)
Bug Fixes
- Module and submodule reimporting on the cluster (#1902)
v0.2.4
New Features
Metrics Streaming
Kubetorch now supports real-time metrics streaming during service execution.
While your service runs, you can watch live resource usage directly in your terminal, including:
- CPU utilization (per service or pod)
- Memory consumption (MiB)
- GPU metrics (DCGM-based utilization and memory usage, where relevant)
This feature makes it easier to monitor performance, detect bottlenecks, and verify resource scaling in real time.
Related PRs: #1856, #1867, #1881, #1887
Note: To disable metrics collection, set `metrics.enabled` to `false` in the `values.yaml` of the Helm chart.
Improvements
- Helm chart cleanup of deprecated kubetorch config values (#1865)
- Convert cluster scoped RBAC to namespace scoped (#1864, #1861)
- Logging: updating callable name for clarity (#1876)