Deployment Regression: torch.compile Triton Compilation Failures on TensorRT-LLM with Python 3.12 / PyTorch 2.5+
Issue Summary
Previously working TensorRT-LLM deployments with torch.compile now fail during startup due to Triton compiler errors in Baseten's updated deployment environment. The deployment environment appears to have been upgraded from Python 3.9 / PyTorch 2.3.x to Python 3.12 / PyTorch 2.5+, which introduces breaking changes for models using torch.compile on custom CUDA operations.
Environment
Previously Working (before ~January 2026):
Python: 3.9
PyTorch: ~2.3.x (inferred)
Deployment: Successful with torch.compile enabled
Current Environment (failing):
Python: 3.12 (confirmed in logs: /usr/local/briton/venv/lib/python3.12/)
PyTorch: 2.5.x / 2.7.x (inferred from how the torch==2.7.0 requirement resolves)
Deployment: Fails during torch.compile warmup
Expected Behavior
The model should deploy successfully with torch.compile enabled, as it did previously. Background compilation (via daemon thread) should complete without errors.
Actual Behavior
Deployment fails with Triton compiler InductorError during the compilation warmup phase:
torch._inductor.exc.InductorError: SubprocException: An exception occurred in a subprocess:
Traceback (most recent call last):
File "/usr/local/briton/venv/lib/python3.12/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 337, in do_job
result = job()
^^^^^
File "/usr/local/briton/venv/lib/python3.12/site-packages/torch/_inductor/runtime/compile_tasks.py", line 61, in _worker_compile_triton
kernel.precompile(warm_cache_only=True)
File "/usr/local/briton/venv/lib/python3.12/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 267, in precompile
This occurs even when compilation is run in a background daemon thread: the failure happens in Inductor's compile-worker subprocess, so it surfaces as errors during model runtime rather than staying contained to the warmup thread.
Steps to Reproduce
1. Create a TensorRT-LLM Truss deployment with custom CUDA operations (e.g., the SNAC audio decoder)
2. Enable torch.compile with dynamic batching:
   decoder = torch.compile(model.decoder, dynamic=True)
3. Perform warmup compilation across multiple batch sizes (see the sketch after this list)
4. Deploy to Baseten with the default Python/PyTorch versions
5. Observe the Triton compilation failure during startup
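A minimal repro sketch for steps 2–3; the (batch, latent, frames) shape of the dummy input is a placeholder for illustration, not SNAC's real latent dimensions:

import torch
from snac import SNAC

model = SNAC.from_pretrained("/app/snac_24khz").eval().to("cuda")
decoder = torch.compile(model.decoder, dynamic=True)
with torch.inference_mode():
    for bs in (1, 2, 4, 8, 16, 32, 64):
        z = torch.randn(bs, 768, 64, device="cuda")  # hypothetical latent shape
        decoder(z)  # the first calls trigger Triton kernel compilation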
Configuration
config.yaml:
python_version: py39 # Ignored? Deployed with Python 3.12
requirements:
- torch==2.7.0 # Gets 2.5+ with incompatible Triton
- transformers>=4.50.0
- huggingface-hub>=1.3.0
model.py snippet:
import threading

import torch
from snac import SNAC

class SnacModelBatched:
    def __init__(self):
        self.dtype_decoder = torch.float32
        self.compile_background = True  # run compilation in a daemon thread
        use_compile = True
        self.model = SNAC.from_pretrained("/app/snac_24khz").eval().to("cuda")
        if use_compile:
            threading.Thread(target=self.compile, daemon=True).start()

    def compile(self):
        self.model.decoder = torch.compile(self.model.decoder, dynamic=True)
        # Warmup with various batch sizes
        for bs_size in range(1, 64):
            ...  # compilation warmup elided
Additional Context
Secondary Issues Encountered:
pynvml deprecation warning (cosmetic, but indicates environment changes)
huggingface-hub version conflict (transformers requires <1.0, but the environment has 1.3.4; see the check below)
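The conflict can be confirmed from inside the container with the stdlib importlib.metadata; the exact requirement string will vary with the installed transformers release:

from importlib.metadata import requires, version

print(version("huggingface-hub"))  # 1.3.4 in the current image per this report
print([r for r in requires("transformers") if "huggingface" in r.lower()])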
Workaround
Downgrade to PyTorch 2.4.0 explicitly:
python_version: py39
requirements:
- torch==2.4.0 # Stable Triton compiler
- transformers>=4.50.0
- huggingface-hub>=1.3.0
This restores previous behavior and allows torch.compile to work correctly.
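After redeploying, a quick runtime check confirms whether the pins took effect (the expected values in the comments are assumptions based on the config above, not confirmed output):

import sys

import torch
import triton

print(sys.version)         # expect 3.9.x if python_version: py39 were honored
print(torch.__version__)   # expect 2.4.0 after the workaround
print(triton.__version__)  # the Triton build bundled with this torch wheel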
Impact
This is a breaking change for production deployments. Models that previously deployed successfully now fail, requiring code changes or version pinning to restore functionality. There was no deprecation warning or migration guide for this environment change.
Suggested Fix
1. Document environment versions: clearly document base-image Python/PyTorch versions and update schedules
2. Version stability: allow explicit Python version control (the current python_version: py39 appears to be ignored)
3. Graceful degradation: catch Triton compilation errors and fall back to uncompiled mode with a warning (see the sketch after this list)
4. Pin dependencies: default to stable PyTorch versions (e.g., 2.4.x) rather than bleeding-edge releases
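For the graceful-degradation item, a minimal sketch assuming a caller-supplied warmup callable; because torch.compile is lazy, the Triton failure only surfaces during warmup, which is where the fallback has to happen. The broad except is illustrative and could be narrowed to torch._inductor.exc.InductorError on recent PyTorch:

import logging

import torch

def compile_or_fallback(module, warmup_fn):
    """Try torch.compile plus warmup; return the eager module if compilation fails."""
    try:
        compiled = torch.compile(module, dynamic=True)
        warmup_fn(compiled)  # compilation actually happens here, on the first calls
        return compiled
    except Exception as exc:  # e.g. torch._inductor.exc.InductorError
        logging.warning("torch.compile failed (%s); serving the uncompiled model", exc)
        return module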
Environment Details
Truss version: Latest (as of January 29, 2026)
Model type: TensorRT-LLM WebSocket endpoint
GPU: H100 40GB
Deployment status: Previously working, now broken