Deployment Regression: torch.compile Triton Compilation Failures on TensorRT-LLM with Python 3.12 / PyTorch 2.5+ #2168

@shekharmeena2896

## Issue Summary

Previously working TensorRT-LLM deployments with torch.compile now fail during startup due to Triton compiler errors in Baseten's updated deployment environment. The deployment environment appears to have been upgraded from Python 3.9 / PyTorch 2.3.x to Python 3.12 / PyTorch 2.5+, which introduces breaking changes for models using torch.compile on custom CUDA operations.

## Environment

**Previously working (before ~January 2026):**

- Python: 3.9
- PyTorch: ~2.3.x (inferred)
- Deployment: successful with torch.compile enabled

**Current environment (failing):**

- Python: 3.12 (confirmed in logs: /usr/local/briton/venv/lib/python3.12/)
- PyTorch: 2.5.x / 2.7.x (based on torch==2.7.0 requirement behavior)
- Deployment: fails during torch.compile warmup
## Expected Behavior

The model should deploy successfully with torch.compile enabled, as it did previously. Background compilation (via daemon thread) should complete without errors.

## Actual Behavior

Deployment fails with a Triton compiler InductorError during the compilation warmup phase (traceback truncated as captured):

```
torch._inductor.exc.InductorError: SubprocException: An exception occurred in a subprocess:
Traceback (most recent call last):
  File "/usr/local/briton/venv/lib/python3.12/site-packages/torch/_inductor/compile_worker/subproc_pool.py", line 337, in do_job
    result = job()
             ^^^^^
  File "/usr/local/briton/venv/lib/python3.12/site-packages/torch/_inductor/runtime/compile_tasks.py", line 61, in _worker_compile_triton
    kernel.precompile(warm_cache_only=True)
  File "/usr/local/briton/venv/lib/python3.12/site-packages/torch/_inductor/runtime/triton_heuristics.py", line 267, in precompile
```

This occurs even when compilation is run in a background daemon thread, causing errors during model runtime.

## Steps to Reproduce

1. Create a TensorRT-LLM Truss deployment with custom CUDA operations (e.g., the SNAC audio decoder)
2. Enable torch.compile with dynamic batching: `decoder = torch.compile(model.decoder, dynamic=True)`
3. Perform warmup compilation across multiple batch sizes
4. Deploy to Baseten with default Python/PyTorch versions
5. Observe the Triton compilation failure during startup
## Configuration

`config.yaml`:

```yaml
python_version: py39  # Ignored? Deployed with Python 3.12
requirements:
  - torch==2.7.0  # Resolves to 2.5+ with an incompatible Triton
  - transformers>=4.50.0
  - huggingface-hub>=1.3.0
```

`model.py` snippet:

```python
class SnacModelBatched:
    def __init__(self):
        self.dtype_decoder = torch.float32
        compile_background = True  # run compilation in a daemon thread
        use_compile = True
        self.model = SNAC.from_pretrained("/app/snac_24khz").eval()
        self.model = self.model.to("cuda")

        if use_compile:
            threading.Thread(target=self.compile, daemon=True).start()

    def compile(self):
        decoder = torch.compile(self.model.decoder, dynamic=True)
        # Warmup with various batch sizes
        for bs_size in range(1, 64):
            # ... compilation warmup ...
```
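Note that exceptions raised inside a daemon thread are not propagated to the main thread, which is why the Triton failure only surfaces later at model runtime. A generic pattern for capturing them (stdlib only; `run_compile_in_background` and its arguments are illustrative names, not part of the original code) might look like:

```python
import threading


def run_compile_in_background(compile_fn, errors):
    """Run a compile step in a daemon thread, recording any exception
    into `errors` so the main thread can detect failure and fall back."""

    def worker():
        try:
            compile_fn()
        except Exception as exc:  # Triton/Inductor errors land here
            errors.append(exc)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```

With this shape, the request path can check `errors` before routing through the compiled decoder and stay on the eager path if warmup failed.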

## Additional Context

Secondary issues encountered:

- pynvml deprecation warning (cosmetic, but indicates environment changes)
- huggingface-hub version conflict (transformers requires <1.0, but the environment has 1.3.4)
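For the huggingface-hub conflict, one option (assuming the model code does not depend on hub 1.x-only APIs) is to align the pin with transformers' constraint instead of requesting >=1.3.0:

```yaml
requirements:
  - transformers>=4.50.0
  - huggingface-hub<1.0  # transformers currently requires <1.0
```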
## Workaround

Downgrade to PyTorch 2.4.0 explicitly:

```yaml
python_version: py39
requirements:
  - torch==2.4.0  # Stable Triton compiler
  - transformers>=4.50.0
  - huggingface-hub>=1.3.0
```

This restores the previous behavior and allows torch.compile to work correctly.

## Impact

This is a breaking change for production deployments. Models that previously deployed successfully now fail, requiring code changes or version pinning to restore functionality. There was no deprecation warning or migration guide for this environment change.

## Suggested Fix

1. **Document environment versions:** clearly document base-image Python/PyTorch versions and update schedules
2. **Version stability:** allow explicit Python version control (the current `python_version: py39` appears to be ignored)
3. **Graceful degradation:** catch Triton compilation errors and fall back to uncompiled mode with a warning
4. **Pin dependencies:** default to stable PyTorch versions (e.g., 2.4.x) rather than bleeding-edge releases
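The graceful-degradation suggestion can be sketched as a small wrapper. This is a minimal sketch, not Baseten's implementation; `compile_or_eager` is a hypothetical helper, and the compiler is passed as a parameter so the fallback logic stays framework-agnostic (in a real `model.py` it would be `torch.compile`):

```python
def compile_or_eager(module, compiler, **compile_kwargs):
    """Try to compile `module`; on any compiler error, warn and return
    the original (eager) module instead of failing the deployment."""
    try:
        return compiler(module, **compile_kwargs)
    except Exception as exc:
        print(f"compilation failed ({type(exc).__name__}: {exc}); "
              "falling back to eager execution")
        return module
```

Usage in the snippet above would then be `decoder = compile_or_eager(self.model.decoder, torch.compile, dynamic=True)`, so an InductorError degrades performance instead of breaking startup.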
## Environment Details

- Truss version: latest (as of January 29, 2026)
- Model type: TensorRT-LLM WebSocket endpoint
- GPU: H100 40GB
- Deployment status: previously working, now broken
