
Releases: NVIDIA/TensorRT-LLM

Release v1.2.0rc3

21 Nov 08:35
2128f73


Release v1.2.0rc3 (Pre-release)

Announcement Highlights

  • Model Support

    • Add qwen3-next nvfp4 support (#8526)
    • Enable Nemotron H MoE sharding (#8744)
    • Support Latent MOE for Nemotron (#8955)
    • Add TP support for DeepSeek-V3.2 (#8943)
    • Support Glm4MoeForCausalLM (#8256)
    • Add support for disagg in DSv3.2 (#8735)
    • Add tool call parsing fixes and Qwen3 coder parser (#8817)
  • API

    • Add trtllm_ prefix for exposed metrics (#8845)
    • Return logprobs incrementally in torch backend (#8785)
    • Enable n > 1 in OpenAI API with PyTorch backend (#8951)
    • Support json_schema in response_format (#8934) (see the client sketch after this highlights list)
    • Add TRTLLM_NIXL_KVCACHE_BACKEND environment variable for NIXL backend selection (#9075)
    • Prevent negative max_tokens passed into tllm request (#9037)
  • Feature

    • Fuse QK down_proj with indexer K + weight_proj for FP4 ckpt (#8771)
    • Add swapsMmaAb sparseMla kernels (#8913)
    • Implement Deep Research with scaffolding (#8452)
    • Add rope and uk-bgemm overlap for MLA generation (#8495)
    • Add NUMA-aware CPU affinity autoconfig (#8805)
    • Add custom indexer k cache scatter op (#8960)
    • Allow env variable to specify spawn process IPC address (#8922)
    • Implement sampling using FlashInfer.sampling (#8581)
    • Enhance the overlap scheduler for two-model spec decoding (#8706)
    • Update TRTLLM Cutlass MoE kernels with ReLU2 (#9011)
    • Unify MPI & Ray's req/response handling with RPC Client/Server (#8765)
    • Use triton kernels for RocketKV prediction module (#8682)
    • Support accuracy test and install from wheel (#9038)
    • Add tree attention support for blackwell arch (#8975)
    • Add simple optimizations for MTP 2-model (#9176)
    • Enable early exit with overlap scheduler (#8587)
    • Add dynamic draft length in spec decode (stage 1) (#8194)
    • Add bias for FP4 TRT-LLM Gen MoE (#9220)
    • Integrate CuteDSL NVFP4 grouped GEMM (#8880)
    • Add ability to cancel disagg request if KV cache resources are exhausted (#9155)
    • Make factory sharding the default (#9144)
    • Enable simple sharding for latent experts (#9099)
    • Update the indexer topK (#9255)
    • Add fp8 dense for sm120 (#9174)
    • Add specdec to nemotron nas (#8985)
    • Use CUDAGraph to improve the tuning accuracy for AutoTuner (#9089)
    • Add ReLU2 to TRTLLM Cutlass MoE BF16 kernels (#9191)
    • Add pp_partition to customize each rank's layer number (#9003)
    • Enable EPLB for trtllm-gen and cutlass backend (#8886)
    • Add optimized trtllm-gen attention kernels on sm103 (#9081)
    • Add MTP>1 support for DS-v3.2 (#9045)
  • Benchmark

    • Add Qwen3-Next to layer-wise benchmarks (#9065)
    • Refactor benchmark infrastructure (#9207)
    • Print device info in trtllm-bench report (#8584)
    • Use torch.compile to fuse copy + layernorm within the LayerNorm module (#9052)
    • Add torch.compile + multi-stream support for k-cache scatter and weight scaling (#8988)
    • Adjust select_alltoall_method_type (#8950)
  • Documentation

    • Replace the relative links with absolute links in README.md (#8995)
    • Update llama and llama4 example doc (#9048)
    • Update doc/tests/chat_template for nano-v2-vlm (#8840)
    • Add Mixed Precision Context and Generation section to Disagg (#8769)
    • Add DeepSeek-V3.2-Exp document (#9141)
    • Update docs for EPLB (#9166)
    • Update the Flux autodeploy example (#8434)
    • Update DS-R1 example doc (#9231)
    • Update license (#8807)
  • Fix & Infra

    • Fix the logger once key issue and further compress log in AutoTuner (#8873)
    • Fix disagg GPT-OSS test (#8870)
    • Remove PyTorchConfig completely (#8856)
    • Fix boost issue (#8996)
    • Lock onnx version <1.20.0 and remove WAR for TRT 10.13 (#9006)
    • Fix eagle3 accuracy issue on sm120 (#8944)
    • Add customized topk and related unit tests for DSA (#8882)
    • Improve type annotations on ResourceManager.get_resource_manager (#9013)
    • Add sm103 to CutlassFP8RowwiseGemm (#9042)
    • Add context manager to fix FakeTensorProp (#9047)
    • Initialize HF modules in worker_main for models with trust_remote=true (#8931)
    • Use async send_requests_to_next_pp (#9041)
    • Display the GPU memory information in GiB unit (#9070)
    • Add unit tests for TorchSampler batched sampling (#9012)
    • Remove circular dependency between model engine and cuda graph runner (#7572)
    • Fix precision issue due to KV layout mismatch for split/concat kernels (#6917)
    • Clear indexer k cache reference before releasing CUDA memory (#9110)
    • Disable UCC as WAR to MPI allgather issue before NGC PyTorch 25.12 upgrade (#9126)
    • Fix KV cache manager test warnings (#9103)
    • Fix the aux_stream in Llama4MinLatencyFusedMoE (#9035)
    • Avoid torch.compile being applied multiple times (#9135)
    • Upgrade tritonserver DLFW 25.10 (#8929)
    • Make the sliced nvfp4 output contiguous (#9123)
    • Update the attention layers counting for Qwen3-next (#9072)
    • Fix the rank to access all_rank_chunk_size_list when chunked MoE is used (#8723)
    • Fix missing ActivationType issue (#9171)
    • Support enroot/pyxis clusters in multi-node SLURM and enable oci-hsg GB200 in post-merge (#9117)
    • Fix lock file generation script (#9180)
    • Fix a deepseekv3 error when debug mode is on (#9217)
    • Fix DeepSeek V3.2 indexer RoPE (#9232)
    • Exclude number of draft tokens from mMaxSeqLenKv (#9210)
    • Upgrade NIXL to 0.7.1 (#9055)
    • Fix EPLB for DeepSeek-V3.2-Exp (#9245)
    • Log the LLM args for main branch (#9120, #9205)
    • Update TRTLLM MoE cubins, reduce mxfp4 weight padding requirement, and tighten TMA bound (#9025)
    • Upgrade precommit-hooks to v6.0.0 (#9097)
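
The two OpenAI-API items above (n > 1 and json_schema in response_format) can be exercised with the standard openai Python client pointed at an OpenAI-compatible trtllm-serve endpoint. The sketch below is illustrative only: the endpoint URL, model name, and schema are placeholders, not values prescribed by this release.

```python
# Minimal sketch: exercise n > 1 (#8951) and a json_schema response_format (#8934)
# against an OpenAI-compatible trtllm-serve endpoint. URL and model are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

schema = {
    "type": "object",
    "properties": {"answer": {"type": "string"}},
    "required": ["answer"],
}

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model name
    messages=[{"role": "user", "content": "Answer in JSON: what is 2 + 2?"}],
    n=2,  # request two sampled completions
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "answer", "schema": schema},
    },
)

for choice in resp.choices:
    print(choice.index, choice.message.content)
```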

What's Changed

  • [https://nvbugs/5623960][fix] Fix the logger once key issue and further compress log in AutoTuner. by @hyukn in #8873
  • [None][infra] update github token name by @niukuo in #8907
  • [https://nvbugs/5624367][fix] Fix disagg GPT-OSS test by @chuangz0 in #8870
  • [https://nvbugs/5630345][chore] unwaive DS-v32 nvfp4 and fp8 tests by @lfr-0531 in #8887
  • [TRTLLM-7251][test] Get submit eplb slots empty key work by @fredricz-20070104 in #8945
  • [TRTLLM-8768][chore] Fuse QK down_proj with indexer K + weight_proj for FP4 ckpt by @chang-l in #8771
  • [None][feat] add swapsMmaAb sparseMla kernels by @PerkzZheng in #8913
  • [TRTLLM-8201][feat] Nemotron H MoE Sharding by @lucaslie in #8744
  • [#8924][fix] Fix AutoDeploy pattern matcher for torch 2.9 by @Fridah-nv in #8920
  • [https://nvbugs/5606166][fix] AutoDeploy: unwaive test for use tuples for cudagraph shape lookup by @lucaslie in #8957
  • [None][feat] Add qwen3-next nvfp4 support by @JadoTu in #8526
  • [None][feat] Deep Research Implemented with Scaffolding by @Boreas618 in #8452
  • [None][infra] allow to choose repo when generate lock files by @yuanjingx87 in #8659
  • [None][feat] add waive by sm version by @xinhe-nv in #8928
  • [None][feat] Add trtllm_ prefix for exposed metrics by @nv-yilinf in #8845
  • [TRTLLM-8803][feat] Add rope and uk-bgemm overlap for mla generation by @yunruis in #8495
  • [https://nvbugs/5630345] [chore] skip deepseek-v3.2 fp8 kv tests on pre-Blackwell architectures by @lfr-0531 in #8973
  • [None][chore] Use cached model in all ray tests by @shuyixiong in #8962
  • [https://nvbugs/5498478][fix] Fix eagle3 fp8 kv target model + bf16 draft model + chunked prefill by @DylanChen-NV in #8910
  • [TRTLLM-8814][feat] AutoDeploy: Use TRTLLM kernels for FP8 linear by @nvchenghaoz in #8820
  • [https://nvbugs/5527655][feat] Add NUMA-aware CPU affinity autoconfig by @dhansen-nvidia in #8805
  • [None][feat] AutoDeploy: Support Latent MOE for Nemotron by @nvchenghaoz in #8955
  • [None][fix] Fix KV cache clearing with KV Connector API by @jthomson04 in #8750
  • [https://nvbugs/5637012][fix] Bugfix when config is None for MLA by @chang-l in #8978
  • [https://nvbugs/5606136][ci] Remove tests for deprecating models. by @SimengLiu-nv in #8926
  • [None][feat] Return logprobs incrementally in torch backend by @dcaox in #8785
  • [https://nvbugs/5636986][fix] Fix DeepGemmMoe get_buffer calls by @VALLIS-NERIA in #8939
  • [None][fix] Switch AD AllReduce strategy to NCCL by @MrGeva in #8979
  • [https://nvbugs/5633340][fix] kill processes properly after test by @reasonsolo in #8970
  • [TRTLLM-9065][chore] remove PyTorchConfig completely by @QiJune in #8856
  • [https://nvbugs/5508536][fix] Take Over (#8627): Reintroduce: Move stop_criteria to sample_async (#7041) by @stnie in #8794
  • [None][fix] type annotations in fuse_input_embeds by @ixlmar in #8976
  • [None][fix] add missing CLI option in multimodal example by @ixlmar i...

v1.2.0rc2

07 Nov 07:10
3111682


v1.2.0rc2 (Pre-release)

Announcement Highlights

  • Model Support

    • Optimize the routing kernel for DeepSeek V3; add MoE TRTLLM backend support for KimiK2 and Qwen-next (#7761)
    • Support DeepSeek V3.2 with FP8/BF16 KV cache and NVFP4/BF16 KV cache (#8405)
    • Add EVS support for nano-v2-vlm (#8024)
    • Support Qwen3 reasoning and tool parsers (#8000, #8216)
    • Add Nemotron MOE support in AutoDeploy, including FP8 MOE (#8469, #8737, #8599)
  • API

    • Support ignored prompt length via new sampling parameter (#8127)
    • Replace unified attention before export (#8303)
    • Add max_total_draft_tokens (#8366)
    • Pass KvCacheRetentionConfig to torch LlmRequest (#8634)
  • Feature

    • Add cuBLASLt NVFP4 GEMM backend (#7943)
    • Add FP8 rowwise GEMMs for B200 (#8332)
    • Enable low-precision alltoall for CUTLASS/TRTLLMGen (#8675)
    • Integrate MNNVL Throughput and refactor allreduce kernel for TRTLLM MoE (#8728, #8018)
    • Enable RMS norm fusion for Nemotron MOE (#8563)
    • Add base64 video input support (#8458)
  • Fix & Infra

    • Upgrade to DLFW 25.10, PyTorch 2.9.0, and Triton 3.5.0 (#8838)
    • Fix FP8 blockwise GEMM performance with attention DP (#8501)
    • Fix pipeline-parallel bubbles (#8687)
    • Cache the AllReduce wrapper to avoid re-allocation hangs (#8803)
    • Stabilize tests/CI with waives and slurm/CI updates (#8524, #8573, #8749, #8775, #8808, #8896, #8897)
  • Benchmark

    • Add Server-Client Perf Test in pytest for B200/B300 (#7985)
    • Add layer-wise benchmarks and detailed KV cache transfer time breakdown (#8777, #8521)
    • Add longbench v2 for long-context evaluation (#8604)
    • Add benchmark to DeepConf and upload perf results to database (#8776, #8653)
  • Documentation

    • Add LLM-API change principles (#8350)
    • Add docs for torch.compile and piecewise CUDA graph (#8527)
    • Update public docs and add etcd auto-scaling tests (#8602)
    • Clarify perf best practices and supported hardware for GPT-OSS (#8665)
  • Known issue

    • For this pre-release version, install using the specific version identifier: pip3 install tensorrt-llm==1.2.0rc2. Installing with pip3 install tensorrt-llm --pre will result in a broken dependency on onnx==1.20.0rc1. This issue will be resolved in the next release.

What's Changed

  • [None][chore] update test duration by @xinhe-nv in #8377
  • [None][fix] Avoid overwrite of kv_cache_config.max_tokens for VSWA scheme for the KVCacheManager by @eopXD in #8219
  • [TRTLLM-4866] [test] Support waiving unit tests by waives.txt by @VALLIS-NERIA in #8359
  • [TRTLLM-8637][feat] Optimize the routing kernel for DeepseekV3 (MoE CUTLASS backend); Add support for 384 experts (MoE TRTLLM backend) by @ChristinaZ in #7761
  • [https://nvbugs/5542862][fix] Upgrade fmha_v2. by @yuxianq in #8364
  • [TRTLLM-8669][infra] Use artifactory mirror for install python by @ZhanruiSunCh in #8394
  • [TRTLLM-7255][feat] Add iteration log parser script for benchmark log by @yizhang-nv in #6942
  • [None][ci] move some test cases from H100 to A10 by @QiJune in #8449
  • [TRTLLM-8436][feat] batched sampling and top-k logprobs improvements by @ixlmar in #8398
  • [None][feat] Update devcontainer configuration to include additional extensions by @Funatiq in #8369
  • [https://nvbugs/5540752][fix] Support quantized Phi4 MM models by @pamelap-nvidia in #8190
  • [https://nvbugs/5492250][fix] Remove isolated cases and unwaive cases by @HuiGao-NV in #8492
  • [TRTLLM-6055][infra] Slurm Test refactor by @yuanjingx87 in #7176
  • [https://nvbugs/5568676][fix] Remove test waive by @dongfengy in #8437
  • [#8461][feat] AutoDeploy: trtllm-serve bug fix + unit test by @lucaslie in #8462
  • [None] [chore] Add architecture-specific ATTRIBUTIONS files by @venkywonka in #8468
  • [#8272][feat] Enable chunked prefill for SSMs in AutoDeploy by @suyoggupta in #8477
  • [None][feat] Update 3rdparty/DeepGEMM to latest commit by @ruoqianguo in #8488
  • [None][feat] Support kv_cache_reuse for HyperCLOVAX-Vision model by @yechank-nvidia in #7789
  • [TRTLLM-8436][fix] restore list[list[list[int]]] in add_token by @ixlmar in #8502
  • [None][chore] Move submit.sh to python and use yaml configuration by @zerollzeng in #8003
  • [TRTLLM-7287][test] add multimodal chunked_prefill cases by @ruodil in #8011
  • [None][feat] Add alltoall to trtllm-gen MoE backend. by @bobboli in #8481
  • [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #8486
  • [None][ci] rebalance H100 stages by @QiJune in #8491
  • [None][feat] Support Qwen3 reasoning parser by @LinPoly in #8000
  • [None][infra] Add split algorithm for slurm by @EmmaQiaoCh in #8516
  • [TRTLLM-8638][fix] Remove closed bugs by @xinhe-nv in #8478
  • [None][chore] Update feature combination matrix for SWA kv cache reuse by @eopXD in #8529
  • [None][fix] the api_stability unify default values of None and inspect._empty by @Superjomn in #8496
  • [None][infra] Waive failed tests for main 10/21 by @EmmaQiaoCh in #8524
  • [None][doc] Facilitates the integration of the transfer agent by @Shixiaowei02 in #7867
  • [TRTLLM-8160][feat] Add max_total_draft_tokens by @yweng0828 in #8366
  • [None][chore] AutoDeploy: replace HF's deprecated keyword torch_dtype --> dtype by @lucaslie in #8510
  • [TRTLLM-7843][feat] implement disagg cluster auto-scaling by @reasonsolo in #8215
  • [None][feat] AutoDeploy: Add Nemotron MOE support for AutoDeploy by @nvchenghaoz in #8469
  • [TRTLLM-8483][chore] Refine scheduler_config and peft_cache_config in create_py_executor by @leslie-fang25 in #8451
  • [https://nvbugs/5556020][fix] test_disaggregated_serving.py::TestLlama3_1_8BInstruct::test_eagle3 dimension mismatch by @sunnyqgg in #8517
  • [None][doc] Fix the incorrect doc figure by @Shixiaowei02 in #8536
  • [TRTLLM-8260][feat] Add Server-Client Perf Test in pytest for B200 and B300 by @chenfeiz0326 in #7985
  • [None][infra] Let CI continue running other isolation tests when an isolation test get hanging by @EmmaQiaoCh in #8471
  • [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #8554
  • [None][feat] Add vLLM KV Pool support for XQA mla kernel by @qsang-nv in #8560
  • [https://nvbugs/5451272][fix] unwaive the test by @Shixiaowei02 in #8537
  • [None][chore] Bump version to 1.2.0rc2 by @yiqingy0 in #8562
  • [None][doc] Paragraph adjustment and fix statistic by @yunruis in #8568
  • [None][infra] Waive failed cases for main branch 10/22 by @EmmaQiaoCh in #8573
  • [TRTLLM-8785][fix] fix conflicts between periodic-junit and store-durations by @crazydemo in #8518
  • [https://nvbugs/5594753][fix] fix rpc unique addr related issue by @Superjomn in #8419
  • [#8391][fix] check perf by device subtype by @MrGeva in #8428
  • [None][chore] replace print_colored_debug with logger_debug by @Superjomn in #8417
  • [None][fix] generate nanobind stubs for submodules by @ixlmar in #8539
  • [None][fix] fixed cached model path in test by @MrGeva in #8549
  • [None][chore] add precommit hook to remove redundant tab and white space by @xinhe-nv in #8534
  • [https://nvbugs/5429636][feat] Kv transfer timeout by @pcastonguay in #8459
  • [None][fix] Fix EPLB CPU thread NUMA binding by @dongxuy04 in #8579
  • [None][chore] Skip failing import of mxfp4_moe by @brb-nv in #8591
  • [TRTLLM-8754][chore] Refine PyTorchModelEngine with llm args by @leslie-fang25 in #8493
  • [TRTLLM-8682][chore] Remove auto_parallel module by @anish-shanbhag in #8329
  • [None][feat] Update TRTLLM MoE MxFP4 cubins; autotune tileN...

v1.2.0rc1

22 Oct 06:56
796891b


v1.2.0rc1 (Pre-release)

Announcement Highlights

  • Model Support

    • Add GPT-OSS Sm120/Sm121 support (#7937)
    • Fix: Disable DeepGEMM for Qwen3 MoE Attention layers (#8087)
    • Fix: Update is_post_quant_all2all_supported for MNNVL (#8355)
    • Support quantized model for nano-v2-vlm (#8304)
    • Fix: Address illegal access when scale is not provided in Llama3/4 (#7960)
    • Fix: Correct Qwen2.5-VL device_path error (#8057)
    • Add post-merge test for Seed-OSS-36B-Instruct (#8321)
    • Fix: Correct get_num_tokens_per_image for nano-v2-vlm (#8425)
    • Add Kimi multi-nodes case (#8025)
  • API

    • Refine sampling strategy selection (BREAKING CHANGE) (#8132)
    • Add cache_salt in LLM.generate (#8317) (see the sketch after this highlights list)
    • Add input tensor pre-hook function API for tuning (#6924)
    • Add additional model outputs (#7206)
    • Clean create_py_executor API (#8412)
  • Benchmark

    • Add request timing breakdown option in benchmark_serving (#8128)
    • Fix bench_serving import error (#8296)
    • Update disagg benchmark configs (#8289)
    • Add multimodal data to dummy requests during memory profiling (#7539)
    • Save runtime report periodically (#8312)
    • Resolve sampling defaults in OpenAI API backend (#8121)
  • Feature

    • Add new orchestrator type: Ray (#7520)
    • Implement HTTP disagg-cluster management (#7869)
    • Add PDL support for more kernels (#7977)
    • Enable rejection sampling for CDL (#7731)
    • Add torch.compile support for CUDA Core GEMM op (#8261)
    • Support block-sparse attention in trtllm gen FMHA kernels (#8301)
    • Support SWA KV cache reuse OOW block detach (#7922)
    • Add factory TP sharding of quantized models (#8123)
    • Turn off speculative decode based on acceptance-length threshold (#7283)
    • Enable VLM subgraphs and CUDA graph/compile in AutoDeploy (#8203)
    • Add sparse attention framework and RocketKV support (#8086)
    • Implement etcd storage for disagg cluster (#8210)
    • Export scale factor properly for W4A8/NVFP4/FP8 (#8180)
    • Reuse CUDA graph memory pool in normal forward flow (#8095)
    • Revise TileN-related routing calculation in MoE backend (#8148)
    • Develop DeepConf (#8362)
    • Support per-expert pre-quant scale factor for W4A8 AWQ MoE (PyTorch) (#7286)
    • Support cached tokens for OpenAI server (#7637)
    • Add fmha_v2 kernel for head_dim=80 and SM100 to support VLM (#8392)
    • Add topological graph helpers (#8457)
    • Enable CUDA graph support for KvConnectorWorker API (#8275)
    • Add chunked prefill support in AutoDeploy (#8158)
    • Set nixl as default cache transceiver backend (#7926)
    • Enable FP8 ContextMLA on GB300 (#8080)
    • Skip unnecessary CUDA graph capture (#8050)
    • Use device tensor index for MTP (#8062)
  • Documentation

    • Publish blog: Scaling Expert Parallelism in TensorRT LLM (Part 3) (#8323)
    • Refine deployment guide by renaming TRT-LLM to TensorRT LLM (#8214)
    • Document the role of d2t (#8174)
    • Add Qwen3-next doc and L0 test case (#8288)
    • Update AutoDeploy README: expert section on YAML configuration (#8370)
    • Update TPOT/ITL docs (#8378)
    • Add Ray orchestrator initial doc (#8373)
    • Add documentation for CUDA 12.9 (#8411)
    • Combine feature combination matrix documents (#8442)
    • Add ATTRIBUTIONS-{CPP,Python}.md and update wheels setup (#8438)
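
The cache_salt addition above is sketched below. This is a hypothetical illustration, assuming LLM.generate accepts cache_salt as a keyword argument as the entry suggests; the model and the exact signature are placeholders and may differ from the released API.

```python
# Hypothetical sketch of KV-cache salting via LLM.generate (#8317).
# Assumes `cache_salt` is a per-call keyword argument; requests issued with
# different salts should not share reused KV-cache blocks.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # placeholder model
params = SamplingParams(max_tokens=32)

# Same prompt, different tenants: distinct salts keep cached KV blocks separate.
out_a = llm.generate(["Summarize the internal report."], params, cache_salt="tenant-a")
out_b = llm.generate(["Summarize the internal report."], params, cache_salt="tenant-b")
print(out_a[0].outputs[0].text)
print(out_b[0].outputs[0].text)
```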

What's Changed

  • [None][feat] AutoDeploy: Nemotron-H accuracy test by @lucaslie in #8133
  • [None][feat] AutoDeploy: graph/module inputs with kwargs instead of args by @lucaslie in #8137
  • [TRTLLM-7349][feat] Adding new orchestrator type -- ray by @joyang-nv in #7520
  • [None][autodeploy] small refactors on attention matching by @Fridah-nv in #8079
  • [#5255][autodeploy] Update FuseAllreduceResidualRMSNorm to use pattern matcher utility; remove fuse_collective by @Fridah-nv in #7545
  • [TRTLLM-8189][chore] enhance GenerationExecutor with RPC (part1) by @Superjomn in #5543
  • [https://nvbugs/5521949][fix] Re-enable test_bielik_11b_v2_2_instruct_multi_lora, fix its API use with pytorch flow LoRA by @amitz-nv in #8146
  • [None][fix] Adding docker folder to Dockerfile by @pcastonguay in #8138
  • [None][chore] fix llmargs conflict by @Superjomn in #8152
  • [TRTLLM-8413][chore] resolve sampling defaults in OpenAI API backend by @ixlmar in #8121
  • [None][chore] AutoDeploy: clean up accuracy test configs by @lucaslie in #8134
  • [None][fix] Eagle: Attention DP by @IzzyPutterman in #7939
  • [None][feat] GPT-OSS Sm120/Sm121 Support by @farazkh80 in #7937
  • [None][chore] Increase operations-per-run to 1000 for stale action by @karljang in #8162
  • [None] [test] Add B300 cases to CI by @VALLIS-NERIA in #8056
  • [None][infra] Skip failed cases for main by @EmmaQiaoCh in #8176
  • [None][fix] Fix MTP illegal memory access by @mikeiovine in #8161
  • [https://nvbugs/5488576][fix] Propagate disable_finalize_fusion config flag in WIDEEP MoE backend by @sklevtsov-nvidia in #8141
  • [None][test] add test-model-suites option in integration conftest.py by @ruodil in #8016
  • [https://nvbugs/5455140][fix] unwaive tests related to GB200 OOM by @lancelly in #8159
  • [https://nvbugs/5550283][fix] update test case to the latest MoE API by @xxi-nv in #8165
  • [TRTLLM-8414][chore] BREAKING CHANGE: refine sampling strategy selection by @ixlmar in #8132
  • [None][chore] Waive some tests failing on main post merge by @brb-nv in #8186
  • [https://nvbugs/5541545][fix] Remove test_llama4 by @mikeiovine in #8031
  • [https://nvbugs/5522746][fix] unwaive tests caused by node issues after rebooting by @lancelly in #8193
  • [None][fix] Restrict tinygemm use to certain SMs by @dongfengy in #8182
  • [None][ci] move some llama4 test cases to pre merge by @QiJune in #8189
  • [TRTLLM-7846][feat] Http disagg-cluster management implementation by @reasonsolo in #7869
  • [https://nvbugs/5516666][fix] unwaive some Qwen3 CI tests by @byshiue in #8130
  • [None][doc] Refine deployment guide by renaming TRT-LLM to TensorRT L… by @nv-guomingz in #8214
  • [None][ci] pin flashinfer-python version by @QiJune in #8217
  • [None][chore] Restore asserts in pytorch flow LoRA tests by @amitz-nv in #8227
  • [None][infra] Waive failed tests on main 10/09 by @EmmaQiaoCh in #8230
  • [TRTLLM-7769][chore] document the role of 'd2t' by @ixlmar in #8174
  • [https://nvbugs/5501820][fix] Add requirements for numba-cuda version to WAR mem corruption by @pengbowang-nv in #7992
  • [None][fix] Enable FP8 ContextMLA on GB300 by @longlee0622 in #8080
  • [None][chore] Remove closed bugs by @xinhe-nv in #8151
  • [None][chore] Print log with time for starting to load safetensor weights by @HuiGao-NV in #8218
  • [None][fix] Add failed cases into waives.txt by @xinhe-nv in #8229
  • [https://nvbugs/5547416][fix] unwaive no_cache test by @byshiue in #8213
  • [None][fix] add gc for test fixture by @xinhe-nv in #8220
  • [https://nvbugs/5558167][fix] update canceled_req_ids correctly for canceled requests by @QiJune in #8207
  • [None][fix] Add Lock to protect mReqeustToSession by @chuangz0 in #8085
  • [None][feat] Add request timing breakdown option in benchmark_serving by @nv-yilinf in #8128
  • [TRTLLM-6748][feat] add PDL support for more kernels by @dc3671 in #7977
  • [https://nvbugs/5534705][fix] Skip unnecessary CUDA graph capture by @ziyixiong-nv in #8050
  • [None][chore] Waive failing pre-merge test on main by @brb-nv in #8282
  • [None][infra] Remove WAR code for GH200 node by @ZhanruiSunCh in #8266
  • [TRTLLM-7384][feat] enable rejection sampling for CDL by @kris1025 in #7731
  • [None][infra] Skip failed cases for main branch by @EmmaQiaoCh in #8293
  • [None][fix] AD test_trtllm_bench to use small model config and skip loading weights by @MrGeva in #8149
  • [https://nvbugs/5510879][fix] Fix pytorch & TRT-python flows fused LoRA adapter modules weight split with TP>1 by @amitz-nv in https://gith...

v1.2.0rc0.post1

14 Oct 09:17
6632b40


v1.2.0rc0.post1 (Pre-release)

Announcement Highlights

  • Model Support

    • Support Qwen3 next (#7892)
  • API

    • Return topk logprobs in torch backend (#7976)
    • Add chunked return_generation_logits logic (#7831)
  • Benchmark

    • Lock gpu clocks in test_perf.py to reliably detect perf regressions (#8099)
    • Fix the KV cache size parsing in the test_perf.py AD backend (#8092)
    • Improve perf_metrics endpoint functionality (#8005)
  • Feature

    • AutoDeploy: Linear Attention Support (SSM + causal_conv + Bamba + Nemotron-H) (#8068)
    • Add W4A8 NVFP4 FP8 fused MoE (#7968)
    • Add ModelOPT INT4 awq fake quant support in AutoDeploy (#7770)
    • Save state first pass for speculative decoding (#7012)
    • Executor changes to support helix parallelism (#7972)
    • Support CUDA graph for DeepEP (#7514)
    • Integrate tinygemm2 for gpt-oss (#7916)
    • Support for cancelling requests with disaggregation (#8114)
    • Update TRT-LLM Gen MoE kernels (#7970)
    • AutoDeploy: dive deeper into token generation bugs + enable_block_reuse (#8108)
    • AutoDeploy: add autotuning when capturing CUDA graphs (#8120)
    • AutoDeploy: compiler backends based on nn.Module (#8126)
    • Add heuristics for lm head tp size when enable_lm_head_tp_in_adp=True (#7891)
    • Improve batched sampling perf for contiguous batches (#7908)
  • Documentation

    • Add more description on EXAONE usage (#8089)
  • Fix & Infra

    • Fix CUDA graph for Qwen2.5-VL (#8047)
    • Refine qwen3-next implementation (#8064)
    • Patch incorrect starcoder TP config (#8118)
    • Fix Qwen3 FP8 per-tensor when requesting TRTLLM-GEN MoE backend (#8075)
    • Fix TRT-python multi LoRA TP=2 test arguments (#8059)
    • Fix the non-determinism issue in the mm_encoder test (#8033)
    • Checking connection to etcd server in unit test (#8006)
    • Add MNNVL AlltoAll tests to pre-merge (#7466)
    • Add test cases into QA test list (#8081)
    • Avoid downloading Tiny llama from HF (#8071)
    • Fix OOM issue when dp padding is enabled (#8052)
    • Fix unwaiving disagg pp tests (#8069)
    • Fix shape propagation after TP sharding (#7912)
    • Fix patchelf version issue (#8112)
    • Fix device id assignment for some vision models (#8070)
    • Do not explicitly pass temperature=0 to select greedy sampling (#8110)
    • Fix access to new tokens in sampler (#7958)
    • Adding install_tensorrt.sh script to pip wheel (#8116)
    • Fix flaky unit test for dynamic spec decoding (#8129)
    • Minor cleanup and improvements (#7619)
    • Reserve an extra slot for padded batch (#7998)
    • Fix MTP 2-model (#8115)
    • Add LoRa Torch tests for the latest NIM model list (#6806)

What's Changed

  • [TRTLLM-7728][perf] improve batched sampling perf for contiguous batches by @ixlmar in #7908
  • [None][feat] Support Qwen3 next by @byshiue in #7892
  • [TRTLLM-8269][test] do not explicitly pass temperature=0 to select greedy sampling by @ixlmar in #7909
  • [None][fix] Fix TRT-python multi LoRA TP=2 test arguments by @amitz-nv in #8059
  • [https://nvbugs/5542867][fix] Fix the non-determinism issue in the mm_encoder test by @chang-l in #8033
  • [https://nvbugs/5538098][fix] Checking connection to etcd server in unit test by @pcastonguay in #8006
  • [TRTLLM-6741][fix] Add heuristics for lm head tp size when enable_lm_head_tp_in_adp=True by @Njuapp in #7891
  • [None][feat] Return topk logprobs in torch backend by @dcaox in #7976
  • [#4593][feat] AutoDeploy: Linear Attention Support (SSM + causal_conv + Bamba + Nemotron-H) by @lucaslie in #8068
  • [None] [test] Add MNNVL AlltoAll tests to pre-merge by @kaiyux in #7466
  • [TRTLLM-6239][feat] add test cases into QA test list by @xinhe-nv in #8081
  • [None][fix] Fix CUDA graph for Qwen2.5-VL by @yechank-nvidia in #8047
  • [None][chore] Bump version to 1.2.0rc1 by @yiqingy0 in #8097
  • [https://nvbugs/5547414][fix] avoid downloading Tiny llama from HF by @Tabrizian in #8071
  • [None][chore] Refine qwen3-next implementation. by @nv-guomingz in #8064
  • [TRTLLM-8269][fix] Revert "do not explicitly pass temperature=0 to select greedy sampling" by @ixlmar in #8103
  • [None][chore] Revert MNNVL alltoall MR by @brb-nv in #8106
  • [None][fix] : Fix OOM issue when dp padding is enabled by @peaceh-nv in #8052
  • [None][doc] Add more description on EXAONE example by @yechank-nvidia in #8089
  • [None][infra] Skip failed tests in post-merge for main by @EmmaQiaoCh in #8102
  • [https://nvbugs/5434320][fix] fix: Unwaiving disagg pp tests by @pcastonguay in #8069
  • [OMNIML-2336][feat] add W4A8 NVFP4 FP8 fused moe by @sychen52 in #7968
  • [TRTLLM-6342][bug] Fix shape propagation after TP sharding by @greg-kwasniewski1 in #7912
  • [TRTLLM-8031][feat] Add chunked return_generation_logits logic by @yibinl-nvidia in #7831
  • [#5860][feat] Add ModelOPT INT4 awq fake quant support in AutoDeploy by @Fridah-nv in #7770
  • [None][fix] fix patchelf version issue by @bo-nv in #8112
  • [None][feat] Save state first pass by @IzzyPutterman in #7012
  • [TRTLLM-7733][feat] Executor changes to support helix parallelism by @brb-nv in #7972
  • [https://nvbugs/5549081][fix] Fix device id assignment for some vision models by @chang-l in #8070
  • [#7588][feat] lock gpu clocks in test_perf.py to reliably detect perf regressions by @MrGeva in #8099
  • [TRTLLM-8269][test] do not explicitly pass temperature=0 to select greedy sampling by @ixlmar in #8110
  • [https://nvbugs/5556020][chore] waive test_eagle3 by @hchings in #8119
  • [TRTLLM-6589][feat] Support CUDA graph for DeepEP by @yifeizhang-c in #7514
  • [TRTLLM-7775][feat] Integrate tinygemm2 for gpt-oss by @dongfengy in #7916
  • [None][feat] Support for cancelling requests with disaggregation by @pcastonguay in #8114
  • [None][fix] Fix access to new tokens in sampler. by @dcampora in #7958
  • [None][chore] Adding install_tensorrt.sh script to pip wheel by @pcastonguay in #8116
  • [#7588][fix] fixed the kv cache size parsing in test_perf.py AD backend by @MrGeva in #8092
  • [TRTLLM-6342][bug] Patched incorrect starcoder tp config by @greg-kwasniewski1 in #8118
  • [None][feat] perf_metrics endpoint functionality improvement by @nv-yilinf in #8005
  • [None][feat] Update TRT-LLM Gen MoE kernels by @nekorobov in #7970
  • [https://nvbugs/5548098][fix] Fix flaky unit test for dynamic spec d… by @hchings in #8129
  • [None] [refactor] Minor cleanup and improvements by @Funatiq in #7619
  • [None][feat] AutoDeploy: dive deeper into token generation bugs + enable_block_reuse by @lucaslie in #8108
  • [None][fix] Fix Qwen3 FP8 per-tensor when requesting TRTLLM-GEN MoE backend by @achartier in #8075
  • [None][feat] AutoDeploy add autotuning when capturing cudagraphs by @suyoggupta in #8120
  • [https://nvbugs/5537878][fix] Reserve an extra slot for padded batch by @ziyixiong-nv in #7998
  • [None][feat] AutoDeploy: compiler backends based on nn.Module by @lucaslie in #8126
  • [None][fix] Fix MTP 2-model by @mikeiovine in #8115
  • [TRTLLM-6496][feat] Add LoRa Torch tests for the latest NIM model list by @moraxu in #6806
  • [None][chore] Bump version to 1.2.0rc0.post1 by @yiqingy0 in #8306

Full Changelog: v1.2.0rc0...v1.2.0rc0.post1

v1.2.0rc0

30 Sep 07:55
560ded5


v1.2.0rc0 (Pre-release)

Announcement Highlights

  • Model Support
    • Support nano_v2_vlm in pytorch backend (#7207)
    • Add Tencent HunYuanDenseV1 model support (#7081)
    • Support Seed-OSS model in pytorch backend (#7496)
    • GPT-OSS MXFP4 support (#7451)
  • API
    • Support new structural tag API (upgrade XGrammar to 0.1.25) (#7893)
    • Enable regex and EBNF grammar in trtllm-serve (#7925)
    • Optionally disable server GC and worker GC (#7995)
    • Add serialization/deserialization options for AutoTuner profiling cache (#7738)
    • Make low_precision_combine an LLM arg (cherry-picked from #7598) (#7898)
  • Benchmark
    • Add gpt-oss serve benchmark tests (#7638)
    • Exit as early as possible and propagate exit status correctly for multi-node testing (#7739)
    • Add gpt oss model for trtllm perf test (#7328)
    • Add generation logits case for llama3 (#7759)
    • Fix model issue for disagg serving (#7785)
    • Add deepseek r1/v3 model with chunked prefill cases (#7124)
    • Add accuracy benchmark in stress test (#7561)
    • Add NoSmem epilogue schedule and dynamic cluster shape for sm10x group gemm (#7757)
    • Rename llm_perf_full to llm_perf_core and add missing cases (#7899)
    • Update benchmark script (#7860)
    • Add multi-nodes test for disagg-serving (#7470)
    • Update llm_models_root to improve path handling on BareMetal environment (#7876)
    • Add DS-R1/Qwen3 test cases for RTX 6000 (#7662)
    • Add NIM perf test cases (#7924)
    • Fix the tactic sorting in TrtllmGenBatchedGemmRunner::getValidConfigIndices (#7419)
    • Improve the failure message for accuracy test suite (#7994)
    • Update get_sysinfo.py to avoid UnboundLocalError (#7982)
    • Update disagg gen-only benchmark. (#7917)
  • Feature
    • Phi4-mm image modality inference optimization (#7918)
    • Add NVFP4 x FP8 moe kernels (#7821)
    • Enable KV cache reuse and chunked prefill for mistral3.1 (#7628)
    • Enable two-model spec dec for MTP Eagle (#7001)
    • Support EPLB in Qwen3 MoE (#7443)
    • Eagle3 cuda graph support for the first draft model inference (#7363)
    • Enable run_post_quant_allgather for MoE TRTLLM backend (#6794)
    • Enable gpt oss on DGX H100. (#6775)
    • Add gpt-oss chunked prefill tests (#7779)
    • Eagle, use last hidden post norm (#7546)
    • Optimize Qwen2/2.5-VL performance (#7250)
    • Support kvcache reuse and chunk prefill for phi4mm (#7723)
    • Support attention dp for qwen3 dense model (#7618)
    • AutoDeploy: fix memory leak in fuse_moe (#7844)
    • Enable overlap scheduler for two-model spec decoding (#7651)
    • Add support of CUDA13 and sm103 devices (#7568)
    • Add Cute DSL nvfp4 linear op (#7632)
    • Enable LM tp for MTP, under attention dp case (cherry-pick #7128) (#7571)
    • Add an example of KV cache host offloading (#7767)
    • Helix: make softmax stats pointer available to attention gen (#6865)
    • AutoDeploy: graph-less transformers mode for HF (#7635)
    • Cherry-pick DeepGEMM related commits from release/1.1.0rc2 (#7716)
    • Add swapab, tileN64, cga sync support for cute dsl nvfp4 gemm (#7764)
    • FP8 Context MLA integration (Cherry-pick #6059 from release/1.1.0rc2) (#7610)
    • Update CUTLASS to 4.2 and enable SM103 group gemm (#7832)
    • Cherry-pick fix to reuse pytorch memory segments occupied by cudagraph (#7747)
    • Helix: add custom position ids to MLA kernels (#6904)
    • Support for partial sharding from factory (#7393)
    • KV cache transmission in disagg with CP on gen side (#7624)
    • Support fp8 block wide ep (cherry-picked from #7423) (#7712)
    • E-PD Disagg Support via llmapi (3/N) (#7577)
    • Add batch waiting when scheduling (#7416)
    • Use list instead of torch tensor for new tokens in update requests (#7730)
    • Support multi-threaded tokenizers for trtllm-serve (cherry-pick) (#7776)
    • Support JIT mha.cu for SPEC_DEC in runtime (#6078)
    • Batched sampling by strategy (supersedes enable_mixed_sampler, cf. TRTLLM-7156) (#7294)
    • Enable prompt_logprobs in pytorch backend (#7580) (see the logprobs sketch after this highlights list)
    • Support SWA KV cache reuse (#6768)
    • Return topk logprobs in torch backend (#7756)
    • CapturedGraph to support max_batch_size > max(cuda_graph_batch_sizes) (#7888)
    • Revert " Return topk logprobs in torch backend (#7756)" (#7969)
    • DeepEP LL fp8 dispatch/combine (#7927)
    • Helix: add alltoall op (#6815)
    • Optimize kv cache transfer TEP (#7613)
    • Add environment variable to adjust block pool allocation ratio under kv cache manager (#7923)
    • Add a standalone buffer cache class and reuse buffers between cudagraph and no-graph flow (#7669)
    • Add static tree sampling and verification (#7161)
    • Add support for KVCache transfer from KVCache reuse path (#6348)
    • Add AutoDeploy backend support to test_perf.py (#7588)
    • Speed up concat k and copy k_nope in context phase using torch.compile (#8044)
  • Documentation
    • Fix the link in the doc (#7713)
    • Clean the doc folder and move the outdated docs into lega… (#7729)
    • Add doc for KV cache salting support (#7772)
    • Fix section header of llm_kv_cache_offloading example (#7795)
    • Update Documentation link to point to docs instead of docs source code (#6495)
    • Cherry-pick deployment guide update from 1.1.0rc2 branch to main branch (#7774)
    • Tech blog: Combining Guided Decoding and Speculative Decoding: Making CPU and GPU Cooperate Seamlessly (#7864)
    • Update tech blog12 (#7884)
    • Add known issues to llmapi doc (#7560)
    • Add blackwell information into support matrix (#6740)
    • Fix an invalid link and a typo. (#7634)
    • Use hash id for external link (#7641)
    • Add labels description note into llm api section (#7696)
    • Enhance api reference doc by labeling stable APIs (#7751)
    • Add 1.0 release notes (#7605)
    • Scaffolding tech blog part one (#7835)
    • Update docker cmd in quick start guide and trtllm-serve … (#7787)
    • Replace the main in the examples' link with commit id. (#7837)
    • Rename TensorRT-LLM to TensorRT LLM for homepage and the … (#7850)
    • Add a guide for modifying APIs (#7866)
    • Update Perf-Overview.md for release/1.0 (#7848)
    • Add stable label to all the un-labelled arguments in LLM class (#7863)
    • Fix invalid links in perf benchmarking. (#7933)
    • Add Llama PP known issue to release note (#7959)
    • Add acknowledgements in scaffolding tech blog (#7983)
    • Add scaffolding tech blog to cover (#8021)
    • Refine perf overview.md and correct the error link in per… (#8035)
    • Scaffolding tech blog fix a typo (#8042)
    • Document hang issue caused by UnpicklingError (#8049)
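
Several entries above touch the logprobs surface of the PyTorch backend (prompt_logprobs, top-k logprobs). The sketch below shows how these are typically requested through the LLM API; the parameter names follow the entries above, while the model and the exact output layout are assumptions.

```python
# Minimal sketch of requesting logprobs through the LLM API, assuming
# SamplingParams exposes `logprobs` (top-k for generated tokens) and
# `prompt_logprobs`, as the entries above indicate.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # placeholder model
params = SamplingParams(max_tokens=16, logprobs=2, prompt_logprobs=1)

result = llm.generate(["The capital of France is"], params)[0]
print(result.outputs[0].text)
print(result.outputs[0].logprobs)  # per-token top-k logprobs, if populated
```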

What's Changed

  • [None][feat] Eagle, use last hidden post norm by @IzzyPutterman in #7546
  • [None][infra] AutoDeploy: codeowners for autodeploy unit tests by @lucaslie in #7743
  • [TRTLLM-6668][feat] Enable overlap scheduler for two-model spec decoding by @ziyixiong-nv in #7651
  • [None][ci] move qwen3 tests from GB200 to B200 by @QiJune in #7733
  • [None][feat] support attention dp for qwen3 dense model by @Nekofish-L in #7618
  • [None][doc] Fix the link in the doc by @Shixiaowei02 in #7713
  • [TRTLLM-4629] [feat] Add support of CUDA13 and sm103 devices by @VALLIS-NERIA in #7568
  • [TRTLLM-6295][test] Exit as early as possible and propagate exit status correctly for multi-node testing by @chzblych in #7739
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #7735
  • [None][fix] Ensure that the W4A8 custom input scale remains aligned across all ranks by @yilin-void in #7614
  • [None][chore] Fix error when running trtllm-bench without cuda graph. by @bobboli in #7725
  • [None][doc] Clean the doc folder and move the outdated docs into lega… by @nv-guomingz in #7729
  • [TRTLLM-6898][feat] Add Cute DSL nvfp4 linear op by @limin2021 in #7632
  • [None] [chore] cherry pick changes on slurm scripts from release/1.1.0rc2 by @kaiyux in #7750
  • [https://nvbugs/5503529][fix] Change test_llmapi_example_multilora to get adapters path from cmd line to avoid downloading from HF by @amitz-nv in #7740
  • [TRTLLM-7070][feat] add gpt-oss serve benchmark tests by @xinhe-nv in #7638
  • [None][fix] waive hang tests on main by @xinhe-nv in #7720
  • [https://nvbugs/5471106][fix] Remove the waivers by @ziyixiong-nv in #7711
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #7746
  • Revert "[None][feat] support attention dp for qwen3 dense model" by @byshiue in #7765
  • [TRTLLM-8044][refactor] Rename data -> cache for cacheTransceiver by @Tabrizian in #7659
  • [None][chore] AutoDeploy: neat disablement of transforms in pipeline by @lucaslie in #7736
  • [None][chore] Remove unused get_quant_scales methods by @achartier in #7687
  • [None][infra] add nspect allow list for false positive secrets by @yuanjingx87 in #5797
  • [TRTLLM-7398][doc] Add doc for KV cache salting support by @chang-l in #7772
  • [None][infra] Update CI allowlist 2025-09-16 ...

v1.0.0

24 Sep 12:53
ae8270b


TensorRT LLM Release 1.0

TensorRT LLM 1.0 brings two major changes: the PyTorch-based architecture is now stable and the default experience, and the LLM API is now stable. For more details on new developments in 1.0, see below.

Key Features and Enhancements

  • Model Support

    • Add Mistral3.1 VLM model support
    • Add TensorRT-Engine Qwen3 (dense) model support
    • Add phi-4-multimodal model support
    • Add EXAONE 4.0 model support
    • Add Qwen3 MoE support to TensorRT backend
  • Features

    • Add support for sm121
    • Add LoRA support for Gemma3
    • Support PyTorch LoRA adapter eviction
    • Add LoRA support for PyTorch backend in trtllm-serve
    • Add support of scheduling attention dp request
    • Remove padding of FusedMoE in attention DP
    • Support torch compile for attention dp
    • Add KV events support for sliding window attention
    • Add TRTLLM MoE nvfp4 cubins for mid-high concurrency; attention_dp for TRTLLM MoE
    • Add Piecewise CUDA Graph support for MLA
    • Support multiCtasKvMode for high-throughput MLA kernels
    • Enable kvcache to be reused during request generation
    • Add ADP schedule balance optimization
    • Add chunked prefill support for MLA (Blackwell)
    • Enable Multi-block mode for Hopper spec dec XQA kernel
    • Add vLLM KV Pool support for XQA kernel
    • Allow sending more than 2GiB through MPI by using mpi4py.util.pkl5
    • Add support for fused gate_up_proj scales for FP8 blockwise
    • Support FP8 row-wise dense GEMM in torch flow
    • Enable fp8 SwiGLU to minimize host overhead
    • Add Deepseek R1 FP8 Support on Blackwell
    • Add support for MXFP8xMXFP4 in pytorch
    • Support nvfp4 model and fp8 kv cache for MLA chunked prefill (Blackwell)
    • Opensource MOE MXFP8-MXFP4 implementation
    • Add support for Modelopt fp8_pb_wo quantization scheme
    • Support deepEP fp4 post quant all2all dispatch
    • Fuse w4a8 moe pre-quant scale on Hopper
    • Support Weight-Only-Quantization in PyTorch Workflow
    • Add support for per expert activation scaling factors
    • Add ReDrafter support for Qwen
    • Enable CUDA Graph for Nemotron-H
    • Add support for YARN in NemotronNAS models
    • Switch to internal version of MMProjector in Gemma3
    • Disable add special tokens for Llama3.3 70B
    • Auto-enable ngram with concurrency <= 32
    • Support turning on/off spec decoding dynamically
    • Support structural tag in C++ runtime and upgrade xgrammar to 0.1.21
    • Add support for external multimodal embeddings
    • Add support for disaggregation with pp with pytorch backend
    • Add status tags to LLM API reference
    • Support JSON Schema in OpenAI-Compatible API
    • Support chunked prefill on spec decode 2 model
    • Add KV cache reuse support for multimodal models
    • Support nanobind bindings
    • Add support for two-model engine KV cache reuse
    • Add Eagle-3 support for qwen3 dense model
    • Migrate Eagle-3 and draft/target speculation to Drafter
    • Enable guided decoding with overlap scheduler
    • Support n-gram speculative decoding with disagg
    • Add beam search support to the PyTorch Workflow
    • Add LLGuidance Support for PyTorch Backend
    • Add NGrams V2 support
    • Add MTP support for Online EPLB
    • Support disaggregated serving in TRTLLM Sampler
    • Add core infrastructure to enable loading of custom checkpoint formats
    • Support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow running deep-ep on memory-constrained GPUs
    • Use huge page mapping for host accessible memory on GB200
    • Add user-provided speculative decoding support
    • Add streaming scaffolding_llm.generate_async support
    • Detokenize option in /v1/completions request
    • Integrate TRT-LLM Gen FP4 block scale MoE with Pytorch workflow kernel autotuner
    • Remove support for llmapi + TRT backend in Triton
    • Add request_perf_metrics to triton LLMAPI backend
    • Add support for Triton request cancellation
  • Benchmark:

    • Add support for benchmarking individual gemms in MOE benchmark (#6080)
    • Add speculative metrics for trtllm-bench
    • Add the ability to write a request timeline for trtllm-bench
    • Add no_kv_cache_reuse option and streaming support for trtllm-serve bench
    • Add latency support for trtllm-bench
    • Add Acceptance Rate calculation to benchmark_serving
    • Add wide-ep benchmarking scripts
    • Update trtllm-bench to support new Pytorch default
    • Add support for TRTLLM CustomDataset
    • Make benchmark_serving part of the library
  • Documentation:

    • Refactored the doc structure to focus on the PyTorch workflow.
    • Improved the LLMAPI and API reference documentation. Stable APIs are now protected and will remain consistent in subsequent versions following v1.0.
    • Removed legacy documentation related to the TensorRT workflow.

Infrastructure Changes

  • The base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:25.06-py3.
  • The base Docker image for TensorRT-LLM Backend is updated to nvcr.io/nvidia/tritonserver:25.06-py3.
  • The dependent NVIDIA ModelOpt version is updated to 0.33.
  • The dependent xgrammar version is updated to 0.1.21.
  • The dependent transformers version is updated to 4.53.1.

API Changes

  • BREAKING CHANGE Promote PyTorch to be the default LLM backend
  • BREAKING CHANGE Change default backend to PyTorch in trtllm-serve
  • BREAKING CHANGE Unify KvCacheConfig in LLM class for pytorch backend
  • BREAKING CHANGE Rename cuda_graph_config padding_enabled field
  • BREAKING CHANGE Rename mixed_sampler to enable_mixed_sampler
  • BREAKING CHANGE Rename LLM.autotuner_enabled to enable_autotuner (see the sketch after this list)
  • Add back allreduce_strategy parameter into TorchLlmArgs
  • Add LLmArgs option to force using dynamic quantization
  • Change default LoRA cache sizes and change peft_cache_config cache size fields to take effect when not explicitly set in lora_config
  • Remove deprecated LoRA LLM args, that are already specified in lora_config
  • Add request_perf_metrics to LLMAPI
  • Remove batch_manager::KvCacheConfig and use executor::KvCacheConfig instead
  • Remove TrtGptModelOptionalParams
  • Remove ptuning knobs from TorchLlmArgs
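
A minimal sketch of the renamed arguments above on the now-default PyTorch backend is shown below; the model path and the KvCacheConfig field are placeholders, and only the names stated in this list are taken from the release.

```python
# Minimal sketch of the 1.0 LLM API after the renames listed above:
# mixed_sampler -> enable_mixed_sampler, LLM.autotuner_enabled -> enable_autotuner,
# and the unified KvCacheConfig for the PyTorch backend.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",                   # placeholder model
    kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.8),  # unified KvCacheConfig
    enable_mixed_sampler=True,                                    # formerly mixed_sampler
    enable_autotuner=True,                                        # formerly LLM.autotuner_enabled
)
print(llm.generate(["Hello"], SamplingParams(max_tokens=8))[0].outputs[0].text)
```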

Fixed Issues

  • Fix illegal memory access in MLA (#6437)
  • Fix nemotronNAS loading for TP>1 (#6447)
  • Fix wide EP when using DeepEP with online EPLB (#6429)
  • Fix bugs caused by None attention_bias during Qwen3 model convert engine (#6344)
  • Fix PD + MTP + overlap scheduler accuracy issue (#6136)
  • Fix bug of Qwen3 when using fp4 on sm120 (#6065)
  • Fix TMA error with GEMM+AR on TP=2 (#6075)
  • Fix scaffolding aime test in test_e2e (#6140)
  • Fix KV Cache overrides in trtllm-bench (#6103)
  • Fix MOE benchmark to rotate buffers to prevent L2 cache reuse (#4135)
  • Fix eagle3 two model disaggregated serving test (#6014)
  • Fix chunked prefill + overlap scheduling (#5761)
  • Fix mgmn postprocess error (#5835)
  • Fallback to cubins for fp8 fmha kernels on Ada (#5779)
  • Fix disagg + speculative decoding (#5558)
  • Fix test_generate_with_seed CI failure. (#5772)
  • Fix prompt adapter TP2 case (#5782)
  • Fix disaggregate serving with attention DP (#4993)
  • Fix a quote error introduced in #5534 (#5816)
  • Fix the accuracy issue when reduce_fusion is enabled for GEMMA model. (#5801)
  • Fix lost requests for disaggregated serving (#5815)
  • Update unit tests: skip all_close assert for dropout in attention, increase tolerance for rope op test (#5855)
  • Fix GEMM+AR fusion on blackwell (#5563)
  • Fix llama4 multimodal support (#5809)
  • Fix Llama4 Scout FP4 crash issue (#5925)
  • Fix max batch size and max tokens in kv cache estimations for Nemotron-H (#5371)
  • Fix moe regression for sm120 (#5823)
  • Fix Qwen2.5VL FP8 support (#5029)
  • Fix the illegal memory access issue in moe gemm on SM120 (#5636)
  • Fix cases where tileN is not divisible by 16 and support sm89 deepgemm bmm (#5531)
  • Fix incremental detokenization (#5825)
  • Fix MoE workspace info by storing Torch tensor itself instead of data_ptr (#5900)
  • Fix mistral unit tests due to transformers upgrade (#5904)
  • Fix the Llama3.1 405B hanging issue. (#5698) (#5925)
  • Fix Gemma3 unit tests due to transformers upgrade (#5921)
  • Fix alltoall for llama4 (apply_router_weight_on_input=True) (#5902)
  • Remove SpecConfig and fix thread leak issues (#5931)
  • Fast redux detection in trtllm gen routing kernel (#5941)
  • Fix cancel request logic (#5800)
  • Fix errors in wide-ep scripts (#5992)
  • Fix error in post-merge-tests (#5949)
  • Fix missing arg to alltoall_prepare_maybe_dispatch (#5669)
  • Fix attention DP doesn't work with embedding TP (#5642)
  • Fix broken cyclic reference detect (#5417)
  • Fix permission issues for local users in the NGC docker container. (#5373)
  • Fix mtp vanilla draft inputs (#5568)
  • Fix mPtrExpertCounts allocation in MoE TRT-LLM backend (nvfp4) (#5519)
  • Fix block scale fp8 support for deepseek v3 on Blackwell. (#5514)
  • Fix the issue MoE autotune fallback failed to query default heuristic (#5520)
  • Fix the unexpected keyword argument 'streaming' (#5436)

Known Issues

  • When using disaggregated serving with pipeline parallelism and KV cache reuse, a hang can occur. This will be fixed in a future release. In the meantime, disabling KV cache reuse will fix this issue.
  • Running multi-node cases where each node has just a single GPU is known to fail. This will be addressed in a future release.
  • For the Llama 3.x and Llama 4 models, there is an issue with pipeline parallelism when using FP8 and NVFP4 weights. As a workaround, set the environment variable TRTLLM_LLAMA_EAGER_FUSION_DISABLED=1 (for example, export TRTLLM_LLAMA_EAGER_FUSION_DISABLED=1).

What's Changed


v1.1.0rc5

18 Sep 01:49
0c9430e


v1.1.0rc5 (Pre-release)

Announcement Highlights

  • Model Support
    • Enable NvFP4/FP8 quantization for Nemotron-H architecture (#7589)
    • Enable KV-cache reuse and add E2E tests for llava-next (#7349)
    • Support gpt-oss with fp8 kv cache (#7612)
    • Support kvcache reuse for phi4mm (#7563)
  • API
    • Add TorchLlmArgs to the connector api (#7493)
  • Benchmark
    • Extend test_perf.py to add disagg-serving perf tests (#7503)
    • Add accuracy test for deepseek-r1 with chunked_prefill (#7365)
  • Feature
    • Optimize MLA kernels with separate reduction kernels (#7597)
    • Wrap MOE with custom op (#7277)
    • Make the should_use_spec_decode logic a bit smarter (#7112)
    • Use a shell context to install dependencies (#7383)
    • Topk logprobs for TRT backend and top1 logprob for PyT backend (#6097)
    • Support chunked prefill for multimodal models (#6843)
    • Optimize MLA chunked prefill && support fp8 mla chunked prefill (#7477)
    • Disable deep_gemm for Qwen3 QKNormRoPEAttention and Linear layers due to accuracy issues (#7616)
    • Add deepseek r1-w4afp8 quickstart (#7645)
    • Nanobind: Allow none types for fields in result (#7672)
    • Using arrival time in llmapi when creating LlmRequest in pytorch workflow (#7553)
    • Support IPv6 for the UCX ZMQ IP (#7530)
    • Refactor: Quantization Transforms with Inheritance (#7227)

What's Changed

  • [None][chore] Remove closed bugs by @xinhe-nv in #7591
  • [https://nvbugs/5454559][fix] handle bias term in fuse_gate_mlp by @Linda-Stadter in #7449
  • [None][fix] enable NvFP4/FP8 quantization for Nemotron-H architecture by @tomeras91 in #7589
  • [None][feat] Optimize MLA kernels with separate reduction kernels by @PerkzZheng in #7597
  • [https://nvbugs/5445466][fix] unwaive DS R1 test cases with bug already fixed by @lancelly in #7429
  • [#6798][fix] fix compilation error in ub_allocator in single device build by @WilliamTambellini in #6874
  • [https://nvbugs/5434424][fix] A quick fix for the wrong output issue of SM89 blocked scaling batched GEMM when the input tensor is non-contiguous. by @StudyingShao in #7615
  • [None][chore] add TorchLlmArgs to the connector api by @richardhuo-nv in #7493
  • [TRTLLM-6707][fix] nanobind fix for executor exit call by @Linda-Stadter in #7565
  • [None][ci] add DGX_H100-2_GPUs-PyTorch-Others-1 pipeline by @QiJune in #7629
  • [TRTLLM-7408][feat] Wrap MOE with custom op. by @liji-nv in #7277
  • [TRTLLM-5059][feat] Enable KV-cache reuse and add E2E tests for llava-next by @chang-l in #7349
  • [None][fix] fix post-merge issue raised by #5488 by @nv-guomingz in #7655
  • [https://nvbugs/5410687][test] Add deepseek r1-w4afp8 quickstart by @fredricz-20070104 in #7645
  • [None][fix] UCX zmq ip support ipv6 by @chuangz0 in #7530
  • [None][feat] Make the should_use_spec_decode logic a bit smarter by @zheyuf in #7112
  • [#5861][autodeploy] Refactor: Quantization Transforms with Inheritance by @Fridah-nv in #7227
  • [#7208][fix] Fix config type of MedusaConfig by @karljang in #7320
  • [None][infra] Bump version to 1.1.0rc5 by @yiqingy0 in #7668
  • [TRTLLM-7871][infra] Extend test_perf.py to add disagg-serving perf tests. by @bo-nv in #7503
  • [https://nvbugs/5494698][fix] skip gemma3 27b on blackwell by @xinhe-nv in #7505
  • [https://nvbugs/5477359][fix] Nanobind: Allow none types for fields in result by @Linda-Stadter in #7672
  • [None][chore] remove executor config in kv cache creator by @leslie-fang25 in #7526
  • [https://nvbugs/5488212][waive] Waive failed tests for L20 by @nvamyt in #7664
  • [None][feat] Use a shell context to install dependencies by @v-shobhit in #7383
  • [https://nvbugs/5505402] [fix] Disable deep_gemm for Qwen3 QKNormRoPEAttention and Linear layers due to accuracy issues by @DomBrown in #7616
  • [None][infra] Waive failed cases on main 0910 by @EmmaQiaoCh in #7676
  • [None][infra] Adjust labeling llm prompt for bug issues by @karljang in #7385
  • [None][ci] move some test cases from l40s to a30 by @QiJune in #7684
  • [None][fix] Fix the incorrect header file import in dataType.h by @Fan-Yunfan in #7133
  • [https://nvbugs/5498165][fix] fix permission error for config file lock by @chang-l in #7656
  • [https://nvbugs/5513192][fix] Add the missing param for kv_cache_tran… by @nv-guomingz in #7679
  • [TRTLLM-1302][feat] Topk logprobs for TRT backend and top1 logprob for PyT backend by @LinPoly in #6097
  • [TRTLLM-7169][infra] Fix Slurm multi-node test showing "Submit Test Results" in the test name by @ZhanruiSunCh in #6856
  • [TRTLLM-6791][infra] Add check for uploading stage name and avoid overriding test result tar file by @ZhanruiSunCh in #6742
  • [None][ci] Some improvements for Slurm CI by @chzblych in #7689
  • [None][ci] Test waives for the main branch 09/14 by @chzblych in #7698
  • [None][feat] support gpt-oss with fp8 kv cache by @PerkzZheng in #7612
  • [TRTLLM-6903][feat] Support chunked prefill for multimodal models by @chang-l in #6843
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #7682
  • [None][chore] Enable multiple postprocess workers tests for chat completions api by @JunyiXu-nv in #7602
  • [TRTLLM-7279][test] add accuracy test for deepseek-r1 with chunked_prefill by @crazydemo in #7365
  • [https://nvbugs/5467981][fix] Fix Qwen2.5-VL fails with cuda graph padding by @DylanChen-NV in #7122
  • [None][chore] move some cases from post-merge to pre-merge to detect errors in early stage by @HuiGao-NV in #7699
  • [TRTLLM-7918][feat] Support kvcache reuse for phi4mm by @Wanli-Jiang in #7563
  • [None][test] add test for min_tokens by @ixlmar in #7678
  • [TRTLLM-7918][feat] Revert "Support kvcache reuse for phi4mm (#7563)" by @Wanli-Jiang in #7722
  • [None][fix] using arrival time in llmapi when creating LlmRequest in pytorch workflow by @zhengd-nv in #7553
  • [TRTLLM-7192][feat] optimize MLA chunked prefill && support fp8 mla chunked prefill by @jmydurant in #7477
  • [None][ci] Test waives for the main branch 09/15 by @chzblych in #7709

New Contributors

Full Changelog: v1.1.0rc4...v1.1.0rc5

v1.1.0rc4

10 Sep 07:32
62b564a

v1.1.0rc4 Pre-release

Announcement Highlights

  • Model Support
    • Support phi-4 model in pytorch backend (#7371)
    • Support Aggregate mode for phi4-mm (#7521)
  • API
    • Implement basic functionality for the Responses API (#7341; see the client sketch after this list)
    • Support multiple postprocess workers for chat completions API (#7508)
    • Report failing requests (#7060)
  • Benchmark
    • Test trtllm-serve with --extra_llm_api_options (#7492)
  • Feature
    • Add MOE support for dynamic cluster shapes and custom epilogue schedules (#6126)
    • Autotune TRT-LLM Gen MoE when using CUDA graphs (#7285)
    • Enable guided decoding with speculative decoding (part 2: one-model engine) (#6948)
    • Separate run_shape_prop as another graph utility (#7313)
    • MultiLayer Eagle (#7234)
    • Fuse d2t to logitsBitmaskKernel and fix a race condition in one-model spec (#7481)
    • Add NVFP4 x FP8 (#6809)
    • Support hashing and KV cache reuse for videos (#7360)
    • Add MCTS and TOT tree-based inference controllers to Scaffolding (#7490)
    • Introduce QKNormRoPEAttention module (#6830)
    • AutoDeploy: flexible args for sequence interface + AD multi-modal input processor + llama4 VLM example (#7221)
    • Support KV cache salting for secure KV cache reuse (#7106)
    • trtllm-gen kernels support sm103 (#7570)
    • Move stop_criteria to sample_async (#7041)
    • KV cache transfer for uneven pp (#7117)
    • Update multimodal utility get_num_tokens_per_image for better generalization (#7544)
    • AutoDeploy: set torch recompile_limit based on cuda_graph_batch_sizes and refactored (#7219)
    • Add Request specific exception (#6931)
    • Add DeepSeek-v3-0324 e2e torch test (#7413)
    • Add 8-GPU test cases for RTX6000 (#7083)
    • Add gptoss 20g tests (#7361)
    • Nixl support for GDS (#5488)
    • CMake option to link statically with cublas/curand (#7178)
    • Extend VLM factory and add Mistral3 factory (#7583)
  • Documentation
    • Fix example in docstring (#7410)
    • Fix formatting error in Gemma3 readme (#7352)
    • Add note about trtllm-serve to the devel container (#7483)
    • Add GPT OSS Eagle3 blog (#7140)
    • 1.0 Documentation. (#6696)
    • Update kvcache part (#7549)
    • Rename TensorRT-LLM to TensorRT LLM. (#7554)
    • Refine docs for accuracy evaluation of gpt-oss models (#7252)
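
The Responses API item (#7341) adds an OpenAI-style responses route to trtllm-serve. Below is a minimal client-side sketch; it assumes a server is already running at localhost:8000, that the route lives under the usual /v1 prefix, that the installed openai client exposes the Responses interface, and that the model name matches whatever the server is hosting. Since the PR only covers basic functionality, not every Responses feature may be available.

```python
# Minimal client sketch against a running trtllm-serve instance (assumptions noted above).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.responses.create(
    model="openai/gpt-oss-20b",  # placeholder: the model served by trtllm-serve
    input="Summarize what speculative decoding does.",
)
print(response.output_text)  # convenience accessor from the openai SDK
```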

What's Changed

  • [https://nvbugs/5485430][fix] Copy the nanobind file when using precompiled package by @jiaganc in #7334
  • [None][infra] Using local variables in rerun function by @yiqingy0 in #7198
  • [None][ci] Correct docker args for GPU devices and remove some stale CI codes by @chzblych in #7417
  • [https://nvbugs/5476580][fix] unwaive test_nvfp4_4gpus by @Superjomn in #7454
  • [None][test] auto reuse torch empty cache on qa test by @crazydemo in #7421
  • [None][doc] fix example in docstring by @tomeras91 in #7410
  • [TRTLLM-6643][feat] Add DeepSeek-v3-0324 e2e torch test by @aalanwyr in #7413
  • [None][infra] waive test case failed on post-merge by @HuiGao-NV in #7471
  • [TRTLLM-7208][feat] Implement basic functionalities for Responses API by @JunyiXu-nv in #7341
  • [https://nvbugs/5453992][unwaive] Unwaive llama quickstart test by @peaceh-nv in #7242
  • [None][infra] Waive failed tests on main branch 0902 by @EmmaQiaoCh in #7482
  • [None][chore] Fix formatting error in Gemma3 readme by @karljang in #7352
  • [https://nvbugs/5470782][fix] Add specific test names for test_deepseek.py by @SimengLiu-nv in #7318
  • [https://nvbugs/5458798][fix] Disabled test_trtllm_bench_backend_comparison due to timeout by @MrGeva in #7397
  • [None][chore] Add note about trtllm-serve to the devel container by @MartinMarciniszyn in #7483
  • [None][chore] rm executor config in kv cache connector by @leslie-fang25 in #7372
  • [None][perf] Add MOE support for dynamic cluster shapes and custom epilogue … by @djns99 in #6126
  • [None][perf] Autotune TRT-LLM Gen MoE when using CUDA graphs by @jinyangyuan-nvidia in #7285
  • [TRTLLM-7261][feat] Support phi-4 model in pytorch backend by @Wanli-Jiang in #7371
  • [https://nvbugs/5480289][fix] release slot manager in mtp MTPHiddenStatesManager by @yweng0828 in #7340
  • [https://nvbugs/5488141][fix] Unwaive llama3 test_eagle3 by @mikeiovine in #7486
  • [https://nvbugs/5472947][fix] wait on isend handles before reusing buffers by @amukkara in #7462
  • [TRTLLM-7363][test] Add 8-GPU test cases for RTX6000 by @StanleySun639 in #7083
  • [https://nvbugs/5485593][fix] improve accuracy/test_disaggregated_serving.py by @reasonsolo in #7366
  • [None][doc] add GPT OSS Eagle3 blog by @IzzyPutterman in #7140
  • [None][fix] Fix KV cache recompute in draft_target spec decode by @mikeiovine in #7348
  • [TRTLLM-7028][feat] Enable guided decoding with speculative decoding (part 2: one-model engine) by @syuoni in #6948 (see the sketch after this list)
  • [None][chore] Remove two unused parameters in create_py_executor by @leslie-fang25 in #7458
  • [#7222][autodeploy] Separate run_shape_prop as another graph utility by @Fridah-nv in #7313
  • [None][fix] Fix a numerical stability issue for XQA with spec dec by @lowsfer in #7114
  • [https://nvbugs/5470769][fix] fix disagg-serving accuracy test case by @reasonsolo in #7338
  • [TRTLLM-7876][test] Test trtllm-serve with --extra_llm_api_options by @StanleySun639 in #7492
  • [https://nvbugs/5485102][fix] Correctly set stride for piecewise outp… by @liji-nv in #7442
  • [TRTLLM-7442][model] Remove unnecessary D2H copies by @2ez4bz in #7273
  • [TRTLLM-6199][infra] Update for using open driver from BSL by @EmmaQiaoCh in #7430
  • [None][fix] Fix a typo in the Slurm CI codes by @chzblych in #7485
  • [TRTLLM-6342][fix] Fixed triggering BMM sharding by @greg-kwasniewski1 in #7389
  • [None][fix] fix hunyuan_moe init bug by @sorenwu in #7502
  • [None][chore] Bump version to 1.1.0rc4 by @yiqingy0 in #7525
  • [https://nvbugs/5485886][fix] Fix resource free of Eagle3ResourceManager by @kris1025 in #7437
  • [TRTLLM-6893][infra] Disable the x86 / SBSA build stage when run BuildDockerImage by @ZhanruiSunCh in #6729
  • [https://nvbugs/5477730][fix] Fix the alltoall case when tp_size larger than ep_size by @WeiHaocheng in #7331
  • [TRTLLM-6308][feat] Support Aggregate mode for phi4-mm by @Wanli-Jiang in #7521
  • [None][ci] set TORCHINDUCTOR_COMPILE_THREADS for thop/parallel tests by @QiJune in #7489
  • [None][test] update nim and full test list by @crazydemo in #7468
  • [None][feat] MultiLayer Eagle by @IzzyPutterman in #7234
  • [TRTLLM-7027][feat] Fuse d2t to logitsBitmaskKernel and fix a race condition in one-model spec by @syuoni in #7481
  • [OMNIML-2336][feat] Add NVFP4 x FP8 by @sychen52 in #6809
  • [https://nvbugs/5492485][fix] Use offline dataset from llm-models instead. by @yuxianq in #7435
  • [TRTLLM-7410][feat] Support hashing and KV cache reuse for videos by @chang-l in #7360
  • [https://nvbugs/5369366] [fix] Report failing requests by @arekay in #7060
  • [None][feat] Add Request specific exception by @Shunkangz in #6931
  • [#3325][feat] Add MCTS and TOT tree-based inference controllers to Scaffolding by @therealnaveenkamal in #7490
  • [https://nvbugs/5483615][fix] Remove unnecessary assertion to let mai… by @liji-nv in #7441
  • [None][ci] remove unnecessary test_modeling_deepseek.py by @QiJune in #7542
  • [None][chore] Remove closed bugs by @xinhe-nv in #7408
  • [TRTLLM-6642][feat] add gptoss 20g tests by @xinhe-nv in #7361
  • [None][ci] Increase the number of retries in docker image generation by @chzblych in #7557
  • [None][infra] update nspect version by @niukuo in #7552
  • …
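
Guided decoding now composes with the one-model speculative-decoding engine (#6948). The sketch below shows one way to exercise that combination through the LLM API; the EAGLE3 speculative_config, the draft-checkpoint path, the target model, and the import locations are assumptions for illustration rather than a configuration taken from the PR.

```python
# Minimal sketch: guided (JSON-constrained) decoding on top of a one-model EAGLE3 setup.
# Import paths and config fields may differ slightly between versions.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import EagleDecodingConfig, GuidedDecodingParams

schema = '{"type": "object", "properties": {"answer": {"type": "string"}}, "required": ["answer"]}'

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",           # placeholder target model
    guided_decoding_backend="xgrammar",
    speculative_config=EagleDecodingConfig(             # assumed one-model EAGLE3 config
        max_draft_len=3,
        speculative_model_dir="/path/to/eagle3-draft",  # placeholder draft checkpoint
        eagle3_one_model=True,
    ),
)

sampling = SamplingParams(
    max_tokens=128,
    guided_decoding=GuidedDecodingParams(json=schema),  # constrain output to the schema
)

for output in llm.generate(["Answer as JSON: what does MLA stand for?"], sampling):
    print(output.outputs[0].text)
```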

v1.1.0rc2.post2

15 Sep 05:11
ef0d06d

v1.1.0rc2.post2 Pre-release

Announcement Highlights

  • Feature
    • Add MNNVL AlltoAll tests to pre-merge (#7465)
    • Support multi-threaded tokenizers for trtllm-serve (#7515)
    • FP8 Context MLA integration (#7581; see the sketch after this list)
    • Support block wise FP8 in wide ep (#7423)
    • Cherry-pick Responses API and multiple postprocess workers support for chat harmony (#7600)
    • Expose low_precision_combine as an LLM arg (#7598)
  • Documentation
    • Update deployment guide and cherry-pick CI test fix from main (#7623)
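
The FP8 context MLA integration (#7581) comes into play when an MLA-based model runs with an FP8 KV cache. A minimal sketch of that setup is below; the KvCacheConfig.dtype value, the memory fraction, and the model path are assumptions used to illustrate the knob, not a validated configuration from this release.

```python
# Minimal sketch: requesting an FP8 KV cache for an MLA-based model via the LLM API.
# Real DeepSeek-class deployments also need parallelism settings; omitted here.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

llm = LLM(
    model="deepseek-ai/DeepSeek-V3",       # placeholder MLA-based model
    kv_cache_config=KvCacheConfig(
        dtype="fp8",                       # FP8 KV cache (assumed field/value)
        free_gpu_memory_fraction=0.85,
    ),
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```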

What's Changed

  • [None] [test] Add MNNVL AlltoAll tests to pre-merge by @kaiyux in #7465
  • [TRTLLM-7292][feat] Support multi-threaded tokenizers for trtllm-serve by @nv-yilinf in #7515
  • [None][fix] trtllm-serve yaml loading by @Superjomn in #7551
  • [None][chore] Bump version to 1.1.0rc2.post2 by @yiqingy0 in #7582
  • [https://nvbugs/5498967][fix] Downgrade NCCL by @yizhang-nv in #7556
  • [TRTLLM-6994][feat] FP8 Context MLA integration. by @yuxianq in #7581
  • [TRTLLM-7831][feat] Support block wise FP8 in wide ep by @xxi-nv in #7423
  • [None][chore] Make use_low_precision_moe_combine as a llm arg by @zongfeijing in #7598
  • [None][fix] Update deployment guide and cherry-pick CI test fix from main by @dongfengy in #7623
  • [None][feat] Cherry-pick Responses API and multiple postprocess workers support for chat harmony by @JunyiXu-nv in #7600
  • [None][chore] Fix kernel launch param and add TRTLLM MoE backend test by @pengbowang-nv in #7524

Full Changelog: v1.1.0rc2.post1...v1.1.0rc2.post2

v1.1.0rc2.post1

06 Sep 00:06
9d6e87a

v1.1.0rc2.post1 Pre-release

Announcement Highlights

  • API
    • Update TargetInfo to accommodate CP in disagg (#7224)
  • Benchmark
    • Minor fixes to slurm and benchmark scripts (#7453)
  • Feature
    • Support DeepGEMM swap-AB on sm100 (#7355)
    • Merge add sparse exp and shared exp into local re… (#7422)
    • Add batch waiting when scheduling (#7287)
    • Reuse pytorch memory segments occupied by cudagraph pool (#7457)
    • Complete the last missing allreduce op in Llama3/4 (#7420)
  • Documentation
    • Exposing the ADP balance strategy tech blog (#7380)
    • Update Dynasor paper info (#7137)
    • store blog 10 media via lfs (#7375)

What's Changed

  • [None][doc] Exposing the ADP balance strategy tech blog by @juney-nvidia in #7380
  • [None][feat] Update TargetInfo to accommodate CP in disagg by @brb-nv in #7224
  • [None][docs] Update Dynasor paper info by @AndyDai-nv in #7137
  • [None] [fix] store blog 10 media via lfs by @Funatiq in #7375
  • [TRTLLM-7250][fix] Add failed cases into waives.txt by @xinhe-nv in #7342
  • [None][chore] bump version to 1.1.0rc2.post1 by @litaotju in #7396
  • [TRTLLM-6747][feat] Merge add sparse exp and shared exp into local re… by @zongfeijing in #7422
  • [None] [fix] Fix nsys in slurm scripts by @kaiyux in #7409
  • [None][feat] Support DeepGEMM swap-AB on sm100 by @Barry-Delaney in #7355
  • [None] [fix] Minor fixes to slurm and benchmark scripts by @kaiyux in #7453
  • [None][fix] Fix possible mpi broadcast and gather issue on large object by @dongxuy04 in #7507
  • [TRTLLM-7008][fix] Add automatic shared memory delete if already exist by @dongxuy04 in #7377
  • [None][ci] Cherry-pick some improvements for Slurm CI setup from main branch by @chzblych in #7479
  • [https://nvbugs/5481434][feat] Reuse pytorch memory segments occupied by cudagraph pool by @HuiGao-NV in #7457
  • [None][fix] Update DG side branch name by @Barry-Delaney in #7491
  • [None][fix] Update DG commit by @Barry-Delaney in #7534
  • [None][fix] Fix a typo in the Slurm CI codes (#7485) by @chzblych in #7538
  • [https://nvbugs/5488582][fix] Avoid unexpected Triton recompilation in DG fused_moe. by @hyukn in #7495
  • [None][fix] Cherry-pick 6850: Complete the last missing allreduce op in Llama3/4. by @hyukn in #7420
  • [None][opt] Add batch waiting when scheduling by @yunruis in #7287
  • [https://nvbugs/5485325][fix] Add a postprocess to the model engine to fix the CUDA graph warmup issue when using speculative decoding by @lfr-0531 in #7373
  • [None][fix] Cherry-Pick MNNVLAllreduce Fixes into release/1.1.0rc2 branch by @timlee0212 in #7487

Full Changelog: v1.1.0rc2...v1.1.0rc2.post1