Add CUDA argmax kernel for LLM sampler #16386
base: main
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16386
Note: Links to docs will display an error until the docs builds have been completed.
❌ 2 New Failures, 2 Unrelated Failures as of commit f8cd4d2 with merge base c5d66a5.
NEW FAILURES - The following jobs have failed:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
#include <executorch/extension/llm/sampler/cuda_sampler.h>
#include <executorch/runtime/platform/log.h>

namespace executorch {
Consider using a nested namespace to follow the C++17 standard.
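For reference, a minimal sketch of the C++17 nested-namespace syntax; the full path executorch::extension::llm is an assumption based on the file location, not necessarily what this PR uses:

// Instead of nesting separate blocks:
//   namespace executorch { namespace extension { namespace llm { ... } } }
// C++17 allows a single nested-namespace definition:
namespace executorch::extension::llm {

// sampler implementation goes here, unchanged

} // namespace executorch::extension::llm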
          (const nv_bfloat16*)logits, rows, vocab, out_token, out_maxlogit);
      break;
    default:
      // Unsupported type, fall back to float
Perhaps we should raise an error here instead, to avoid silently producing wrong results?
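A minimal sketch of what that could look like, assuming the switch dispatches on the logits dtype and that ExecuTorch's platform macros (ET_LOG from log.h, which is already included, and ET_CHECK_MSG) are usable in this file; the variable name dtype is assumed:

    default:
      // Assumption: fail fast rather than silently falling back to float,
      // so an unsupported dtype surfaces as an explicit error.
      ET_LOG(
          Error,
          "CUDA argmax sampler: unsupported logits dtype %d",
          static_cast<int>(dtype));
      ET_CHECK_MSG(false, "Unsupported dtype for CUDA argmax sampler");
      break;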
I don't understand why you had to write completely custom code here.
There's already an ATen reduce kernel for argmax in PyTorch (https://docs.pytorch.org/docs/stable/generated/torch.argmax.html).
The torch layer has a libtorch dependency (e.g., it depends on TensorIterator), so you want to implement argmax as a shim layer in ExecuTorch instead, similar to how int4mm is done. That way, any model that uses torch.argmax can be lowered to the ExecuTorch CUDA backend easily in the future.
You can add files in:
backends/cuda/runtime/shims/argmax.cu
backends/cuda/runtime/shims/argmax.cuh
backends/cuda/runtime/shims/argmax.h (header file)
and update aoti_cuda_shims.lib with the new kernel (a rough sketch of the header is below).
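A hedged sketch of what the shim header could declare, in the extern "C" style that AOTInductor shims use; the type aliases and the function name aoti_torch_cuda_argmax are illustrative assumptions, not the actual ExecuTorch/AOTI API:

// backends/cuda/runtime/shims/argmax.h (hypothetical sketch)
#pragma once

#include <cstdint>

// Assumed handle/error aliases in the spirit of the AOTI C shim ABI;
// the real ExecuTorch shim headers may define these differently.
using AOTITensorHandle = void*;
using AOTIShimError = int32_t;

extern "C" {

// Hypothetical entry point: computes argmax of `self` along `dim`
// (keepdim controls the output rank) and returns the result via `out`.
AOTIShimError aoti_torch_cuda_argmax(
    AOTITensorHandle self,
    int64_t dim,
    bool keepdim,
    AOTITensorHandle* out);

} // extern "C"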
Side note: another midsize task we can tackle as a team is to decouple TensorIterator in pytorch/pytorch from libtorch as much as possible -- perhaps as part of the header-only files.
Add a CUDA kernel for the argmax operation to support GPU-based sampling:
… for efficient parallel max finding. Supports float, half, and bfloat16.
… handles device-to-host copy, and synchronization.
… data types, edge cases, and numerical precision.
… and GoogleTest-based unit tests.
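As an illustration of the "efficient parallel max finding" described above, here is a minimal, self-contained sketch of a per-row block-reduction argmax kernel in CUDA. It is float-only for brevity, and the kernel name, launch configuration, and output types are assumptions for illustration, not code taken from this PR:

#include <cuda_runtime.h>
#include <cfloat>

// Hypothetical sketch: one block per row, shared-memory tree reduction.
// Each thread scans a strided slice of the row, then the block reduces
// (value, index) pairs down to the row's argmax. Requires blockDim.x to
// be a power of two.
__global__ void row_argmax_kernel(
    const float* logits,   // [rows, vocab]
    int vocab,
    long long* out_token,  // [rows], assumed output type
    float* out_maxlogit) { // [rows]
  extern __shared__ unsigned char smem[];
  float* s_val = reinterpret_cast<float*>(smem);
  int* s_idx = reinterpret_cast<int*>(s_val + blockDim.x);

  const int row = blockIdx.x;
  const float* row_logits = logits + static_cast<long long>(row) * vocab;

  // Thread-local scan over a strided slice of the row.
  float best_val = -FLT_MAX;
  int best_idx = 0;
  for (int i = threadIdx.x; i < vocab; i += blockDim.x) {
    float v = row_logits[i];
    if (v > best_val) {
      best_val = v;
      best_idx = i;
    }
  }
  s_val[threadIdx.x] = best_val;
  s_idx[threadIdx.x] = best_idx;
  __syncthreads();

  // Shared-memory tree reduction; ties resolve to the lower index.
  for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (threadIdx.x < stride) {
      float other_val = s_val[threadIdx.x + stride];
      int other_idx = s_idx[threadIdx.x + stride];
      if (other_val > s_val[threadIdx.x] ||
          (other_val == s_val[threadIdx.x] && other_idx < s_idx[threadIdx.x])) {
        s_val[threadIdx.x] = other_val;
        s_idx[threadIdx.x] = other_idx;
      }
    }
    __syncthreads();
  }

  if (threadIdx.x == 0) {
    out_token[row] = s_idx[0];
    out_maxlogit[row] = s_val[0];
  }
}

// Example launch: one block of 256 threads per row.
//   size_t smem_bytes = 256 * (sizeof(float) + sizeof(int));
//   row_argmax_kernel<<<rows, 256, smem_bytes>>>(logits, vocab, out_token, out_maxlogit);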