[Feature][Kernel][DSR1]: Make `fused_grouped_topk` more fused (integrate TRT-LLM kernel)

### 🚀 The feature, motivation and pitch

Right now, this operation runs as several separate steps around the `ops.grouped_topk` kernel instead of as one fused kernel (https://github.com/vllm-project/vllm/blob/4ea62b77f5c009515f50d14cda24665101a5d910/vllm/model_executor/layers/fused_moe/fused_moe.py#L1320-L1350):

```python
def fused_grouped_topk(
    hidden_states: torch.Tensor,
    gating_output: torch.Tensor,
    topk: int,
    renormalize: bool,
    e_score_correction_bias: torch.Tensor,
    num_expert_group: int = 0,
    topk_group: int = 0,
    scoring_func: str = "softmax",
    routed_scaling_factor: float = 1.0,
) -> tuple[torch.Tensor, torch.Tensor]:
    assert hidden_states.size(0) == gating_output.size(0), "Number of tokens mismatch"

    if scoring_func == "softmax":
        scores = torch.softmax(gating_output, dim=-1)
    elif scoring_func == "sigmoid":
        scores = gating_output.sigmoid()
    else:
        raise ValueError(f"Unsupported scoring function: {scoring_func}")

    scores_with_bias = scores + e_score_correction_bias.unsqueeze(0)
    topk_values, topk_indices = ops.grouped_topk(
        scores,
        scores_with_bias.to(scores.dtype),
        num_expert_group,
        topk_group,
        topk,
        renormalize,
        routed_scaling_factor,
    )
    return topk_values.to(torch.float32), topk_indices.to(torch.int32)
```

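We should fuse this into a single kernel that also handles:
- the sigmoid scoring
- the `e_score_correction_bias` addition
- the output type conversions (`float32` weights, `int32` indices)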
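
To make the target concrete, here is a minimal, dependency-free sketch of what a fully fused kernel would have to compute for one token on the sigmoid path. It is not the vLLM or TRT-LLM implementation. The function name `grouped_topk_reference` is hypothetical, and the group-selection rule (rank each group by the sum of its two highest biased scores, select experts on biased scores, return unbiased scores as weights) is an assumption about DeepSeek-style routing that should be checked against `ops.grouped_topk` before using this as a correctness oracle. Plain Python has no tensor dtypes, so the `float32`/`int32` outputs are only noted in a comment.

```python
import math


def grouped_topk_reference(
    logits,
    bias,
    num_expert_group,
    topk_group,
    topk,
    renormalize=True,
    routed_scaling_factor=1.0,
):
    """Single-token reference for sigmoid + bias + grouped top-k.

    ASSUMPTION: DeepSeek-style semantics; verify against ops.grouped_topk.
    """
    num_experts = len(logits)
    assert num_experts % num_expert_group == 0, "experts must split evenly"
    group_size = num_experts // num_expert_group

    # Step 1 (currently a separate op): sigmoid scoring.
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]

    # Step 2 (currently a separate op): bias addition, used for selection only.
    biased = [s + b for s, b in zip(scores, bias)]

    # Step 3: rank groups by the sum of their two highest biased scores (assumed rule).
    group_scores = []
    for g in range(num_expert_group):
        members = sorted(biased[g * group_size:(g + 1) * group_size], reverse=True)
        group_scores.append(sum(members[:2]))
    kept_groups = set(
        sorted(range(num_expert_group), key=lambda g: group_scores[g], reverse=True)[
            :topk_group
        ]
    )

    # Step 4: pick the top-k experts by biased score inside the kept groups.
    candidates = [e for e in range(num_experts) if e // group_size in kept_groups]
    chosen = sorted(candidates, key=lambda e: biased[e], reverse=True)[:topk]

    # Step 5: weights come from the unbiased scores, then renormalize and scale.
    weights = [scores[e] for e in chosen]
    if renormalize:
        total = sum(weights)
        weights = [w / total for w in weights]
    weights = [w * routed_scaling_factor for w in weights]

    # Step 6 (currently separate casts): a fused kernel would write
    # float32 weights and int32 indices directly.
    return weights, chosen


logits = [0.0, 2.0, -1.0, 1.0, 3.0, -2.0, 0.5, 0.4]
weights, ids = grouped_topk_reference(
    logits, [0.0] * 8, num_expert_group=4, topk_group=2, topk=2
)
print(ids, weights)
```

A reference like this is mainly useful for comparing the output of a candidate fused kernel against the current multi-step path on small inputs, including cases where the bias changes which experts are selected.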

### Alternatives

See below. We should pull in the kernel from TRT-LLM.

### Additional context

See below. We should pull in the kernel from TRT-LLM.

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
