
Consider using torch.compile(model, fullgraph=True, mode="reduce-overhead") #6

Open
lezcano opened this issue Jun 11, 2024 · 11 comments

Comments

@lezcano

lezcano commented Jun 11, 2024

fullgraph=True will make sure that there are no graph breaks (this may already be the case).
mode="reduce-overhead" will use CUDA graphs if possible. See in [these benchmarks] how going from regular torch.compile to reduce-overhead gives a solid 70-100% additional speed-up.
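For reference, a minimal sketch of what this would look like in the training script (the model and variable names are illustrative, not taken from this repo):

```python
import torch

model = GPT(config).to("cuda")  # illustrative model; any nn.Module works the same way

# fullgraph=True turns any graph break into a hard error instead of silently splitting the graph;
# mode="reduce-overhead" asks Inductor to record and replay CUDA graphs where it can
model = torch.compile(model, fullgraph=True, mode="reduce-overhead")
```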

@lezcano

lezcano commented Jun 11, 2024

For performance reasons, you might also want to avoid synchronising after every loop iteration and instead synchronise only every 10 or 20 iterations, averaging out the result (see the sketch below). That being said, I understand this would affect QoL for the script, so fair enough.
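Something along these lines (a rough sketch; the interval, `num_steps` and `train_step` are placeholders, not names from the script):

```python
import time
import torch

sync_every = 20  # illustrative interval: synchronise every N steps instead of every step
t0 = time.time()
for step in range(num_steps):
    loss = train_step()  # placeholder for one forward/backward/optimizer step
    if (step + 1) % sync_every == 0:
        torch.cuda.synchronize()  # wait for the queued kernels before reading the clock
        dt = (time.time() - t0) / sync_every
        print(f"avg step time over the last {sync_every} steps: {dt * 1000:.1f} ms")
        t0 = time.time()
```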

@zeux

zeux commented Jun 12, 2024

On RTX 4090 / PyTorch nightly this reduces the throughput slightly (from 130k tok/s to 127k tok/s, or equivalently from 4030ms dt to 4116ms dt; using B=16 to make sure training fits into 24 GB VRAM). This is specifically attributable to reduce-overhead mode; fullgraph=True works fine without changing performance (as I understand it, it merely turns graph breaks into compile errors, and there are no graph breaks here).

@lezcano

lezcano commented Jun 12, 2024

Perhaps some tweaks are needed to make CUDA graphs run. You can see whether they were enabled or not by running your program with TORCH_LOGS=cudagraphs.
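For example (the script name is just a placeholder for whatever entry point you run):

```bash
TORCH_LOGS=cudagraphs python train.py
```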

@zeux

zeux commented Jun 12, 2024

Yes, the logs indicate that mode=reduce-overhead uses CUDA graphs and that by default they are not used. I assume there are some restrictions on kernel compilation/fusion when CUDA graphs are enabled, and these outweigh the CPU-overhead savings in this case, since an individual step is fairly expensive anyway.

@lezcano

lezcano commented Jun 12, 2024

Within PyTorch there are no heuristics on whether to use CUDA graphs or not: if reduce-overhead is on, PyTorch will try its best to use CUDA graphs. There are some limitations on which programs we can enable CUDA graphs for, though, in terms of input mutations, graph dynamism and so on. Sometimes the implementation needs to be tweaked a little (often not much) to make it amenable to CUDA graphs.
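A typical tweak, as an illustrative sketch (not code from this repo): clone any output you keep across iterations, since CUDA graph replay reuses static output buffers, and/or mark iteration boundaries explicitly:

```python
import torch

compiled = torch.compile(model, fullgraph=True, mode="reduce-overhead")

for batch in loader:  # placeholder training loop
    torch.compiler.cudagraph_mark_step_begin()  # tell the CUDA graphs runtime a new iteration starts
    out = compiled(batch)
    # clone if the value is needed beyond the next iteration, because the
    # graph's output buffer will be overwritten on the next replay
    kept = out.detach().clone()
```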

@zeux

zeux commented Jun 12, 2024

Sure - my point is that whatever else reduce-overhead changes in the compilation process, it's more detrimental to overall performance on this workload on the 4090 than CUDA graphs are beneficial.

@JohannesVod

Someone has to try this out on an A100; it would probably boost performance quite a lot. There are also other flags worth trying.

@marib00

marib00 commented Jun 18, 2024

Tried on an H100. Goes down from ~277k tok/sec to ~269k tok/sec on nightly 2.5.0.dev20240616+cu124 🤷‍♂️

@lezcano

lezcano commented Jun 18, 2024

A few points:

  • reduce-overhead tries to turn on CUDA graphs. Sometimes it can't, and it will simply fall back to eager.
  • To see the reasons why this may have failed, run the program with TORCH_LOGS=cudagraphs.
  • If PyTorch could not run the model with CUDA graphs enabled, you might need to make some minor modifications to the model for it to run.
  • All in all, I'd be very surprised if the model with CUDA graphs enabled actually ran slower than without them.

If you find the culprit of why it didn't run in the first place, feel free to tag @eellison in that PR. He's the maintainer of CUDA graphs within PyTorch.

@JohannesVod

@marib00 very nice! Can you try "max-autotune" as well, maybe? It is documented in https://pytorch.org/docs/stable/generated/torch.compile.html and might be even faster. Anyway, someone should create a PR.
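For reference, that would just be the following (illustrative; note that max-autotune can take much longer to compile):

```python
model = torch.compile(model, fullgraph=True, mode="max-autotune")
```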

@marib00

marib00 commented Jun 18, 2024

@JohannesVod I did try "max-autotune" already and saw no change; it was compiling (i.e. autotuning) forever, though.
