PyTorch 2.3.1 Release, bug fix release

Released by @atalman on 05 Jun 19:16

This release fixes the following issues (regressions and silent correctness bugs):

torch.compile:

  • Remove the runtime dependency on JAX/XLA when importing torch._dynamo (#124634); see the sketch after this list
  • Hide the "Plan failed with a cudnnException" warning (#125790)
  • Fix a CUDA memory leak (#124238) (#120756)
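
For context, a minimal sketch of the torch.compile entry point these fixes touch; the toy function, shapes, and CPU tensors are illustrative assumptions, not taken from the linked PRs:

```python
import torch

def f(x, y):
    # A toy workload; compiling it imports torch._dynamo under the hood,
    # the import path addressed by the JAX/XLA dependency fix.
    return torch.nn.functional.relu(x @ y)

compiled_f = torch.compile(f)

x = torch.randn(128, 128)
y = torch.randn(128, 128)
out = compiled_f(x, y)  # the first call triggers compilation
```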

Distributed:

  • Fix the format_utils executable, which was running as a no-op (#123407)
  • Fix a regression in device_mesh initialization in 2.3.0 that caused memory spikes (#124780)
  • Fix a crash in FSDP + DTensor with ShardingStrategy.SHARD_GRAD_OP (#123617); see the sketch after this list
  • Fix a failure in distributed checkpointing + FSDP when no forward/backward pass has been run yet (#121544) (#127069)
  • Fix an error in distributed checkpointing + FSDP with use_orig_params=False and activation checkpointing (#124698) (#126935)
  • Fix set_model_state_dict errors on compiled modules with non-persistent buffers in distributed checkpointing (#125336) (#125337)
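
For reference, a minimal single-process sketch of the FSDP + ShardingStrategy.SHARD_GRAD_OP configuration named in the crash fix above; the gloo backend, world size of 1, port, and toy model are assumptions for illustration (real jobs would launch multiple ranks via torchrun, typically with nccl):

```python
import os

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

# Single-process process group purely for illustration.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

# Wrap a toy model with the sharding strategy referenced above.
model = torch.nn.Linear(16, 16)
fsdp_model = FSDP(model, sharding_strategy=ShardingStrategy.SHARD_GRAD_OP)

loss = fsdp_model(torch.randn(4, 16)).sum()
loss.backward()

dist.destroy_process_group()
```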

MPS:

  • Fix data corruption when copying large (>4GiB) tensors (#124635)
  • Fix Tensor.abs() for complex tensors (#125662); see the sketch after this list
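
For illustration, the complex abs() case fixed above looks like the following; the device guard is an assumption so the snippet also runs where the MPS backend is unavailable:

```python
import torch

device = "mps" if torch.backends.mps.is_available() else "cpu"

# abs() of a complex tensor returns the elementwise magnitude.
z = torch.tensor([3 + 4j, 1 - 1j], dtype=torch.complex64, device=device)
print(z.abs())  # tensor([5.0000, 1.4142])
```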

Other:

  • Fix the DeepSpeed transformer extension build on ROCm (#121030)
  • Fix a kernel crash on tensor.dtype.to_complex() after ~100 calls in an IPython kernel (#125154); see the sketch below
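
The dtype conversion involved in the last fix is sketched below; the loop count is an assumption mirroring the "~100 calls" in the report:

```python
import torch

# Repeated dtype.to_complex() calls crashed the interpreter before the fix.
for _ in range(200):
    assert torch.float32.to_complex() == torch.complex64

assert torch.float64.to_complex() == torch.complex128
```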

Release tracker #125425 contains all pull requests related to this release, along with links to related issues.