CI: 06/10/25 upstream sync #464

Closed
wants to merge 1,958 commits into from

Conversation

rocm-repo-management-api-2[bot]

Daily sync with upstream

chr1sj0nes and others added 30 commits May 27, 2025 17:53
PiperOrigin-RevId: 764148812
This helps with performance a bit (we only allocate and deallocate TMEM once in each
SM), and opens up the opportunity for better overlapping of the epilogue.

PiperOrigin-RevId: 764168230
Creating smaller build rules enforces better organized dependency graphs in the JAX project, helps pytype propagate annotations correctly, and leads to improved build and iteration times.

This required moving a couple `jax.numpy` imports into local functions. These could probably be addressed by moving the registrations elsewhere.
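The local-import pattern mentioned above can be sketched as follows. This is a minimal illustration, not the actual JAX change: the standard-library `statistics` module stands in for `jax.numpy`, and the function name `mean_of` is hypothetical.

```python
def mean_of(values):
    """Return the arithmetic mean of `values`."""
    # Deferred (local) import: the dependency is resolved on the first call
    # rather than when this module is imported, so this module no longer
    # needs it at import time. `statistics` is only a stand-in for
    # `jax.numpy` here.
    import statistics

    return statistics.mean(values)


result = mean_of([1, 2, 3])
```

The trade-off is that the import cost moves to the first call, which is presumably why the message suggests moving the registrations elsewhere as the longer-term fix.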

PiperOrigin-RevId: 764170653
… TMEM

Otherwise one block can begin the deallocation process before the other is
done using it.

PiperOrigin-RevId: 764173760
…d layouts

Any tile-aligned slicing is easy to handle.

PiperOrigin-RevId: 764189366
This allows us to prime the GMEM->SMEM pipeline for the next tile
while storing the SMEM->GMEM tile for the current one. However, this implies
that we can no longer share the same SMEM region for the MMA pipeline
and the epilogue, which pushes the SMEM pressure so high that we can't fetch
too many steps into the future. Overall the performance is slightly worse than
for the baseline kernel, but it recovers and improves upon it in the follow up.

PiperOrigin-RevId: 764220403
This reworks the previous scheme by transferring all of TMEM to registers at once,
and then doing RMEM->SMEM->GMEM in multiple phases, allowing us to use a smaller
SMEM buffer. This, in turn, lets us bump max_concurrent_steps for the MMA pipeline
which increases performance considerably.
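To make the SMEM trade-off concrete, here is a rough budget calculation. Every number in it is an illustrative assumption (the SMEM capacity, tile shape, 2-byte element type, and 64-column epilogue chunk), and the helper `max_concurrent_steps_for` is hypothetical; it is not taken from the kernel itself.

```python
def max_concurrent_steps_for(smem_bytes, epilogue_bytes, bytes_per_step):
    """How many MMA pipeline stages fit once the epilogue buffer is reserved."""
    available = smem_bytes - epilogue_bytes
    if available < bytes_per_step:
        raise ValueError("not enough SMEM for even one pipeline stage")
    return available // bytes_per_step


# All values below are assumptions chosen for illustration only.
SMEM_BYTES = 227 * 1024                 # assumed usable SMEM per block
TILE_M, TILE_N, TILE_K = 128, 256, 64   # assumed tile shape
ELEM_BYTES = 2                          # assumed 16-bit elements

# Each pipeline stage holds one LHS tile and one RHS tile.
bytes_per_step = (TILE_M * TILE_K + TILE_K * TILE_N) * ELEM_BYTES

# Epilogue buffer for the whole output tile vs. a 64-column chunk of it.
full_epilogue = TILE_M * TILE_N * ELEM_BYTES
chunked_epilogue = TILE_M * 64 * ELEM_BYTES

steps_full = max_concurrent_steps_for(SMEM_BYTES, full_epilogue, bytes_per_step)
steps_chunked = max_concurrent_steps_for(SMEM_BYTES, chunked_epilogue, bytes_per_step)
print(steps_full, steps_chunked)
```

Under these assumptions the smaller, multi-phase epilogue buffer frees enough SMEM for one additional pipeline stage.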

The only downside of this scheme is that even though it should be technically feasible
to perform the epilogue with 255 registers per thread, ptxas generates a number of spills
that might be lowering our performance. Either way, it's still better than the previous
alternatives.

PiperOrigin-RevId: 764249234
This replaces the old scheme, which still included a bit of a bubble at the
end of each tile, with a new scheme that should be entirely bubble-free as
long as the MMA loop is long enough to hide the store latency (i.e. for big
enough K dimensions). This also removes the problems with spills we had in the
previous version, since the register footprint is relatively small now.
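The "long enough MMA loop" condition can be read as a simple inequality: the store of the previous tile is hidden if the next tile's MMA loop takes at least as long. The sketch below is a deliberately simplified model with made-up time units; the function `store_is_hidden` and all of its parameters are hypothetical and do not come from the kernel.

```python
import math


def store_is_hidden(k_dim, tile_k, mma_step_time, store_time):
    """Simplified model: is the tile store fully overlapped by the MMA loop?

    The MMA loop runs ceil(k_dim / tile_k) steps, each assumed to take
    `mma_step_time`; the store is assumed to take `store_time` in the same
    (arbitrary) units.
    """
    steps = math.ceil(k_dim / tile_k)
    return steps * mma_step_time >= store_time


# Assumed costs: 1 unit per MMA step, 32 units per tile store.
large_k = store_is_hidden(k_dim=4096, tile_k=64, mma_step_time=1.0, store_time=32.0)
small_k = store_is_hidden(k_dim=512, tile_k=64, mma_step_time=1.0, store_time=32.0)
print(large_k, small_k)
```

With these assumed costs, the large K dimension hides the store while the small one leaves a bubble.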

PiperOrigin-RevId: 764256446
…mosaic:GPU can get access to the device ids in the mesh

PiperOrigin-RevId: 764263324
XLA dumps one more HLO file by default, which leads to one more PGLE profile
file.

PiperOrigin-RevId: 764274080
All Triton-specific APIs are always used qualified, e.g. `plgpu.TritonCompilerParams`,
so the prefix is redundant.

PiperOrigin-RevId: 764276165
Resolve an issue where `jax.devices()` hangs due to an unwanted TPU
metadata query when using LibTPU with a device other than a TPU (e.g.
CPUs).
This feature can be useful in cross [AOT](https://docs.jax.dev/en/latest/aot.html).
This strips away the redundant terms in job names to keep them shorter and easier to read. Actions displays the names of jobs that reuse workflows in the following format: `caller workflow name / called workflow name`. The changes here are made in the called workflow names, as changing the caller workflow names seems to make the summary page hard to parse (see https://github.com/jax-ml/jax/actions/runs/15217612585).

Here's what the continuous workflow's summary page looks like with this change: https://github.com/jax-ml/jax/actions/runs/15286609214/job/42998511666

PiperOrigin-RevId: 764390866
PiperOrigin-RevId: 764419062
…uild container to us-docker.pkg.dev/ml-oss-artifacts-published/ml-public-container/ml-build.

These containers are the same (same build script); they just live in different repositories.

PiperOrigin-RevId: 764435895
Updates LLVM usage to match
[2b8bff6f66fd](llvm/llvm-project@2b8bff6f66fd)

PiperOrigin-RevId: 764439621
carlosgmartin and others added 26 commits June 8, 2025 19:01
…ve_scan_reverse_argument_order

PiperOrigin-RevId: 769139664
Creating smaller build rules enforces better organized dependency graphs in the JAX project, helps pytype propagate annotations correctly, and leads to improved build and iteration times.

This required a few local imports and refactors.

PiperOrigin-RevId: 769184594
PiperOrigin-RevId: 769190882
Creating smaller build rules enforces better organized dependency graphs in the JAX project, helps pytype propagate annotations correctly, prevents use of internal APIs, and leads to improved build and iteration times.

PiperOrigin-RevId: 769236580
Creating smaller build rules enforces better organized dependency graphs in the JAX project, helps pytype propagate annotations correctly, prevents use of internal APIs, and leads to improved build and iteration times.

PiperOrigin-RevId: 769249808
'exectuable' should be 'executable'.

PiperOrigin-RevId: 769256903
… to result in a ptxas miscompilation (between 12.8.0 and 12.9.1).

PiperOrigin-RevId: 769257583
Creating smaller build rules enforces better organized dependency graphs in the JAX project, helps pytype propagate annotations correctly, prevents use of internal APIs, and leads to improved build and iteration times.

PiperOrigin-RevId: 769264747
Creating smaller build rules enforces better organized dependency graphs in the JAX project, helps pytype propagate annotations correctly, prevents use of internal APIs, and leads to improved build and iteration times.

PiperOrigin-RevId: 769280698
Creating smaller build rules enforces better organized dependency graphs in the JAX project, helps pytype propagate annotations correctly, prevents use of internal APIs, and leads to improved build and iteration times.

PiperOrigin-RevId: 769320414
Creating smaller build rules enforces better organized dependency graphs in the JAX project, helps pytype propagate annotations correctly, prevents use of internal APIs, and leads to improved build and iteration times.

PiperOrigin-RevId: 769356578
…Assignment values.

PiperOrigin-RevId: 769417940
It is probably a more useful default behavior not to implicitly inline everything.

PiperOrigin-RevId: 769452443
@rocm-repo-management-api-2 rocm-repo-management-api-2 bot requested a review from a team as a code owner June 10, 2025 06:02
@rocm-repo-management-api-2 rocm-repo-management-api-2 bot enabled auto-merge (rebase) June 10, 2025 06:02
auto-merge was automatically disabled June 19, 2025 15:59

Pull request was closed