
Add recipe: progressive optimization of matrix multiplication in custom ops #13

Merged

Conversation

BradLarson (Collaborator):
Adds the next recipe in the custom ops series: a progressive optimization of matrix multiplication in MAX. The example is once again drawn from the custom ops examples in the MAX repo.
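The recipe itself is written in Mojo, but the progression its kernel names imply (a naive triple loop, then cache-friendly tiling, then vectorization) can be sketched language-agnostically. A rough CPU analogue in plain Python; the function names and tile size here are hypothetical illustrations, not code from the recipe:

```python
def matmul_naive(A, B):
    # Baseline: one output element at a time, walking the full k dimension.
    M, K, N = len(A), len(A[0]), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for k in range(K):
                acc += A[i][k] * B[k][j]
            C[i][j] = acc
    return C

def matmul_tiled(A, B, tile=2):
    # Block-tiled: process small tiles so each loaded tile of A and B is
    # reused while it is still hot in cache (shared memory, on a GPU).
    M, K, N = len(A), len(A[0]), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i0 in range(0, M, tile):
        for k0 in range(0, K, tile):
            for j0 in range(0, N, tile):
                for i in range(i0, min(i0 + tile, M)):
                    for k in range(k0, min(k0 + tile, K)):
                        a = A[i][k]  # hoisted: reused across the j loop
                        for j in range(j0, min(j0 + tile, N)):
                            C[i][j] += a * B[k][j]
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
assert matmul_naive(A, B) == [[19.0, 22.0], [43.0, 50.0]]
assert matmul_tiled(A, B) == matmul_naive(A, B)
```

On a GPU the same restructuring pays off far more, since tiles map onto shared memory and registers; the recipe's later stages (register tiling, vectorized loads) push further in that direction.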

@ehsanmok (Collaborator) left a comment:

Awesome! Just a small thing.

```toml
"https://conda.modular.com/max",
"https://repo.prefix.dev/modular-community",
]
platforms = ["linux-64", "osx-arm64", "linux-aarch64"]
```
ehsanmok (Collaborator):
osx-arm64 needs to be removed.

BradLarson (Collaborator, author):

You'd pointed out the same thing on the last one, but why should we remove that platform support? These examples do build and run on macOS; they just fall back to the CPU path with the naive implementations. I know they're aimed at Linux, where you'll have GPU support, but they can at least be tried locally on macOS.

ehsanmok (Collaborator):

It goes against the GPU story. It's fine if we want to show it on CPU.

```mojo
bench_matmul_kernel["tiled_register"]()
bench_matmul_kernel["block_tiled"]()
bench_matmul_kernel["block_tiled_vectorized"]()
_ = gpu_ctx
```
ehsanmok (Collaborator):

Are these still needed at the end? The origin was recently added to the layout tensor, and I'm not sure about the rest.

BradLarson (Collaborator, author):

Unfortunately, I think they're still needed. We're only partway through transitioning these types, and when I tried removing the manual lifetime extensions on the latest nightly, GPU execution still crashed with CUDA illegal-access errors.

@BradLarson (Collaborator, author):

As discussed for the other custom ops recipes, I'll merge this to get an initial version into the repository, making it easier for others to improve the wording and code.

@BradLarson BradLarson merged commit a4fc268 into modular:main Mar 4, 2025
1 check passed
@BradLarson BradLarson deleted the custom-ops-matrix-multiplication branch March 6, 2025 15:09