Add recipe: progressive optimization of matrix multiplication in custom ops #13
Conversation
Awesome! Just one small thing:
    "https://conda.modular.com/max",
    "https://repo.prefix.dev/modular-community",
]
platforms = ["linux-64", "osx-arm64", "linux-aarch64"]
osx-arm64 needs to be removed.
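If the reviewer's suggestion were applied, the platform list would drop the macOS entry. A sketch of the resulting fragment, assuming this lives in the recipe's pixi configuration (the surrounding table name is not shown in the diff):

```toml
channels = [
    "https://conda.modular.com/max",
    "https://repo.prefix.dev/modular-community",
]
platforms = ["linux-64", "linux-aarch64"]
```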
You'd pointed out the same thing on the last one, but why should we remove that platform support? These examples do build and run on macOS, but they fall back on the CPU path for naive implementations. I know they're aimed at Linux where you'll have GPU support, but they can at least be looked at locally on macOS.
It goes against the GPU story. Fine if we want to show on CPU.
    bench_matmul_kernel["tiled_register"]()
    bench_matmul_kernel["block_tiled"]()
    bench_matmul_kernel["block_tiled_vectorized"]()
    _ = gpu_ctx
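The calls above dispatch one benchmark per optimization variant via a compile-time string parameter. A rough Python analogy of the same name-keyed dispatch pattern, with a hypothetical naive variant standing in for the real kernels:

```python
import time

# Hypothetical stand-in for one matmul variant; each name in VARIANTS mirrors
# one of the parameterized bench_matmul_kernel[...]() calls above.
def matmul_naive(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

VARIANTS = {"naive": matmul_naive}

def bench(name, size=32):
    """Time one variant on a small all-ones x all-twos problem."""
    a = [[1.0] * size for _ in range(size)]
    b = [[2.0] * size for _ in range(size)]
    start = time.perf_counter()
    result = VARIANTS[name](a, b)
    elapsed = time.perf_counter() - start
    return result, elapsed

result, elapsed = bench("naive")
# Each output entry is a dot product of 32 ones with 32 twos.
assert result[0][0] == 64.0
```

In Mojo the variant name is resolved at compile time, so each specialization is a separate compiled kernel rather than a runtime dictionary lookup.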
Are these needed at the end? The origin was recently added to the layout tensor, and I'm not sure about the rest.
Unfortunately, I think they're still needed. We're only in the middle of transitioning these types, and when I tried removing the manual lifetime extensions on the latest nightly this still crashes on GPU execution with CUDA illegal access errors.
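The failure mode being guarded against is a value being destroyed while an asynchronous GPU kernel still reads its buffers. A minimal Python analogy (using CPython's deterministic refcounting and a hypothetical `GpuContext` stand-in) of why a trailing `_ = gpu_ctx` keeps the value alive:

```python
import weakref

class GpuContext:
    """Stand-in for a device context whose buffers must outlive kernel launches."""
    pass

def launch(ctx):
    # In a real runtime the kernel runs asynchronously; if ctx is freed before
    # the kernel finishes, the device reads freed memory (an illegal access).
    return weakref.ref(ctx)

ctx = GpuContext()
ref = launch(ctx)
# Holding a reference past the launch (the role of `_ = gpu_ctx`) keeps it valid:
assert ref() is not None
del ctx  # dropping the last reference destroys the context immediately
assert ref() is None
```

Mojo destroys values eagerly at their last use, so without the explicit `_ = gpu_ctx` the context could be destroyed before the enqueued kernels complete.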
As discussed on the other custom ops recipes, I'll merge this to get an initial version into the repository, making it easier for others to improve the wording and code.
Adds the next recipe in the custom ops series: a progressive optimization of matrix multiplication in MAX custom ops. The example is once again drawn from the custom ops examples in the MAX repository.