fixing max_vect_factor initialization #4076
base: main
Conversation
!test --pybench
Review updated until commit 3896fc7
  constexpr int64_t kSixteen = 16; // clang tidy
  auto max_vect_factor = ceilDiv(
      // Available vectorization based on size of data type
-     (int64_t)kSixteen / max_input_dtype_size,
+     (int64_t)kSixteen / max_dtype_size_for_vectorization,
I think I should use this only for max_vect_factor? I see max_input_dtype_size has also been used for unroll, so I should leave that as-is and not replace it with max_dtype_size_for_vectorization.
cc'ing @liqiangxl
I looked at some parts of the code, such as getEmpiricalUnrollFactor. It seems it shouldn't use the max size; it should use the size of each individual vectorized input.
!test
!test
@@ -295,10 +295,17 @@ std::unique_ptr<PointwiseParams> getPointwiseHeuristics(
           largest_out, true, true));
     });

+  int64_t max_dtype_size_for_vectorization = 1;
+  for (auto inp : vectorizable_inputs_outputs_entry.get()) {
This seems to be an obvious mistake. The transpose scheduler looks fine, but the reduction scheduler does not. Can you fix that as well?
This fixes the performance issue observed in embedding fwd, where an int64_t index input that didn't participate in vectorization ended up capping our max_vect_factor at 2. This restores the performance mentioned in #4048.
Embedding forward nvfuser design doc.