x64: matmul: amx blocking heuristics #2855

Open
wants to merge 7 commits into
base: main

yair-obodovsky
Contributor

New blocking heuristics for AMX matrix multiplication: they add a new k-loop, use blocked C, and use larger N dimensions in BRGEMM with LDB2.

@yair-obodovsky yair-obodovsky requested a review from a team as a code owner March 11, 2025 14:14
@github-actions github-actions bot added the platform:cpu-x64 Intel64/AMD64 processors. Codeowner: @oneapi-src/onednn-cpu-x64 label Mar 11, 2025
@yair-obodovsky
Contributor Author

make test

@yair-obodovsky yair-obodovsky force-pushed the yobodovs/amx_blocking_heuristics branch 3 times, most recently from af88559 to 0bf3e0d Compare March 11, 2025 14:27
@yair-obodovsky
Contributor Author

make test

@yair-obodovsky yair-obodovsky force-pushed the yobodovs/amx_blocking_heuristics branch 2 times, most recently from 241add3 to 33fd053 Compare March 12, 2025 13:43
@yair-obodovsky yair-obodovsky changed the title from "x64: matmul: amx blocking heuristics" to "[WIP] x64: matmul: amx blocking heuristics" Mar 12, 2025

float get_bw(int x) { return linear_interpolation(multicore_bw, x); }

float linear_interpolation(const std::map<int, float> &points, float x) {

Contributor

make it private?

Contributor Author

fixed
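
For readers following this thread, here is a minimal sketch of what a piecewise-linear lookup over a `std::map<int, float>` could look like. It is not the PR's implementation: the name `linear_interpolation_sketch` is hypothetical, and clamping to the first and last sample outside the sampled range is an assumption.

```cpp
#include <cassert>
#include <iterator>
#include <map>

// Hypothetical stand-in for the helper under review: piecewise-linear
// interpolation over sorted (x, y) samples. Clamping outside the sampled
// range is an assumption, not something the PR confirms.
inline float linear_interpolation_sketch(
        const std::map<int, float> &points, float x) {
    assert(!points.empty());
    // Find the first sample whose key is >= x.
    auto hi = points.begin();
    while (hi != points.end() && static_cast<float>(hi->first) < x)
        ++hi;
    if (hi == points.begin()) return hi->second; // at or below first sample
    if (hi == points.end()) return std::prev(hi)->second; // above last sample
    const auto lo = std::prev(hi);
    const float x0 = static_cast<float>(lo->first);
    const float x1 = static_cast<float>(hi->first);
    const float t = (x - x0) / (x1 - x0); // position between the two samples
    return lo->second + t * (hi->second - lo->second);
}
```

With samples `{{1, 10.0f}, {3, 30.0f}}`, a query at `x = 2` lands halfway between the two samples. Keeping such a helper private, as suggested above, leaves `get_bw` as the only public entry point.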

const int m_blk_local = m_blk_idx % get_M_chunk_size();

int n_blk_local;
int m_blk_local;

Contributor

Initialize these to 0 and replace the if-else below with a single `if` that is entered only when the current else condition is met.

Contributor Author

fixed
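
The suggestion in this thread can be sketched roughly as follows. The surrounding if-else is not visible in the quoted diff, so the condition `use_chunk_local_idx`, the struct `local_blk_t`, and the function `get_local_blk` are all hypothetical placeholders. Only the pattern (default-initialize, then a single branch) reflects the review comment.

```cpp
// Hypothetical container for the two local block indices.
struct local_blk_t {
    int n_blk_local;
    int m_blk_local;
};

// `use_chunk_local_idx` stands in for whatever condition the PR's else
// branch tests; the chunk sizes are passed in rather than read from the
// blocking object to keep the sketch self-contained.
inline local_blk_t get_local_blk(bool use_chunk_local_idx, int m_blk_idx,
        int n_blk_idx, int m_chunk_size, int n_chunk_size) {
    // Default to 0 so no code path can read an uninitialized value.
    int n_blk_local = 0;
    int m_blk_local = 0;
    if (use_chunk_local_idx) {
        m_blk_local = m_blk_idx % m_chunk_size;
        n_blk_local = n_blk_idx % n_chunk_size;
    }
    return {n_blk_local, m_blk_local};
}
```

Compared with declaring both variables uninitialized and assigning them in each branch, this form has one branch fewer and cannot leave either index unset.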

return status::unimplemented;
}
}
return status::success;

Contributor

it's redundant I guess

Contributor Author

fixed

}
}

// printf("winner: m_div=%d, k_div=%d, n_div=%d, b_div=%d, score=%f\n", best_blocking.nthr_m_, best_blocking.nthr_k_, best_blocking.nthr_n_,best_blocking.nthr_b_, best_blocking.efficiency_score_);

Contributor

not needed?

Contributor Author

fixed

//don't do reduction if c tmp doesn't fit
//also parallel reduction is not supported for large batch. This conforms with the assert in brgemm_matmul.cpp:
//assert(IMPLICATION(parallel_reduction_is_used(),
// bgmmc.batch == 1 && !calculate_compensations_in_copy_routines));

Contributor

is it needed?

Contributor Author

fixed
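
As background for the quoted assert, here is a small sketch of the implication check and of a gate that would keep a heuristic inside it. The function names and parameters are hypothetical. Treating "parallel reduction is used" as `nthr_k > 1`, and `c_tmp_fits` as the "C tmp fits" condition from the comment, are assumptions rather than the PR's actual logic.

```cpp
#include <cassert>

// Logical implication: false only when `cause` holds and `effect` does not.
constexpr bool implication(bool cause, bool effect) {
    return !cause || effect;
}

// Hypothetical gate: allow splitting K across threads only when the
// constraints named in the quoted comment hold.
constexpr bool can_use_parallel_k_reduction(int nthr_k, long batch,
        bool c_tmp_fits, bool compensations_in_copy) {
    return nthr_k > 1 && batch == 1 && c_tmp_fits && !compensations_in_copy;
}

// Mirrors the shape of the assert quoted from brgemm_matmul.cpp.
inline void check_reduction_invariant(bool parallel_reduction_used,
        long batch, bool compensations_in_copy) {
    assert(implication(
            parallel_reduction_used, batch == 1 && !compensations_in_copy));
}
```

If the heuristic only selects `nthr_k > 1` when such a gate returns true, the downstream assert cannot fire for that configuration.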

size_t B_chunk_sz = b_dt_sz * k_chunk_elems_ * n_chunk_elems_;
size_t B_buf_sz = use_buffer_b ? tr_b_dt_sz * n_blk_ * k_chunk_elems_ : 0;
size_t C_chunk_sz = c_dt_sz * m_chunk_elems_ * n_chunk_elems_;
size_t C_buf_sz

Contributor

all these variables can be const

Contributor Author

fixed

* nthr_mnb_;
float k_parallel_score = 1.0f;
if (nthr_k_ > 1) {
dim_t num_K_chunks = div_up(K, k_chunk_elems_);

Contributor

can be const, same for M, N

Contributor Author

fixed

dim_t largest_k_tiles = largest_k / this->k_tmul;
dim_t k_tiles = div_up(K, this->k_tmul);
dim_t k_per_thread_tiles = div_up(k_tiles, nthr_k_);
dim_t num_K_blocks = div_up(k_per_thread_tiles, largest_k_tiles);

Contributor

all variables can be const

Contributor Author

fixed
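
To make the arithmetic in the quoted lines concrete, here is a self-contained sketch with every intermediate declared `const`, as requested in the review. The `dim_t` alias, the local `div_up`, and the wrapper `num_k_blocks_sketch` are assumptions for illustration; in the PR these values come from the blocking object.

```cpp
#include <cassert>
#include <cstdint>

using dim_t = int64_t; // assumed to match the library's dimension type

// Ceiling division for positive integers.
constexpr dim_t div_up(dim_t a, dim_t b) {
    return (a + b - 1) / b;
}

// Number of K blocks each thread iterates over, given the tile depth
// `k_tmul`, the largest K extent allowed per block, and the K-thread count.
inline dim_t num_k_blocks_sketch(
        dim_t K, dim_t k_tmul, dim_t largest_k, int nthr_k) {
    assert(K > 0 && k_tmul > 0 && nthr_k > 0 && largest_k >= k_tmul);
    const dim_t largest_k_tiles = largest_k / k_tmul;
    const dim_t k_tiles = div_up(K, k_tmul);
    const dim_t k_per_thread_tiles = div_up(k_tiles, nthr_k);
    return div_up(k_per_thread_tiles, largest_k_tiles);
}
```

For example, with `K = 4096`, `k_tmul = 32`, `largest_k = 1024`, and `nthr_k = 2`, there are 128 tiles in total, 64 per thread, and at most 32 tiles per block, so each thread processes 2 K blocks.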

@yair-obodovsky yair-obodovsky force-pushed the yobodovs/amx_blocking_heuristics branch from 33fd053 to 527e4fc Compare March 13, 2025 10:16
@yair-obodovsky yair-obodovsky requested a review from a team as a code owner March 13, 2025 10:16
Labels
platform:cpu-x64 Intel64/AMD64 processors. Codeowner: @oneapi-src/onednn-cpu-x64

2 participants