x64: matmul: amx blocking heuristics #2855

Open

yair-obodovsky wants to merge 7 commits into main from yobodovs/amx_blocking_heuristics

Contributor

yair-obodovsky commented Mar 11, 2025

New blocking heuristics for AMX matrix multiplication: a new K loop wrapping BRGEMM, a blocked C buffer layout, and larger N dimensions in BRGEMM via LDB2.


          cpu: x64: brgemm: extend LDB2,LDC2 support

bc2fccf

yair-obodovsky requested a review from a team as a code owner

March 11, 2025 14:14

github-actions bot added the platform:cpu-x64 label

Contributor Author

yair-obodovsky commented Mar 11, 2025

make test

yair-obodovsky force-pushed the yobodovs/amx_blocking_heuristics branch 3 times, most recently from af88559 to 0bf3e0d Compare

March 11, 2025 14:27

Contributor Author

yair-obodovsky commented Mar 11, 2025

make test

yair-obodovsky force-pushed the yobodovs/amx_blocking_heuristics branch 2 times, most recently from 241add3 to 33fd053 Compare

March 12, 2025 13:43

yair-obodovsky changed the title ~~x64: matmul: amx blocking heuristics [WIP]~~ x64: matmul: amx blocking heuristics

tczeszun reviewed

src/cpu/x64/matmul/amx_blocking_heuristics.hpp


		float get_bw(int x) { return linear_interpolation(multicore_bw, x); }

		float linear_interpolation(const std::map<int, float> &points, float x) {

Contributor

tczeszun Mar 12, 2025

make it private?

Contributor Author

yair-obodovsky Mar 13, 2025

fixed
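For readers following this thread, here is a minimal sketch of what the requested change could look like: `get_bw` stays public while the interpolation helper becomes a private implementation detail. The class name `bandwidth_model_t`, the constructor, and the clamping behavior outside the sampled range are assumptions for illustration; only `get_bw`, `linear_interpolation`, and `multicore_bw` appear in the PR.

```cpp
#include <cassert>
#include <iterator>
#include <map>
#include <utility>

// Hypothetical wrapper class; the real class in the PR is not shown here.
class bandwidth_model_t {
public:
    explicit bandwidth_model_t(std::map<int, float> samples)
        : multicore_bw_(std::move(samples)) {}

    // Public entry point: bandwidth estimate for x (e.g. a thread count).
    float get_bw(int x) const {
        return linear_interpolation(multicore_bw_, static_cast<float>(x));
    }

private:
    // Piecewise-linear interpolation over sorted sample points.
    // Assumption: values outside the sampled range clamp to the end points.
    static float linear_interpolation(
            const std::map<int, float> &points, float x) {
        assert(!points.empty());
        // Find the first sample whose key is >= x.
        auto hi = points.begin();
        while (hi != points.end() && static_cast<float>(hi->first) < x)
            ++hi;
        if (hi == points.begin()) return hi->second;
        if (hi == points.end()) return std::prev(hi)->second;
        const auto lo = std::prev(hi);
        const float x0 = static_cast<float>(lo->first);
        const float x1 = static_cast<float>(hi->first);
        const float t = (x - x0) / (x1 - x0);
        return lo->second + t * (hi->second - lo->second);
    }

    std::map<int, float> multicore_bw_;
};
```

With samples at 1 and 3, a query at 2 lands halfway between the two sampled values, and callers can no longer reach the helper directly.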

tczeszun reviewed

src/cpu/x64/matmul/brgemm_matmul.cpp Outdated

-                      const int m_blk_local = m_blk_idx % get_M_chunk_size();
+                      int n_blk_local;
+                      int m_blk_local;

Contributor

tczeszun Mar 12, 2025

Initialize both to 0, then change the if-else below to a single if that is entered only when the current else condition is met.

Contributor Author

yair-obodovsky Mar 13, 2025

fixed

tczeszun reviewed

src/cpu/x64/matmul/brgemm_matmul.cpp Outdated

+                          return status::unimplemented;
+                      }
+                  }
+                  return status::success;

Contributor

tczeszun Mar 12, 2025

it's redundant I guess

Contributor Author

yair-obodovsky Mar 13, 2025

fixed

tczeszun reviewed

src/cpu/x64/matmul/amx_blocking_heuristics.cpp Outdated

+                      }
+                  }
+                  //    printf("winner: m_div=%d, k_div=%d, n_div=%d, b_div=%d,  score=%f\n", best_blocking.nthr_m_, best_blocking.nthr_k_, best_blocking.nthr_n_,best_blocking.nthr_b_, best_blocking.efficiency_score_);

Contributor

tczeszun Mar 12, 2025

not needed?

Contributor Author

yair-obodovsky Mar 13, 2025

fixed

tczeszun reviewed

src/cpu/x64/matmul/amx_blocking_heuristics.cpp Outdated

+                          //don't do reduction if c tmp doesn't fit
+                          //also parallel reduction is not supported for large batch. This conforms with the assert in brgemm_matmul.cpp:
+                          //assert(IMPLICATION(parallel_reduction_is_used(),
+                          //  bgmmc.batch == 1 && !calculate_compensations_in_copy_routines));

Contributor

tczeszun Mar 12, 2025

is it needed?

Contributor Author

yair-obodovsky Mar 13, 2025

fixed
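The constraint described in the quoted comments can be sketched as a small predicate. This is an illustration only: the function name, the parameter list, and the `budget_bytes` threshold for the temporary C buffer are assumptions, and `implication` is a local stand-in with the same meaning as the `IMPLICATION` helper referenced in the assert.

```cpp
#include <cassert>
#include <cstddef>

// "cause implies effect": false only when cause holds and effect does not.
constexpr bool implication(bool cause, bool effect) {
    return !cause || effect;
}

// Hypothetical guard: allow a parallel K reduction only when
//  - K is actually split across threads (nthr_k > 1),
//  - the temporary C buffer fits an assumed byte budget,
//  - batch == 1 and compensations are not computed in the copy routines.
inline bool allow_parallel_k_reduction(int nthr_k, long batch,
        std::size_t c_tmp_bytes, std::size_t budget_bytes,
        bool compensations_in_copy_routines) {
    const bool allowed = nthr_k > 1 && c_tmp_bytes <= budget_bytes
            && batch == 1 && !compensations_in_copy_routines;
    // Mirrors the shape of the assert quoted from brgemm_matmul.cpp.
    assert(implication(
            allowed, batch == 1 && !compensations_in_copy_routines));
    return allowed;
}
```

Keeping the check in one predicate means the heuristic never proposes a blocking that the execution path would later reject.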

tczeszun reviewed

src/cpu/x64/matmul/amx_blocking_heuristics.cpp Outdated

+                  size_t B_chunk_sz = b_dt_sz * k_chunk_elems_ * n_chunk_elems_;
+                  size_t B_buf_sz = use_buffer_b ? tr_b_dt_sz * n_blk_ * k_chunk_elems_ : 0;
+                  size_t C_chunk_sz = c_dt_sz * m_chunk_elems_ * n_chunk_elems_;
+                  size_t C_buf_sz

Contributor

tczeszun Mar 12, 2025

all these variables can be const

Contributor Author

yair-obodovsky Mar 13, 2025

fixed
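A rough sketch of the const-qualified form the reviewer asks for, using the three size expressions that are fully visible in the excerpt. The structs `chunk_params_t` and `chunk_footprint_t` and the function name are hypothetical scaffolding so the snippet compiles on its own; `C_buf_sz` is left out because its expression is cut off in the diff.

```cpp
#include <cstddef>

// Hypothetical parameter bundle standing in for the class members
// used in the excerpt (sizes are in bytes, extents in elements).
struct chunk_params_t {
    std::size_t b_dt_sz, tr_b_dt_sz, c_dt_sz;
    std::size_t m_chunk_elems, n_chunk_elems, k_chunk_elems, n_blk;
    bool use_buffer_b;
};

struct chunk_footprint_t {
    std::size_t B_chunk_sz, B_buf_sz, C_chunk_sz;
};

inline chunk_footprint_t compute_footprint(const chunk_params_t &p) {
    // Each value is computed once and never modified, so it can be const.
    const std::size_t B_chunk_sz
            = p.b_dt_sz * p.k_chunk_elems * p.n_chunk_elems;
    const std::size_t B_buf_sz = p.use_buffer_b
            ? p.tr_b_dt_sz * p.n_blk * p.k_chunk_elems
            : 0;
    const std::size_t C_chunk_sz
            = p.c_dt_sz * p.m_chunk_elems * p.n_chunk_elems;
    return {B_chunk_sz, B_buf_sz, C_chunk_sz};
}
```

For example, 2-byte B elements with a 256 x 128 (K x N) chunk give a 65536-byte B chunk, and a 64 x 128 chunk of 4-byte C elements gives 32768 bytes.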

tczeszun reviewed

src/cpu/x64/matmul/amx_blocking_heuristics.cpp Outdated

+                          * nthr_mnb_;
+                  float k_parallel_score = 1.0f;
+                  if (nthr_k_ > 1) {
+                      dim_t num_K_chunks = div_up(K, k_chunk_elems_);

Contributor

tczeszun Mar 12, 2025

can be const, same for M, N

Contributor Author

yair-obodovsky Mar 13, 2025

fixed

tczeszun reviewed

src/cpu/x64/matmul/amx_blocking_heuristics.cpp Outdated

+                  dim_t largest_k_tiles = largest_k / this->k_tmul;
+                  dim_t k_tiles = div_up(K, this->k_tmul);
+                  dim_t k_per_thread_tiles = div_up(k_tiles, nthr_k_);
+                  dim_t num_K_blocks = div_up(k_per_thread_tiles, largest_k_tiles);

Contributor

tczeszun Mar 12, 2025

all variables can be const

Contributor Author

yair-obodovsky Mar 13, 2025

fixed
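The tile arithmetic in this excerpt can be followed with a small standalone sketch. The `dim_t` alias and `div_up` below are local stand-ins for the library's own definitions, and the free function `num_k_blocks` with its parameter list is hypothetical; the four expressions themselves follow the excerpt, with `const` applied as requested.

```cpp
#include <cassert>
#include <cstdint>

using dim_t = std::int64_t; // stand-in for the library's dim_t

// Ceiling division for positive divisors.
inline dim_t div_up(dim_t a, dim_t b) {
    assert(b > 0);
    return (a + b - 1) / b;
}

// Number of K blocks each thread iterates over, counted in units of
// k_tmul-sized tiles. largest_k is the largest K extent one block may cover.
inline dim_t num_k_blocks(dim_t K, dim_t k_tmul, dim_t largest_k, int nthr_k) {
    assert(k_tmul > 0 && nthr_k > 0 && largest_k >= k_tmul);
    const dim_t largest_k_tiles = largest_k / k_tmul;
    const dim_t k_tiles = div_up(K, k_tmul);
    const dim_t k_per_thread_tiles = div_up(k_tiles, nthr_k);
    const dim_t num_K_blocks = div_up(k_per_thread_tiles, largest_k_tiles);
    return num_K_blocks;
}
```

As a worked case, K = 4096 with k_tmul = 32 gives 128 tiles; split over 2 threads that is 64 tiles per thread, and with at most 32 tiles per block (largest_k = 1024) each thread runs 2 K blocks.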

yair-obodovsky added 6 commits

March 13, 2025 11:58


          cpu: x64: brgemm_matmul: Added additional K loop wrapping BRGEMM

39e3326


          cpu: x64: brgemm_matmul_copy_utils: Added LDB2 support for B relayout

9ca5cc4


          cpu: x64: brgemm_matmul: c buffer layout optimization (blocked)

faf2150


          cpu: x64: brgemm_matmul: Added nthr_m and nthr_n and nthr_b

df79dcd


          cpu: x64: brgemm_matmul: new AMX blocking heuristics

52ae2f1


          cpu: x64: brgemm: prefetchw removed from c buffer

527e4fc

yair-obodovsky force-pushed the yobodovs/amx_blocking_heuristics branch from 33fd053 to 527e4fc Compare

March 13, 2025 10:16

yair-obodovsky requested a review from a team as a code owner

March 13, 2025 10:16

Labels

platform:cpu-x64