Description
Context: cudaErrorMisalignedAddress occurred while running nvFuser matmul benchmarks on problem sizes from matmul_problems.csv, using matmul parameters taken from the NvJet CUDA kernel name.
Branch: https://github.com/NVIDIA/Fuser/tree/hopper_matmul_heuristics
Error message:
CUDA error: cudaErrorMisalignedAddress failed with error misaligned address
Exception raised from time at /opt/pytorch/nvfuser/csrc/fusion_profiler.cpp:228
Layout NN -- (m = 5320, n = 33928, k = 3464, layout = 'NN')
To Reproduce:
Update doc/dev/python_scheduling/profile_matmul with the following matmul parameters:
# These are the parameters we'll optimize
parameter_configurations = {
    "tile_sizes": [
        MatMulTileOptions(GemmTile(192, 208, 64), GemmTile(192, 104, 64))
    ],
    "mma_macro": [MmaMacroEncode(MmaMacroArch.hopper, 64, 104, 16)],
    "tile_order": [MatmulTileRasterizationOrder.column_major],
    "cluster_dims": [ClusterDims(1, 1, 1)],
    "circular_buffer_stages": [4],
}
Then, run the following command:
NVFUSER_ENABLE=fuse_matmul NVFUSER_DISABLE=matmul_expr_eval python profile_matmul.py 5320 33928 3464 NN --verbose
Details
===== Matmul Parameters ========
MMA macro: Hopper_64_104_16
CircularBufferOptions:
circular_buffer_smem_write: true
circular_buffer_smem_read: false
smem_circular_buffer_stage: 4
smem_circular_buffer_prefetch_gap: 1
SupportedVectorization:
a: 8
b: 8
epilogue: 8
MatMulTileOptions: warp tile [192, 104, 64], CTA tile [192, 208, 64]
Async global mem load: true
Indexing mode: int32_t
Tile rasterization order: column-major
Grid swizzle factor: 1
Tiling strategy: OneTilePerCTA
Buffering loop level: CTATiles
Circular buffering strategy: WarpSpecialized
__cluster_dims__(1, 1, 1)
Use shared memory epilogue: 1
Promote re-use of prologue shared memory: 1
Split-K factor: 1
====================================
Layout TN -- (m = 1304, n = 4936, k = 688, layout = 'TN')
Layout TT -- (m = 272, n = 8952, k = 360, layout = 'TT')
The (1304, 4936, 688, 'TN') and (272, 8952, 360, 'TT') problems share the same matmul parameter configuration.
To Reproduce:
Update doc/dev/python_scheduling/profile_matmul with the following matmul parameters:
# These are the parameters we'll optimize
parameter_configurations = {
    "tile_sizes": [
        MatMulTileOptions(GemmTile(192, 144, 64), GemmTile(192, 72, 64))
    ],
    "mma_macro": [MmaMacroEncode(MmaMacroArch.hopper, 64, 72, 16)],
    "tile_order": [MatmulTileRasterizationOrder.column_major],
    "cluster_dims": [ClusterDims(1, 1, 1)],
    "circular_buffer_stages": [5],
}
Then, run the following commands:
NVFUSER_ENABLE=fuse_matmul NVFUSER_DISABLE=matmul_expr_eval python profile_matmul.py 1304 4936 688 TN --verbose
NVFUSER_ENABLE=fuse_matmul NVFUSER_DISABLE=matmul_expr_eval python profile_matmul.py 272 8952 360 TT --verbose
Details
===== Matmul Parameters ========
MMA macro: Hopper_64_72_16
CircularBufferOptions:
circular_buffer_smem_write: true
circular_buffer_smem_read: false
smem_circular_buffer_stage: 5
smem_circular_buffer_prefetch_gap: 1
SupportedVectorization:
a: 8
b: 8
epilogue: 8
MatMulTileOptions: warp tile [192, 72, 64], CTA tile [192, 144, 64]
Async global mem load: true
Indexing mode: int32_t
Tile rasterization order: column-major
Grid swizzle factor: 1
Tiling strategy: OneTilePerCTA
Buffering loop level: CTATiles
Circular buffering strategy: WarpSpecialized
__cluster_dims__(1, 1, 1)
Use shared memory epilogue: 1
Promote re-use of prologue shared memory: 1
Split-K factor: 1
====================================
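Triage note: the sketch below is a quick arithmetic check of how the three failing problem sizes relate to their CTA tiles. It is not a diagnosis. Whether partial tiles have anything to do with the misaligned address has not been established here. The helper `tile_coverage` is a hypothetical name used only for illustration and is not part of nvFuser; the problem sizes, CTA tiles, and the vectorization width of 8 are copied from the reports above.

```python
def tile_coverage(shape, cta_tile):
    """For each of (M, N, K), return (number of CTA tiles, extent % tile)."""
    report = []
    for extent, tile in zip(shape, cta_tile):
        num_tiles = -(-extent // tile)  # ceiling division
        remainder = extent % tile       # 0 means the extent is an exact multiple of the tile
        report.append((num_tiles, remainder))
    return report


# (layout, (m, n, k), CTA tile) taken from the failing cases in this issue
cases = [
    ("NN", (5320, 33928, 3464), (192, 208, 64)),
    ("TN", (1304, 4936, 688), (192, 144, 64)),
    ("TT", (272, 8952, 360), (192, 144, 64)),
]

for layout, shape, cta_tile in cases:
    print(layout, shape, cta_tile, tile_coverage(shape, cta_tile))
    # Check each extent against the reported vectorization width of 8
    print("  divisible by 8:", all(extent % 8 == 0 for extent in shape))
```

In all three cases every extent is a multiple of 8, which agrees with the reported SupportedVectorization, and no extent is an exact multiple of its CTA tile dimension, so each case has a partial tile along M, N, and K.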