cudaErrorMisalignedAddress when sweeping matmul problems with NN, TN, and TT layouts. #3966

@rdspring1

Context: cudaErrorMisalignedAddress occurred while running nvFuser matmul benchmarks on problem sizes from matmul_problems.csv, with matmul parameters taken from the NvJet CUDA kernel names.

Branch: https://github.com/NVIDIA/Fuser/tree/hopper_matmul_heuristics

Error message:

CUDA error: cudaErrorMisalignedAddress failed with error misaligned address
Exception raised from time at /opt/pytorch/nvfuser/csrc/fusion_profiler.cpp:228

Layout NN -- (m = 5320, n = 33928, k = 3464, layout = 'NN')

To Reproduce:
Update doc/dev/python_scheduling/profile_matmul with the following matmul parameters:

    # These are the parameters we'll optimize
    parameter_configurations = {
        "tile_sizes": [
            MatMulTileOptions(GemmTile(192, 208, 64), GemmTile(192, 104, 64))
        ],
        "mma_macro": [MmaMacroEncode(MmaMacroArch.hopper, 64, 104, 16)],
        "tile_order": [MatmulTileRasterizationOrder.column_major],
        "cluster_dims": [ClusterDims(1, 1, 1)],
        "circular_buffer_stages": [4],
    }

Then, run the following command:

    NVFUSER_ENABLE=fuse_matmul NVFUSER_DISABLE=matmul_expr_eval python profile_matmul.py 5320 33928 3464 NN --verbose

Details
===== Matmul Parameters ========
MMA macro: Hopper_64_104_16
CircularBufferOptions:
  circular_buffer_smem_write: true
  circular_buffer_smem_read: false
  smem_circular_buffer_stage: 4
  smem_circular_buffer_prefetch_gap: 1
SupportedVectorization:
  a: 8
  b: 8
  epilogue: 8
MatMulTileOptions: warp tile [192, 104, 64], CTA tile [192, 208, 64]
Async global mem load: true
Indexing mode: int32_t
Tile rasterization order: column-major
Grid swizzle factor: 1
Tiling strategy: OneTilePerCTA
Buffering loop level: CTATiles
Circular buffering strategy: WarpSpecialized
__cluster_dims__(1, 1, 1)
Use shared memory epilogue: 1
Promote re-use of prologue shared memory: 1
Split-K factor: 1
====================================

Layout TN -- (m = 1304, n = 4936, k = 688, layout = 'TN')

Layout TT -- (m = 272, n = 8952, k = 360, layout = 'TT')

The (1304, 4936, 688, 'TN') and (272, 8952, 360, 'TT') problems share the same matmul parameter configuration.

To Reproduce:
Update doc/dev/python_scheduling/profile_matmul with the following matmul parameters:

    # These are the parameters we'll optimize
    parameter_configurations = {
        "tile_sizes": [
            MatMulTileOptions(GemmTile(192, 144, 64), GemmTile(192, 72, 64))
        ],
        "mma_macro": [MmaMacroEncode(MmaMacroArch.hopper, 64, 72, 16)],
        "tile_order": [MatmulTileRasterizationOrder.column_major],
        "cluster_dims": [ClusterDims(1, 1, 1)],
        "circular_buffer_stages": [5],
    }

Then, run the following commands:

    NVFUSER_ENABLE=fuse_matmul NVFUSER_DISABLE=matmul_expr_eval python profile_matmul.py 1304 4936 688 TN --verbose
    NVFUSER_ENABLE=fuse_matmul NVFUSER_DISABLE=matmul_expr_eval python profile_matmul.py 272 8952 360 TT --verbose
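When sweeping several failing shapes, it may be convenient to assemble these command lines programmatically. The following is a minimal sketch, assuming profile_matmul.py is run from its own directory. `build_repro` and `format_command` are hypothetical helpers written for this issue, not part of nvFuser, and they do not edit `parameter_configurations` in the script.

```python
import os
import shlex

# Environment toggles copied verbatim from the commands in this issue.
REPRO_ENV = {
    "NVFUSER_ENABLE": "fuse_matmul",
    "NVFUSER_DISABLE": "matmul_expr_eval",
}


def build_repro(m, n, k, layout):
    """Return (env, argv) for one profile_matmul.py run (hypothetical helper)."""
    env = dict(os.environ, **REPRO_ENV)
    argv = ["python", "profile_matmul.py", str(m), str(n), str(k), layout, "--verbose"]
    return env, argv


def format_command(m, n, k, layout):
    """Render the same run as a single shell command line."""
    _, argv = build_repro(m, n, k, layout)
    prefix = " ".join(f"{key}={value}" for key, value in REPRO_ENV.items())
    return f"{prefix} {shlex.join(argv)}"


if __name__ == "__main__":
    # The two problems that share the Hopper_64_72_16 configuration.
    for problem in [(1304, 4936, 688, "TN"), (272, 8952, 360, "TT")]:
        print(format_command(*problem))
```

The `(env, argv)` pair can be passed to `subprocess.run(argv, env=env)` to launch each run in turn.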

Details
===== Matmul Parameters ========
MMA macro: Hopper_64_72_16
CircularBufferOptions:
  circular_buffer_smem_write: true
  circular_buffer_smem_read: false
  smem_circular_buffer_stage: 5
  smem_circular_buffer_prefetch_gap: 1
SupportedVectorization:
  a: 8
  b: 8
  epilogue: 8
MatMulTileOptions: warp tile [192, 72, 64], CTA tile [192, 144, 64]
Async global mem load: true
Indexing mode: int32_t
Tile rasterization order: column-major
Grid swizzle factor: 1
Tiling strategy: OneTilePerCTA
Buffering loop level: CTATiles
Circular buffering strategy: WarpSpecialized
__cluster_dims__(1, 1, 1)
Use shared memory epilogue: 1
Promote re-use of prologue shared memory: 1
Split-K factor: 1
====================================
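
As a triage aid, the three failing shapes can be compared against the CTA tiles and the vectorization width reported in the parameter dumps. The sketch below is only arithmetic on the numbers in this issue. The `Case` type and helper names are hypothetical, and the check does not establish the root cause of the misaligned address.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    layout: str
    mnk: tuple       # problem size (m, n, k) from this issue
    cta_tile: tuple  # CTA tile (m, n, k) from the matching parameter dump


CASES = [
    Case("NN", (5320, 33928, 3464), (192, 208, 64)),
    Case("TN", (1304, 4936, 688), (192, 144, 64)),
    Case("TT", (272, 8952, 360), (192, 144, 64)),
]

# SupportedVectorization for a, b, and epilogue in both dumps.
VECTOR_WIDTH = 8


def remainders(case):
    """Elements left over in each dimension after the last full CTA tile."""
    return tuple(size % tile for size, tile in zip(case.mnk, case.cta_tile))


def vector_aligned(case):
    """True if every problem dimension is a multiple of the vector width."""
    return all(size % VECTOR_WIDTH == 0 for size in case.mnk)


if __name__ == "__main__":
    for case in CASES:
        print(case.layout, remainders(case), vector_aligned(case))
    # NN (136, 24, 8) True
    # TN (152, 40, 48) True
    # TT (80, 24, 40) True
```

Every dimension of all three problems is a multiple of 8, yet none is a multiple of its CTA tile extent, so each case has a partial tile in M, N, and K. Whether the partial tiles are related to the error has not been confirmed.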
