Build failure: resulting binary over 2GB #39

prusnak · 2025-02-11T21:31:12Z

This is related to pytorch/pytorch#39968, which was resolved in pytorch/pytorch#49050 by splitting the library into smaller ones.

I am also including the Nixpkgs issue for more details: NixOS/nixpkgs#239237

In Nixpkgs we first used -Xfatbin=-compress-all (same as pytorch), but we are hitting the 2 GB limit again. Using -mcmodel=large does not seem to help, so I guess the only way is to split the magma library into smaller ones too (the same approach as pytorch).

On x86_64-linux:

/nix/store/81mi7m3k3wsiz9rrrg636sx21psj20hc-glibc-2.40-66/lib/crti.o: in function `_init':
(.init+0xb): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against undefined symbol `__gmon_start__'
CMakeFiles/magma.dir/control/get_batched_crossover.cpp.o: in function `magma_use_zgeqrf_batched_fused_update':
get_batched_crossover.cpp:(.text+0x35a): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `zgeqrf_panel_decision_a100' defined in .bss section in CMakeFiles/magma.dir/control/get_batched_crossover.cpp.o
CMakeFiles/magma.dir/control/get_batched_crossover.cpp.o: in function `magma_use_cgeqrf_batched_fused_update':
get_batched_crossover.cpp:(.text+0x45a): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `cgeqrf_panel_decision_a100' defined in .bss section in CMakeFiles/magma.dir/control/get_batched_crossover.cpp.o
CMakeFiles/magma.dir/control/get_batched_crossover.cpp.o: in function `magma_use_dgeqrf_batched_fused_update':
get_batched_crossover.cpp:(.text+0x55a): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `dgeqrf_panel_decision_a100' defined in .bss section in CMakeFiles/magma.dir/control/get_batched_crossover.cpp.o
CMakeFiles/magma.dir/control/get_batched_crossover.cpp.o: in function `magma_use_sgeqrf_batched_fused_update':
get_batched_crossover.cpp:(.text+0x65a): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `sgeqrf_panel_decision_a100' defined in .bss section in CMakeFiles/magma.dir/control/get_batched_crossover.cpp.o
CMakeFiles/magma.dir/control/get_batched_crossover.cpp.o: in function `__static_initialization_and_destruction_0()':
get_batched_crossover.cpp:(.text.startup+0x4f8): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `sgeqrf_panel_decision_mi100' defined in .bss section in CMakeFiles/magma.dir/control/get_batched_crossover.cpp.o
get_batched_crossover.cpp:(.text.startup+0x554): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > >::~vector()' defined in .text._ZNSt6vectorIS_IiSaIiEESaIS1_EED2Ev[_ZNSt6vectorIS_IiSaIiEESaIS1_EED5Ev] section in CMakeFiles/magma.dir/control/get_batched_crossover.cpp.o
get_batched_crossover.cpp:(.text.startup+0x55b): relocation truncated to fit: R_X86_64_PC32 against symbol `__dso_handle' defined in .data.rel.local section in /nix/store/dmpy95dhb2m473l4nyxgmcp4cnnial8w-gfortran-14-20241116/lib/gcc/x86_64-unknown-linux-gnu/14.2.1/crtbeginS.o
get_batched_crossover.cpp:(.text.startup+0x854): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `dgeqrf_panel_decision_mi100' defined in .bss section in CMakeFiles/magma.dir/control/get_batched_crossover.cpp.o
get_batched_crossover.cpp:(.text.startup+0x8a0): relocation truncated to fit: R_X86_64_REX_GOTPCRELX against symbol `dgeqrf_panel_decision_mi100' defined in .bss section in CMakeFiles/magma.dir/control/get_batched_crossover.cpp.o
get_batched_crossover.cpp:(.text.startup+0x8a7): additional relocation overflows omitted from the output
lib/libmagma.so: PC-relative offset overflow in GOT PLT entry for `_Z38magmablas_zapply_vector_kernel_batchediiP7double2iPS0_ii'
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.

On aarch64-linux:

CMakeFiles/magma.dir/control/get_batched_crossover.cpp.o:(.eh_frame+0x1c): relocation truncated to fit: R_AARCH64_PREL32 against `.text._ZNSt6vectorIS_IiSaIiEESaIS1_EED2Ev'
CMakeFiles/magma.dir/control/get_batched_crossover.cpp.o:(.eh_frame+0x6c): relocation truncated to fit: R_AARCH64_PREL32 against `.text.startup'
CMakeFiles/magma.dir/control/get_batched_crossover.cpp.o:(.eh_frame+0x9c): relocation truncated to fit: R_AARCH64_PREL32 against `.text'
CMakeFiles/magma.dir/control/get_batched_crossover.cpp.o:(.eh_frame+0xb0): relocation truncated to fit: R_AARCH64_PREL32 against `.text'
CMakeFiles/magma.dir/control/get_batched_crossover.cpp.o:(.eh_frame+0xc4): relocation truncated to fit: R_AARCH64_PREL32 against `.text'
CMakeFiles/magma.dir/control/get_batched_crossover.cpp.o:(.eh_frame+0xd8): relocation truncated to fit: R_AARCH64_PREL32 against `.text'
CMakeFiles/magma.dir/control/get_batched_crossover.cpp.o:(.eh_frame+0xec): relocation truncated to fit: R_AARCH64_PREL32 against `.text'
CMakeFiles/magma.dir/control/get_batched_crossover.cpp.o:(.eh_frame+0x100): relocation truncated to fit: R_AARCH64_PREL32 against `.text'
CMakeFiles/magma.dir/control/get_batched_crossover.cpp.o:(.eh_frame+0x114): relocation truncated to fit: R_AARCH64_PREL32 against `.text'
CMakeFiles/magma.dir/control/get_batched_crossover.cpp.o:(.eh_frame+0x128): relocation truncated to fit: R_AARCH64_PREL32 against `.text'
CMakeFiles/magma.dir/control/get_batched_crossover.cpp.o:(.eh_frame+0x13c): additional relocation overflows omitted from the output
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.
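
For readers unfamiliar with "relocation truncated to fit": relocations such as R_X86_64_PC32 store a PC-relative value (symbol address plus addend, minus the address of the place being patched) in a 32-bit field, so the linker fails once that value no longer fits. The sketch below only illustrates that range check. The addresses are made up, the helper names are hypothetical, and it simplifies the per-relocation rules of the real ABIs to a signed 32-bit check.

```python
# Simplified model of a 32-bit PC-relative relocation (S + A - P).
# Assumption: the field is treated as a signed 32-bit integer; the exact
# overflow rule differs slightly between relocation types and ABIs.

INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1


def pc_relative_value(symbol_addr: int, addend: int, place: int) -> int:
    """Value the linker would store: symbol + addend - place."""
    return symbol_addr + addend - place


def fits_signed_32(value: int) -> bool:
    """True if the value can be encoded in a signed 32-bit field."""
    return INT32_MIN <= value <= INT32_MAX


# Hypothetical addresses, chosen only to land on either side of the limit.
near = pc_relative_value(symbol_addr=0x0040_0000, addend=-4, place=0x1000)
far = pc_relative_value(symbol_addr=0x9000_0000, addend=-4, place=0x1000)

print(fits_signed_32(near))  # True: target is a few MiB away
print(fits_signed_32(far))   # False: target is more than 2 GiB away
```

Once code and data in a single shared object span more than about 2 GiB, some references inevitably fall into the second case, which is why shrinking or splitting the library is the usual remedy.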


mgates3 · 2025-02-21T14:43:12Z

Could you give some context here to reproduce the issue? The best would be a MAGMA Makefile (make.inc) or CMake configuration that reproduces it. I saw the Nixpkgs issue, but it isn't clear there how you are building MAGMA. It appears that you are compiling for all these CUDA capabilities:
["6.0", "6.1", "6.2", "7.0", "7.2", "7.5", "8.0", "8.6", "8.9", "9.0"]
I don't know that there's significant benefit from compiling for, say, 6.1 or 6.2 if 6.0 is available.
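
As a rough illustration of that suggestion, the sketch below reduces a capability list to the lowest minor version per major version, relying on the fact that GPU code built for compute capability X.y runs on devices of the same major version with an equal or higher minor version. The function name is hypothetical, and this is not how MAGMA or Nixpkgs actually selects targets. Dropping the higher minors may also give up architecture-specific tuning.

```python
def lowest_minor_per_major(capabilities):
    """Keep only the smallest minor version for each major version."""
    best = {}
    for cap in capabilities:
        major, minor = (int(part) for part in cap.split("."))
        if major not in best or minor < best[major]:
            best[major] = minor
    return [f"{major}.{best[major]}" for major in sorted(best)]


caps = ["6.0", "6.1", "6.2", "7.0", "7.2", "7.5", "8.0", "8.6", "8.9", "9.0"]
print(lowest_minor_per_major(caps))  # ['6.0', '7.0', '8.0', '9.0']
```

Going from ten targets to four would shrink the embedded device code considerably, though whether that alone brings the library under the limit would have to be measured.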

prusnak · 2025-02-21T15:05:56Z

> It appears that you are compiling for all these CUDA capabilities:
> ["6.0", "6.1", "6.2", "7.0", "7.2", "7.5", "8.0", "8.6", "8.9", "9.0"]

Right

> I don't know that there's significant benefit from compiling for, say, 6.1 or 6.2 if 6.0 is available.

Maybe @ConnorBaker can explain why we do this?

ConnorBaker · 2025-02-21T15:39:41Z

The GPU selection done in Nixpkgs is fairly naive and driven by https://github.com/NixOS/nixpkgs/blob/daa2a442b9c82a265afaedbdf5589adadc01095c/pkgs/development/cuda-modules/gpus.nix

Essentially, every capability supported by a CUDA version is added by default to maximize compatibility and performance. (This is also done in part to reduce load on CI, as specifying a different list of capabilities involves rebuilding all CUDA-enabled packages.)

I recommend that users configure Nixpkgs to target their specific capability. Unfortunately, that generally requires that they rebuild everything locally, as CI only caches packages built with the default configuration.

I’ve not done any research into the various capabilities to find out what has changed between them (excluding the floating point operation speed up between 8.0 and 8.6), so that would be a good place to start if we wanted to cull additional capabilities from the default set.

prusnak mentioned this issue Feb 11, 2025

Build failure: dynamic Magma with cudaSupport NixOS/nixpkgs#239237

Open
