[Issue]: Stable Diffusion, PyTorch conv2d breaks in ROCm 6.0 #3418

Open
AphidGit opened this issue Feb 17, 2024 · 3 comments

Comments

@AphidGit

Problem Description

Stable Diffusion webui worked when ROCm was at version 5.7. The 6.0 update (installed Feb 15) breaks it. With 5.7 I had the occasional hiccup, lockup or reboot, but it was fairly stable and could produce images. With 6.0, it consistently crashes as soon as any non-trivial data is loaded onto the GPU.

It reports the stack traces below. In between them I can see a RuntimeError that probably comes from a different thread. When loading multiple models (for example when using Low-Rank Adaptations), I get a RuntimeError for each one.

Traceback (most recent call last):
  File "/usr/lib/python3.11/threading.py", line 1002, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/initialize.py", line 147, in load_model
    shared.sd_model  # noqa: B018
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/shared_items.py", line 128, in sd_model
    return modules.sd_models.model_data.get_sd_model()
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/sd_models.py", line 531, in get_sd_model
    load_model()
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/sd_models.py", line 658, in load_model
    load_model_weights(sd_model, checkpoint_info, state_dict, timer)
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/sd_models.py", line 375, in load_model_weights
    model.load_state_dict(state_dict, strict=False)
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/sd_disable_initialization.py", line 223, in <lambda>
    module_load_state_dict = self.replace(torch.nn.Module, 'load_state_dict', lambda *args, **kwargs: load_state_dict(module_load_state_dict, *args, **kwargs))
                                                                                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/sd_disable_initialization.py", line 221, in load_state_dict
    original(module, state_dict, strict=strict)
  File "/usr/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2139, in load_state_dict
    load(self, state_dict)
  File "/usr/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2127, in load
    load(child, child_state_dict, child_prefix)
  File "/usr/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2127, in load
    load(child, child_state_dict, child_prefix)
  File "/usr/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2127, in load
    load(child, child_state_dict, child_prefix)
  [Previous line repeated 1 more time]
  File "/usr/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2121, in load
    module._load_from_state_dict(
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/sd_disable_initialization.py", line 225, in <lambda>
    linear_load_from_state_dict = self.replace(torch.nn.Linear, '_load_from_state_dict', lambda *args, **kwargs: load_from_state_dict(linear_load_from_state_dict, *args, **kwargs))
                                                                                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/sd_disable_initialization.py", line 191, in load_from_state_dict
    module._parameters[name] = torch.nn.parameter.Parameter(torch.zeros_like(param, device=device, dtype=dtype), requires_grad=param.requires_grad)
                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/torch/_meta_registrations.py", line 4815, in zeros_like
    res.fill_(0)
RuntimeError: HIP error: shared object initialization failed
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing HIP_LAUNCH_BLOCKING=1.

Exception in thread Thread-2 (load_model):
Traceback (most recent call last):
  File "/usr/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/initialize.py", line 153, in load_model
    devices.first_time_calculation()
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/devices.py", line 166, in first_time_calculation
    conv2d(x)
  File "/usr/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/hdd/AI/sd-venv/stable-diffusion-webui/extensions-builtin/Lora/networks.py", line 501, in network_Conv2d_forward
    return originals.Conv2d_forward(self, input)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 462, in forward
    return self._conv_forward(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 458, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: HIP error: shared object initialization failed
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing HIP_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
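
The second traceback is just webui's warm-up pass through F.conv2d, so the failure does not look webui-specific. A minimal sketch of an equivalent call outside webui (the shapes and dtype are my guesses, not webui's exact values):

# Minimal sketch (my own reduction, not webui code): an fp16 Conv2d forward
# like webui's warm-up call. HIP_LAUNCH_BLOCKING=1 makes the HIP error surface
# at the failing call instead of asynchronously.
import os
os.environ.setdefault("HIP_LAUNCH_BLOCKING", "1")  # set before torch initializes HIP

import torch

device = torch.device("cuda")  # ROCm builds expose the GPU through the "cuda" device
conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1).to(device=device, dtype=torch.float16)
x = torch.rand(1, 3, 64, 64, device=device, dtype=torch.float16)
print(conv(x).shape)  # on the affected setup the "HIP error" should surface here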

Digging further, I found that raising the environment variable AMD_LOG_LEVEL (anything above zero was enough, so try env AMD_LOG_LEVEL=1) gave me another clue:

:1:hip_code_object.cpp      :616 : 66280489141 us: [pid:379524 tid:0x769a212006c0] Cannot find the function: Cijk_Ailk_Bljk_HHS_BH_MT16x64x16_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS0_FSSC10_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR0_PLR1_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TSGRA0_TSGRB0_TT2_8_TLDS0_UMLDSA0_UMLDSB0_USFGRO0_VAW2_VS1_VW1_WSGRA0_WSGRB0_WS64_WG8_8_1_WGM8 
:1:hip_module.cpp           :83  : 66280489163 us: [pid:379524 tid:0x769a212006c0] Cannot find the function: Cijk_Ailk_Bljk_HHS_BH_MT16x64x16_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS0_FSSC10_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR0_PLR1_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TSGRA0_TSGRB0_TT2_8_TLDS0_UMLDSA0_UMLDSB0_USFGRO0_VAW2_VS1_VW1_WSGRA0_WSGRB0_WS64_WG8_8_1_WGM8 for module: 0x1c4f5db0
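
AMD_LOG_LEVEL has to be in the environment before the HIP runtime loads; exporting it in the shell before launching webui works, or, as a sketch, it can be set from Python before the first torch import:

# Sketch: enable HIP runtime logging from Python. This only works if it runs
# before `import torch`, since the HIP runtime reads the variable when it loads.
import os
os.environ["AMD_LOG_LEVEL"] = "1"  # 0 = off; higher values are increasingly verbose

import torch
x = torch.ones(8, device="cuda")
print((x * 2).sum().item())  # GPU work now logs HIP runtime messages (e.g. the
                             # "Cannot find the function" lines above) when something fails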

I edited the webui code to add a little 'press any key' prompt, attached gdb, and made it break at that line (a sketch of the pause follows the component list). Here's a full backtrace. The following components are involved:

  • rocm
  • blas
  • hip
  • pytorch
  • torchvision
  • stable-diffusion-webui
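
The pause mentioned above was just a blocking prompt so the process sits still long enough to attach gdb; roughly the following (the exact placement inside webui doesn't matter):

# Rough sketch of the pause added to webui so gdb can be attached.
# Once the prompt shows up: `gdb -p <pid>`, set a breakpoint (e.g. on
# hip_code_object.cpp:616 from the log above), `continue`, then press Enter here.
import os

print(f"PID {os.getpid()}: attach gdb now, then press Enter to continue")
input()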
(gdb) bt
#0  hip::DynCO::getDynFunc (func_name=..., hfunc=<optimized out>, this=0x76732cb4b7d0) at /usr/src/debug/hip-runtime-amd/clr-rocm-6.0.0/hipamd/src/hip_code_object.cpp:616
#1  PlatformState::getDynFunc (
    func_name=0x76732c011850 "Cijk_Ailk_Bljk_HHS_BH_MT16x64x16_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS0_FSSC10_FL0_GRPM1_GRVW1_GSU1_GSUASB_"..., hmod=0x76732c0b6ed0, hfunc=<optimized out>, this=0x6550370672d0) at /usr/src/debug/hip-runtime-amd/clr-rocm-6.0.0/hipamd/src/hip_platform.cpp:747
#2  hipModuleGetFunction (hfunc=<optimized out>, hmod=<optimized out>, 
    name=0x76732c011850 "Cijk_Ailk_Bljk_HHS_BH_MT16x64x16_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS0_FSSC10_FL0_GRPM1_GRVW1_GSU1_GSUASB_"...) at /usr/src/debug/hip-runtime-amd/clr-rocm-6.0.0/hipamd/src/hip_module.cpp:82
#3  0x00007675c994165c in ?? () from /opt/rocm/lib/librocblas.so.4
#4  0x00007675c994250c in ?? () from /opt/rocm/lib/librocblas.so.4
#5  0x00007675c994280c in ?? () from /opt/rocm/lib/librocblas.so.4
#6  0x00007675c8f88a6f in ?? () from /opt/rocm/lib/librocblas.so.4
#7  0x00007675c9061ec9 in ?? () from /opt/rocm/lib/librocblas.so.4
#8  0x00007675c905fa3c in ?? () from /opt/rocm/lib/librocblas.so.4
#9  0x00007675c905ac1e in ?? () from /opt/rocm/lib/librocblas.so.4
#10 0x00007675c90586d9 in rocblas_gemm_ex () from /opt/rocm/lib/librocblas.so.4
#11 0x0000767646fbda9a in ?? () from /usr/lib/libtorch_hip.so
#12 0x0000767646fdb254 in ?? () from /usr/lib/libtorch_hip.so
#13 0x0000767647122601 in ?? () from /usr/lib/libtorch_hip.so
#14 0x00007676471226a4 in ?? () from /usr/lib/libtorch_hip.so
#15 0x00007676a8c791cc in at::_ops::addmm::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&) () from /usr/lib/libtorch_cpu.so
#16 0x00007676aad66f13 in ?? () from /usr/lib/libtorch_cpu.so
#17 0x00007676aad67e46 in ?? () from /usr/lib/libtorch_cpu.so
#18 0x00007676a8cee25b in at::_ops::addmm::call(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&) () from /usr/lib/libtorch_cpu.so
#19 0x00007676a851bd80 in at::native::linear(at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&) () from /usr/lib/libtorch_cpu.so
#20 0x00007676a975fb5b in ?? () from /usr/lib/libtorch_cpu.so
#21 0x00007676a8cd56c7 in at::_ops::linear::call(at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&) () from /usr/lib/libtorch_cpu.so
#22 0x00007676b3222be8 in ?? () from /usr/lib/python3.11/site-packages/torch/lib/libtorch_python.so
#23 0x00007676be1fdd41 in cfunction_call (func=0x7674af8af470, args=<optimized out>, kwargs=<optimized out>) at Objects/methodobject.c:542
#24 0x00007676be1dc054 in _PyObject_MakeTpCall (tstate=0x65504125ade0, callable=0x7674af8af470, args=<optimized out>, nargs=3, keywords=0x0) at Objects/call.c:214
#25 0x00007676be1e76e1 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:4769
#26 0x00007676be22de9f in _PyEval_EvalFrame (throwflag=0, frame=0x7676053e9428, tstate=0x65504125ade0) at ./Include/internal/pycore_ceval.h:73
#27 _PyEval_Vector (kwnames=0x0, argcount=<optimized out>, args=0x7673311ff1b0, locals=0x0, func=0x76733198c5e0, tstate=0x65504125ade0) at Python/ceval.c:6434
#28 _PyFunction_Vectorcall (kwnames=0x0, nargsf=<optimized out>, stack=0x7673311ff1b0, func=0x76733198c5e0) at Objects/call.c:393
#29 _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x7673311ff1b0, callable=0x76733198c5e0, tstate=0x65504125ade0) at ./Include/internal/pycore_call.h:92
#30 method_vectorcall (method=<optimized out>, args=<optimized out>, nargsf=<optimized out>, kwnames=0x0) at Objects/classobject.c:89
#31 0x00007676be1eb6a3 in do_call_core (use_tracing=<optimized out>, kwdict=0x7673304003c0, callargs=0x7673301e9840, func=0x76734ac3d600, tstate=<optimized out>) at Python/ceval.c:7352
#32 _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:5376
#33 0x00007676be22de9f in _PyEval_EvalFrame (throwflag=0, frame=0x7676053e9308, tstate=0x65504125ade0) at ./Include/internal/pycore_ceval.h:73
#34 _PyEval_Vector (kwnames=0x0, argcount=<optimized out>, args=0x7673311ff3e0, locals=0x0, func=0x7674ae114680, tstate=0x65504125ade0) at Python/ceval.c:6434
#35 _PyFunction_Vectorcall (kwnames=0x0, nargsf=<optimized out>, stack=0x7673311ff3e0, func=0x7674ae114680) at Objects/call.c:393
#36 _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x7673311ff3e0, callable=0x7674ae114680, tstate=0x65504125ade0) at ./Include/internal/pycore_call.h:92
#37 method_vectorcall (method=<optimized out>, args=<optimized out>, nargsf=<optimized out>, kwnames=0x0) at Objects/classobject.c:89
#38 0x00007676be1eb6a3 in do_call_core (use_tracing=<optimized out>, kwdict=0x76732b538e00, callargs=0x76734ac34d90, func=0x76734ac0a080, tstate=<optimized out>) at Python/ceval.c:7352
#39 _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:5376
#40 0x00007676be20e8c0 in _PyEval_EvalFrame (throwflag=0, frame=0x7676053e9280, tstate=0x65504125ade0) at ./Include/internal/pycore_ceval.h:73
#41 _PyEval_Vector (kwnames=<optimized out>, argcount=2, args=0x7673311ff6a0, locals=0x0, func=0x7674ae1145e0, tstate=0x65504125ade0) at Python/ceval.c:6434
#42 _PyFunction_Vectorcall (func=0x7674ae1145e0, stack=0x7673311ff6a0, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:393
#43 0x00007676be1e0d97 in _PyObject_FastCallDictTstate (tstate=0x65504125ade0, callable=0x7674ae1145e0, args=<optimized out>, nargsf=<optimized out>, kwargs=<optimized out>) at Objects/call.c:141
#44 0x00007676be216b3d in _PyObject_Call_Prepend (tstate=0x65504125ade0, callable=0x7674ae1145e0, obj=0x76734ac3d6d0, args=<optimized out>, kwargs=0x0) at Objects/call.c:482
#45 0x00007676be2dba82 in slot_tp_call (self=0x76734ac3d6d0, args=0x76734ac34d30, kwds=0x0) at Objects/typeobject.c:7623
#46 0x00007676be1dc054 in _PyObject_MakeTpCall (tstate=0x65504125ade0, callable=0x76734ac3d6d0, args=<optimized out>, nargs=1, keywords=0x0) at Objects/call.c:214
#47 0x00007676be1e76e1 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:4769
#48 0x00007676be20e8c0 in _PyEval_EvalFrame (throwflag=0, frame=0x7676053e91f8, tstate=0x65504125ade0) at ./Include/internal/pycore_ceval.h:73
#49 _PyEval_Vector (kwnames=<optimized out>, argcount=0, args=0x7676be574010 <_PyRuntime+58928>, locals=0x0, func=0x767349893880, tstate=0x65504125ade0) at Python/ceval.c:6434
#50 _PyFunction_Vectorcall (func=0x767349893880, stack=0x7676be574010 <_PyRuntime+58928>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:393
#51 0x00007676be2e2297 in bounded_lru_cache_wrapper (self=0x767349937ed0, args=0x7676be573ff8 <_PyRuntime+58904>, kwds=0x0) at ./Modules/_functoolsmodule.c:1021
#52 0x00007676be1dc054 in _PyObject_MakeTpCall (tstate=0x65504125ade0, callable=0x767349937ed0, args=<optimized out>, nargs=0, keywords=0x0) at Objects/call.c:214
#53 0x00007676be1e76e1 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:4769
#54 0x00007676be20e8c0 in _PyEval_EvalFrame (throwflag=0, frame=0x7676053e9188, tstate=0x65504125ade0) at ./Include/internal/pycore_ceval.h:73
#55 _PyEval_Vector (kwnames=<optimized out>, argcount=0, args=0x7676be574010 <_PyRuntime+58928>, locals=0x0, func=0x76733132b9c0, tstate=0x65504125ade0) at Python/ceval.c:6434
#56 _PyFunction_Vectorcall (func=0x76733132b9c0, stack=0x7676be574010 <_PyRuntime+58928>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:393
#57 0x00007676be1eb6a3 in do_call_core (use_tracing=<optimized out>, kwdict=0x767331384e40, callargs=0x7676be573ff8 <_PyRuntime+58904>, func=0x76733132b9c0, tstate=<optimized out>) at Python/ceval.c:7352
#58 _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:5376
#59 0x00007676be22e583 in _PyEval_EvalFrame (throwflag=0, frame=0x7676053e9020, tstate=0x65504125ade0) at ./Include/internal/pycore_ceval.h:73
#60 _PyEval_Vector (kwnames=<optimized out>, argcount=<optimized out>, args=0x7673311ffe28, locals=0x0, func=0x7676bd476ca0, tstate=0x65504125ade0) at Python/ceval.c:6434
#61 _PyFunction_Vectorcall (kwnames=<optimized out>, nargsf=<optimized out>, stack=0x7673311ffe28, func=0x7676bd476ca0) at Objects/call.c:393
#62 _PyObject_VectorcallTstate (tstate=0x65504125ade0, callable=0x7676bd476ca0, args=0x7673311ffe28, nargsf=<optimized out>, kwnames=<optimized out>) at ./Include/internal/pycore_call.h:92
#63 0x00007676be22e070 in method_vectorcall (method=<optimized out>, args=0x7676be574010 <_PyRuntime+58928>, nargsf=<optimized out>, kwnames=0x0) at Objects/classobject.c:67
#64 0x00007676be2f4df8 in thread_run (boot_raw=0x655044198250) at ./Modules/_threadmodule.c:1124
#65 0x00007676be2cc538 in pythread_wrapper (arg=<optimized out>) at Python/thread_pthread.h:241
#66 0x00007676bdea955a in start_thread (arg=<optimized out>) at pthread_create.c:447
#67 0x00007676bdf26a3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
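
For completeness: the backtrace bottoms out in rocblas_gemm_ex, reached from an fp16 addmm via torch.nn.functional.linear, so the same path can be exercised directly without webui (another sketch; the shapes are arbitrary):

# Sketch: drive the addmm -> rocblas_gemm_ex path shown in the backtrace.
import torch

device = torch.device("cuda")
lin = torch.nn.Linear(320, 320).to(device=device, dtype=torch.float16)
x = torch.rand(4, 320, device=device, dtype=torch.float16)
print(lin(x).float().abs().mean().item())  # F.linear -> addmm -> rocblas_gemm_ex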

Operating System

Arch Linux, kernel 6.7.4-arch1-1

CPU

AMD Threadripper 1950X

GPU

AMD Radeon RX 7900 XTX

ROCm Version

ROCm 6.0.0

ROCm Component

No response

Steps to Reproduce

To reproduce:

1. Create a venv and activate it.
2. Install stable-diffusion-webui, following https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Install-and-Run-on-AMD-GPUs#user-content-install-on-amd-and-arch-linux
3. Download any SD model and place it in the models folder.
4. Run either ./webui.sh or python launch.py
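
Before step 4 it can be worth confirming, inside the venv, that the torch build actually sees the GPU; a quick check (not part of the webui steps):

# Quick sanity check inside the venv before launching webui.
import torch

print(torch.__version__)          # torch version installed in the venv
print(torch.version.hip)          # HIP/ROCm version torch was built against (None on CUDA builds)
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # should report the RX 7900 XTX
else:
    print("no GPU visible to torch")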

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen Threadripper 1950X 16-Core Processor
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen Threadripper 1950X 16-Core Processor
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3400                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            32                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    131743808(0x7da4040) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    131743808(0x7da4040) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    131743808(0x7da4040) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1100                            
  Uuid:                    GPU-<redacted>               
  Marketing Name:          AMD Radeon RX 7900 XTX             
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      32(0x20) KB                        
    L2:                      6144(0x1800) KB                    
    L3:                      98304(0x18000) KB                  
  Chip ID:                 29772(0x744c)                      
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2482                               
  BDFID:                   17152                              
  Internal Node ID:        1                                  
  Compute Unit:            96                                 
  SIMDs per CU:            2                                  
  Shader Engines:          6                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 528                                
  SDMA engine uCode::      19                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    25149440(0x17fc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    25149440(0x17fc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1100         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***             

Additional Information

No response

@Kamishirasawa-keine

Same issue.

@alexxu-amd

Yea, I am able to reproduce this error using the installation steps from https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Install-and-Run-on-AMD-GPUs#user-content-install-on-amd-and-arch-linux.

Can you guys try reinstalling torch and torchvision using the following?

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.0
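
After reinstalling, something along these lines should confirm that the nightly ROCm 6.0 wheel is the one actually being imported in the venv (a sketch):

# Sketch: verify which torch build the venv picks up after the reinstall.
import torch

print(torch.__version__)   # expect a dev/nightly version string
print(torch.version.hip)   # expect a 6.0.x HIP version for the rocm6.0 nightly wheels
print(torch.cuda.is_available())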

@ppanchad-amd

@AphidGit @Kamishirasawa-keine Please try @alexxu-amd's suggestion above. Thanks!
