[Issue]: Stable Diffusion, PyTorch conv2d breaks in ROCm 6.0 #3418

Open
AphidGit opened this issue Feb 17, 2024 · 3 comments

Comments

@AphidGit

Problem Description

Stable Diffusion webui worked when ROCm was at version 5.7. The 6.0 update (installed Feb 15) breaks it. With 5.7 I had the occasional hiccup, lockup or reboot, but it was fairly stable and could produce images. With 6.0, it consistently crashes as soon as any non-trivial data is loaded onto the GPU.

It reports the stack traces below. In between them I can see a RuntimeError that probably comes from a different thread. When loading multiple models (for example when using Low-Rank Adaptations), I get a RuntimeError for each one.

Traceback (most recent call last):
  File "/usr/lib/python3.11/threading.py", line 1002, in _bootstrap
    self._bootstrap_inner()
  File "/usr/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/initialize.py", line 147, in load_model
    shared.sd_model  # noqa: B018
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/shared_items.py", line 128, in sd_model
    return modules.sd_models.model_data.get_sd_model()
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/sd_models.py", line 531, in get_sd_model
    load_model()
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/sd_models.py", line 658, in load_model
    load_model_weights(sd_model, checkpoint_info, state_dict, timer)
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/sd_models.py", line 375, in load_model_weights
    model.load_state_dict(state_dict, strict=False)
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/sd_disable_initialization.py", line 223, in <lambda>
    module_load_state_dict = self.replace(torch.nn.Module, 'load_state_dict', lambda *args, **kwargs: load_state_dict(module_load_state_dict, *args, **kwargs))
                                                                                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/sd_disable_initialization.py", line 221, in load_state_dict
    original(module, state_dict, strict=strict)
  File "/usr/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2139, in load_state_dict
    load(self, state_dict)
  File "/usr/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2127, in load
    load(child, child_state_dict, child_prefix)
  File "/usr/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2127, in load
    load(child, child_state_dict, child_prefix)
  File "/usr/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2127, in load
    load(child, child_state_dict, child_prefix)
  [Previous line repeated 1 more time]
  File "/usr/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2121, in load
    module._load_from_state_dict(
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/sd_disable_initialization.py", line 225, in <lambda>
    linear_load_from_state_dict = self.replace(torch.nn.Linear, '_load_from_state_dict', lambda *args, **kwargs: load_from_state_dict(linear_load_from_state_dict, *args, **kwargs))
                                                                                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/sd_disable_initialization.py", line 191, in load_from_state_dict
    module._parameters[name] = torch.nn.parameter.Parameter(torch.zeros_like(param, device=device, dtype=dtype), requires_grad=param.requires_grad)
                                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/torch/_meta_registrations.py", line 4815, in zeros_like
    res.fill_(0)
RuntimeError: HIP error: shared object initialization failed
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing HIP_LAUNCH_BLOCKING=1.

Exception in thread Thread-2 (load_model):
Traceback (most recent call last):
  File "/usr/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/initialize.py", line 153, in load_model
    devices.first_time_calculation()
  File "/hdd/AI/sd-venv/stable-diffusion-webui/modules/devices.py", line 166, in first_time_calculation
    conv2d(x)
  File "/usr/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/hdd/AI/sd-venv/stable-diffusion-webui/extensions-builtin/Lora/networks.py", line 501, in network_Conv2d_forward
    return originals.Conv2d_forward(self, input)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 462, in forward
    return self._conv_forward(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.11/site-packages/torch/nn/modules/conv.py", line 458, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: HIP error: shared object initialization failed
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing HIP_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
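
The second traceback is just webui's warm-up pass through F.conv2d, so the failure does not look webui-specific. A minimal sketch of an equivalent call outside webui (the shapes and dtype are my guesses, not webui's exact values):

# Minimal sketch (my own reduction, not webui code): an fp16 Conv2d forward
# like webui's warm-up call. HIP_LAUNCH_BLOCKING=1 makes the HIP error surface
# at the failing call instead of asynchronously.
import os
os.environ.setdefault("HIP_LAUNCH_BLOCKING", "1")  # set before torch initializes HIP

import torch

device = torch.device("cuda")  # ROCm builds expose the GPU through the "cuda" device
conv = torch.nn.Conv2d(3, 8, kernel_size=3, padding=1).to(device=device, dtype=torch.float16)
x = torch.rand(1, 3, 64, 64, device=device, dtype=torch.float16)
print(conv(x).shape)  # on the affected setup the "HIP error" should surface here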

Digging further, I found that raising the environment variable AMD_LOG_LEVEL (anything above zero was enough, so try env AMD_LOG_LEVEL=1) gave me another clue:

:1:hip_code_object.cpp      :616 : 66280489141 us: [pid:379524 tid:0x769a212006c0] Cannot find the function: Cijk_Ailk_Bljk_HHS_BH_MT16x64x16_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS0_FSSC10_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR0_PLR1_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TSGRA0_TSGRB0_TT2_8_TLDS0_UMLDSA0_UMLDSB0_USFGRO0_VAW2_VS1_VW1_WSGRA0_WSGRB0_WS64_WG8_8_1_WGM8 
:1:hip_module.cpp           :83  : 66280489163 us: [pid:379524 tid:0x769a212006c0] Cannot find the function: Cijk_Ailk_Bljk_HHS_BH_MT16x64x16_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS0_FSSC10_FL0_GRPM1_GRVW1_GSU1_GSUASB_GLS0_ISA000_IU1_K1_KLS_LBSPP0_LPA0_LPB0_LDL1_LRVW1_LWPMn1_LDW0_FMA_MIAV0_MDA2_MO40_MMFSC_NTA0_NTB0_NTC0_NTD0_NEPBS0_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR0_PLR1_SIA1_SS0_SU32_SUM0_SUS256_SCIUI1_SPO0_SRVW0_SSO0_SVW4_SNLL0_TSGRA0_TSGRB0_TT2_8_TLDS0_UMLDSA0_UMLDSB0_USFGRO0_VAW2_VS1_VW1_WSGRA0_WSGRB0_WS64_WG8_8_1_WGM8 for module: 0x1c4f5db0
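
AMD_LOG_LEVEL has to be in the environment before the HIP runtime loads; exporting it in the shell before launching webui works, or, as a sketch, it can be set from Python before the first torch import:

# Sketch: enable HIP runtime logging from Python. This only works if it runs
# before `import torch`, since the HIP runtime reads the variable when it loads.
import os
os.environ["AMD_LOG_LEVEL"] = "1"  # 0 = off; higher values are increasingly verbose

import torch
x = torch.ones(8, device="cuda")
print((x * 2).sum().item())  # GPU work now logs HIP runtime messages (e.g. the
                             # "Cannot find the function" lines above) when something fails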

I edited the webui code to add a little 'press any key' prompt, attached gdb, and made it break at that line (a sketch of the pause follows the component list). Here's a full backtrace. The following components are involved:

  • rocm
  • blas
  • hip
  • pytorch
  • torchvision
  • stable-diffusion-webui
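
The pause mentioned above was just a blocking prompt so the process sits still long enough to attach gdb; roughly the following (the exact placement inside webui doesn't matter):

# Rough sketch of the pause added to webui so gdb can be attached.
# Once the prompt shows up: `gdb -p <pid>`, set a breakpoint (e.g. on
# hip_code_object.cpp:616 from the log above), `continue`, then press Enter here.
import os

print(f"PID {os.getpid()}: attach gdb now, then press Enter to continue")
input()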
(gdb) bt
#0  hip::DynCO::getDynFunc (func_name=..., hfunc=<optimized out>, this=0x76732cb4b7d0) at /usr/src/debug/hip-runtime-amd/clr-rocm-6.0.0/hipamd/src/hip_code_object.cpp:616
#1  PlatformState::getDynFunc (
    func_name=0x76732c011850 "Cijk_Ailk_Bljk_HHS_BH_MT16x64x16_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS0_FSSC10_FL0_GRPM1_GRVW1_GSU1_GSUASB_"..., hmod=0x76732c0b6ed0, hfunc=<optimized out>, this=0x6550370672d0) at /usr/src/debug/hip-runtime-amd/clr-rocm-6.0.0/hipamd/src/hip_platform.cpp:747
#2  hipModuleGetFunction (hfunc=<optimized out>, hmod=<optimized out>, 
    name=0x76732c011850 "Cijk_Ailk_Bljk_HHS_BH_MT16x64x16_SN_1LDSB0_APM1_ABV0_ACED0_AF0EM1_AF1EM1_AMAS0_ASE_ASGT_ASLT_ASAE01_ASCE01_ASEM1_AAC0_BL0_BS1_DTLA0_DTLB0_DTVA0_DTVB0_DVO0_ETSP_EPS0_FSSC10_FL0_GRPM1_GRVW1_GSU1_GSUASB_"...) at /usr/src/debug/hip-runtime-amd/clr-rocm-6.0.0/hipamd/src/hip_module.cpp:82
#3  0x00007675c994165c in ?? () from /opt/rocm/lib/librocblas.so.4
#4  0x00007675c994250c in ?? () from /opt/rocm/lib/librocblas.so.4
#5  0x00007675c994280c in ?? () from /opt/rocm/lib/librocblas.so.4
#6  0x00007675c8f88a6f in ?? () from /opt/rocm/lib/librocblas.so.4
#7  0x00007675c9061ec9 in ?? () from /opt/rocm/lib/librocblas.so.4
#8  0x00007675c905fa3c in ?? () from /opt/rocm/lib/librocblas.so.4
#9  0x00007675c905ac1e in ?? () from /opt/rocm/lib/librocblas.so.4
#10 0x00007675c90586d9 in rocblas_gemm_ex () from /opt/rocm/lib/librocblas.so.4
#11 0x0000767646fbda9a in ?? () from /usr/lib/libtorch_hip.so
#12 0x0000767646fdb254 in ?? () from /usr/lib/libtorch_hip.so
#13 0x0000767647122601 in ?? () from /usr/lib/libtorch_hip.so
#14 0x00007676471226a4 in ?? () from /usr/lib/libtorch_hip.so
#15 0x00007676a8c791cc in at::_ops::addmm::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&) () from /usr/lib/libtorch_cpu.so
#16 0x00007676aad66f13 in ?? () from /usr/lib/libtorch_cpu.so
#17 0x00007676aad67e46 in ?? () from /usr/lib/libtorch_cpu.so
#18 0x00007676a8cee25b in at::_ops::addmm::call(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::Scalar const&, c10::Scalar const&) () from /usr/lib/libtorch_cpu.so
#19 0x00007676a851bd80 in at::native::linear(at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&) () from /usr/lib/libtorch_cpu.so
#20 0x00007676a975fb5b in ?? () from /usr/lib/libtorch_cpu.so
#21 0x00007676a8cd56c7 in at::_ops::linear::call(at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&) () from /usr/lib/libtorch_cpu.so
#22 0x00007676b3222be8 in ?? () from /usr/lib/python3.11/site-packages/torch/lib/libtorch_python.so
#23 0x00007676be1fdd41 in cfunction_call (func=0x7674af8af470, args=<optimized out>, kwargs=<optimized out>) at Objects/methodobject.c:542
#24 0x00007676be1dc054 in _PyObject_MakeTpCall (tstate=0x65504125ade0, callable=0x7674af8af470, args=<optimized out>, nargs=3, keywords=0x0) at Objects/call.c:214
#25 0x00007676be1e76e1 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:4769
#26 0x00007676be22de9f in _PyEval_EvalFrame (throwflag=0, frame=0x7676053e9428, tstate=0x65504125ade0) at ./Include/internal/pycore_ceval.h:73
#27 _PyEval_Vector (kwnames=0x0, argcount=<optimized out>, args=0x7673311ff1b0, locals=0x0, func=0x76733198c5e0, tstate=0x65504125ade0) at Python/ceval.c:6434
#28 _PyFunction_Vectorcall (kwnames=0x0, nargsf=<optimized out>, stack=0x7673311ff1b0, func=0x76733198c5e0) at Objects/call.c:393
#29 _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x7673311ff1b0, callable=0x76733198c5e0, tstate=0x65504125ade0) at ./Include/internal/pycore_call.h:92
#30 method_vectorcall (method=<optimized out>, args=<optimized out>, nargsf=<optimized out>, kwnames=0x0) at Objects/classobject.c:89
#31 0x00007676be1eb6a3 in do_call_core (use_tracing=<optimized out>, kwdict=0x7673304003c0, callargs=0x7673301e9840, func=0x76734ac3d600, tstate=<optimized out>) at Python/ceval.c:7352
#32 _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:5376
#33 0x00007676be22de9f in _PyEval_EvalFrame (throwflag=0, frame=0x7676053e9308, tstate=0x65504125ade0) at ./Include/internal/pycore_ceval.h:73
#34 _PyEval_Vector (kwnames=0x0, argcount=<optimized out>, args=0x7673311ff3e0, locals=0x0, func=0x7674ae114680, tstate=0x65504125ade0) at Python/ceval.c:6434
#35 _PyFunction_Vectorcall (kwnames=0x0, nargsf=<optimized out>, stack=0x7673311ff3e0, func=0x7674ae114680) at Objects/call.c:393
#36 _PyObject_VectorcallTstate (kwnames=0x0, nargsf=<optimized out>, args=0x7673311ff3e0, callable=0x7674ae114680, tstate=0x65504125ade0) at ./Include/internal/pycore_call.h:92
#37 method_vectorcall (method=<optimized out>, args=<optimized out>, nargsf=<optimized out>, kwnames=0x0) at Objects/classobject.c:89
#38 0x00007676be1eb6a3 in do_call_core (use_tracing=<optimized out>, kwdict=0x76732b538e00, callargs=0x76734ac34d90, func=0x76734ac0a080, tstate=<optimized out>) at Python/ceval.c:7352
#39 _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:5376
#40 0x00007676be20e8c0 in _PyEval_EvalFrame (throwflag=0, frame=0x7676053e9280, tstate=0x65504125ade0) at ./Include/internal/pycore_ceval.h:73
#41 _PyEval_Vector (kwnames=<optimized out>, argcount=2, args=0x7673311ff6a0, locals=0x0, func=0x7674ae1145e0, tstate=0x65504125ade0) at Python/ceval.c:6434
#42 _PyFunction_Vectorcall (func=0x7674ae1145e0, stack=0x7673311ff6a0, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:393
#43 0x00007676be1e0d97 in _PyObject_FastCallDictTstate (tstate=0x65504125ade0, callable=0x7674ae1145e0, args=<optimized out>, nargsf=<optimized out>, kwargs=<optimized out>) at Objects/call.c:141
#44 0x00007676be216b3d in _PyObject_Call_Prepend (tstate=0x65504125ade0, callable=0x7674ae1145e0, obj=0x76734ac3d6d0, args=<optimized out>, kwargs=0x0) at Objects/call.c:482
#45 0x00007676be2dba82 in slot_tp_call (self=0x76734ac3d6d0, args=0x76734ac34d30, kwds=0x0) at Objects/typeobject.c:7623
#46 0x00007676be1dc054 in _PyObject_MakeTpCall (tstate=0x65504125ade0, callable=0x76734ac3d6d0, args=<optimized out>, nargs=1, keywords=0x0) at Objects/call.c:214
#47 0x00007676be1e76e1 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:4769
#48 0x00007676be20e8c0 in _PyEval_EvalFrame (throwflag=0, frame=0x7676053e91f8, tstate=0x65504125ade0) at ./Include/internal/pycore_ceval.h:73
#49 _PyEval_Vector (kwnames=<optimized out>, argcount=0, args=0x7676be574010 <_PyRuntime+58928>, locals=0x0, func=0x767349893880, tstate=0x65504125ade0) at Python/ceval.c:6434
#50 _PyFunction_Vectorcall (func=0x767349893880, stack=0x7676be574010 <_PyRuntime+58928>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:393
#51 0x00007676be2e2297 in bounded_lru_cache_wrapper (self=0x767349937ed0, args=0x7676be573ff8 <_PyRuntime+58904>, kwds=0x0) at ./Modules/_functoolsmodule.c:1021
#52 0x00007676be1dc054 in _PyObject_MakeTpCall (tstate=0x65504125ade0, callable=0x767349937ed0, args=<optimized out>, nargs=0, keywords=0x0) at Objects/call.c:214
#53 0x00007676be1e76e1 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:4769
#54 0x00007676be20e8c0 in _PyEval_EvalFrame (throwflag=0, frame=0x7676053e9188, tstate=0x65504125ade0) at ./Include/internal/pycore_ceval.h:73
#55 _PyEval_Vector (kwnames=<optimized out>, argcount=0, args=0x7676be574010 <_PyRuntime+58928>, locals=0x0, func=0x76733132b9c0, tstate=0x65504125ade0) at Python/ceval.c:6434
#56 _PyFunction_Vectorcall (func=0x76733132b9c0, stack=0x7676be574010 <_PyRuntime+58928>, nargsf=<optimized out>, kwnames=<optimized out>) at Objects/call.c:393
#57 0x00007676be1eb6a3 in do_call_core (use_tracing=<optimized out>, kwdict=0x767331384e40, callargs=0x7676be573ff8 <_PyRuntime+58904>, func=0x76733132b9c0, tstate=<optimized out>) at Python/ceval.c:7352
#58 _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/ceval.c:5376
#59 0x00007676be22e583 in _PyEval_EvalFrame (throwflag=0, frame=0x7676053e9020, tstate=0x65504125ade0) at ./Include/internal/pycore_ceval.h:73
#60 _PyEval_Vector (kwnames=<optimized out>, argcount=<optimized out>, args=0x7673311ffe28, locals=0x0, func=0x7676bd476ca0, tstate=0x65504125ade0) at Python/ceval.c:6434
#61 _PyFunction_Vectorcall (kwnames=<optimized out>, nargsf=<optimized out>, stack=0x7673311ffe28, func=0x7676bd476ca0) at Objects/call.c:393
#62 _PyObject_VectorcallTstate (tstate=0x65504125ade0, callable=0x7676bd476ca0, args=0x7673311ffe28, nargsf=<optimized out>, kwnames=<optimized out>) at ./Include/internal/pycore_call.h:92
#63 0x00007676be22e070 in method_vectorcall (method=<optimized out>, args=0x7676be574010 <_PyRuntime+58928>, nargsf=<optimized out>, kwnames=0x0) at Objects/classobject.c:67
#64 0x00007676be2f4df8 in thread_run (boot_raw=0x655044198250) at ./Modules/_threadmodule.c:1124
#65 0x00007676be2cc538 in pythread_wrapper (arg=<optimized out>) at Python/thread_pthread.h:241
#66 0x00007676bdea955a in start_thread (arg=<optimized out>) at pthread_create.c:447
#67 0x00007676bdf26a3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
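
For completeness: the backtrace bottoms out in rocblas_gemm_ex, reached from an fp16 addmm via torch.nn.functional.linear, so the same path can be exercised directly without webui (another sketch; the shapes are arbitrary):

# Sketch: drive the addmm -> rocblas_gemm_ex path shown in the backtrace.
import torch

device = torch.device("cuda")
lin = torch.nn.Linear(320, 320).to(device=device, dtype=torch.float16)
x = torch.rand(4, 320, device=device, dtype=torch.float16)
print(lin(x).float().abs().mean().item())  # F.linear -> addmm -> rocblas_gemm_ex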

Operating System

Arch Linux, kernel 6.7.4-arch1-1

CPU

AMD Threadripper 1950X

GPU

AMD Radeon RX 7900 XTX

ROCm Version

ROCm 6.0.0

ROCm Component

No response

Steps to Reproduce

To reproduce:

1. Create a venv and activate it.
2. Install stable-diffusion-webui, following https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Install-and-Run-on-AMD-GPUs#user-content-install-on-amd-and-arch-linux
3. Download any SD model and place it in the models folder.
4. Run either ./webui.sh or python launch.py
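
Before step 4 it can be worth confirming, inside the venv, that the torch build actually sees the GPU; a quick check (not part of the webui steps):

# Quick sanity check inside the venv before launching webui.
import torch

print(torch.__version__)          # torch version installed in the venv
print(torch.version.hip)          # HIP/ROCm version torch was built against (None on CUDA builds)
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # should report the RX 7900 XTX
else:
    print("no GPU visible to torch")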

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module is loaded
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen Threadripper 1950X 16-Core Processor
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen Threadripper 1950X 16-Core Processor
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3400                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            32                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: FINE GRAINED        
      Size:                    131743808(0x7da4040) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    131743808(0x7da4040) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 3                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    131743808(0x7da4040) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
*******                  
Agent 2                  
*******                  
  Name:                    gfx1100                            
  Uuid:                    GPU-<redacted>               
  Marketing Name:          AMD Radeon RX 7900 XTX             
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          64(0x40)                           
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      32(0x20) KB                        
    L2:                      6144(0x1800) KB                    
    L3:                      98304(0x18000) KB                  
  Chip ID:                 29772(0x744c)                      
  ASIC Revision:           0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2482                               
  BDFID:                   17152                              
  Internal Node ID:        1                                  
  Compute Unit:            96                                 
  SIMDs per CU:            2                                  
  Shader Engines:          6                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Coherent Host Access:    FALSE                              
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      TRUE                               
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        32(0x20)                           
  Max Work-item Per CU:    1024(0x400)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Packet Processor uCode:: 528                                
  SDMA engine uCode::      19                                 
  IOMMU Support::          None                               
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    25149440(0x17fc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    25149440(0x17fc000) KB             
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 3                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1100         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***             

Additional Information

No response

@Kamishirasawa-keine

Same issue.

@alexxu-amd

Yea, I am able to reproduce this error using the installation steps from https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Install-and-Run-on-AMD-GPUs#user-content-install-on-amd-and-arch-linux.

Can you guys try reinstalling torch and torchvision using the following?

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.0
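
After reinstalling, something along these lines should confirm that the nightly ROCm 6.0 wheel is the one actually being imported in the venv (a sketch):

# Sketch: verify which torch build the venv picks up after the reinstall.
import torch

print(torch.__version__)   # expect a dev/nightly version string
print(torch.version.hip)   # expect a 6.0.x HIP version for the rocm6.0 nightly wheels
print(torch.cuda.is_available())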

@ppanchad-amd

@AphidGit @Kamishirasawa-keine Please try @alexxu-amd's suggestion above. Thanks!
