
[Feature Request] Add compile-time static check for UB memory allocation size to prevent runtime OOB errors #159

@fengz72

Description


Currently, when generating kernels for Ascend devices (e.g., via TileLang/Codegen), there is no static check on the total size of Unified Buffer (UB) allocations.

If a user defines a tiling strategy that requests more UB memory than the hardware provides (e.g., 192KB or 256KB depending on the chip), compilation still succeeds. The error only surfaces at runtime on the NPU, typically as an obscure hardware exception (e.g., error code 507015, or "VEC instruction error: the ub address out of bounds").

This behavior makes debugging extremely difficult, as the runtime error stack trace does not directly point to the "oversized allocation" in the source code.
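
For illustration, here is a back-of-the-envelope sketch of how easily a tiling choice can oversubscribe UB (the buffer names and shapes are invented for this example, not taken from the failing script): three float16 tiles of 256x512 already need 768KB, far above a 192KB UB, yet such a kernel compiles today and only fails on device.

    # Hypothetical tiling choice -- illustrative numbers only, not from the report above.
    # Three float16 tiles of shape (256, 512) staged in UB.
    import numpy as np

    block_m, block_n = 256, 512
    elem_bytes = np.dtype("float16").itemsize  # 2 bytes per element

    buffers = {
        "A_ub": (block_m, block_n),
        "B_ub": (block_m, block_n),
        "C_ub": (block_m, block_n),
    }
    total_bytes = sum(m * n * elem_bytes for (m, n) in buffers.values())

    UB_LIMIT_BYTES = 192 * 1024  # e.g., 192KB on some Ascend chips
    print(f"requested {total_bytes // 1024}KB vs UB limit {UB_LIMIT_BYTES // 1024}KB")
    # requested 768KB vs UB limit 192KB -> compiles today, fails at runtime with 507015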

Current Behavior

  • The compiler allows allocating UB size beyond hardware physical limits.
  • The program crashes at runtime with generic errors like:
      [E1222 16:07:41.419460944 compiler_depend.ts:442] operator():build/CMakeFiles/torch_npu.dir/compiler_depend.ts:265 NPU function error: call aclnnCat failed, error code is 507015
    [ERROR] 2025-12-22-16:07:41 (PID:297856, Device:0, RankID:-1) ERR00100 PTA call acl api failed
    [Error]: The aicore execution is abnormal. 
            Rectify the fault based on the error information in the ascend log.
    [PID: 297856] 2025-12-22-16:07:41.478.942 AclNN_Runtime_Error(EZ9903): rtKernelLaunchWithHandleV2 failed: 507015
            Solution: In this scenario, collect the plog when the fault occurs and locate the fault based on the plog.
            TraceBack (most recent call last):
            Failed to submit kernel task, retCode=0x7150026.[FUNC:LaunchKernelSubmit][FILE:context.cc][LINE:1164]
            kernel launch submit failed.[FUNC:LaunchKernelWithHandle][FILE:context.cc][LINE:1428]
            rtKernelLaunchWithHandleV2 execute failed, reason=[aicore exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
            rtKernelLaunchWithHandleV2 failed: 507015
            #### KernelLaunch failed: /home/h00909914/cann/day251203/env/ascend-toolkit/8.3.RC1/opp/built-in/op_impl/ai_core/tbe//kernel/ascend910_93/concat_d/ConcatD_233851a3505389e43928a8bba133a74d_high_performance.o
            Kernel Run failed. opType: 8, ConcatD
            launch failed for ConcatD, errno:361001.
    
    Exception raised from operator() at build/CMakeFiles/torch_npu.dir/compiler_depend.ts:265 (most recent call first):
    frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0xb8 (0xfffefec03908 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10.so)
    frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x6c (0xfffefebb2404 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10.so)
    frame #2: <unknown function> + 0xd97fa0 (0xfffdf5da4fa0 in /usr/local/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
    frame #3: <unknown function> + 0x24c80ec (0xfffdf74d50ec in /usr/local/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
    frame #4: <unknown function> + 0x8a14c4 (0xfffdf58ae4c4 in /usr/local/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
    frame #5: <unknown function> + 0x8a3a7c (0xfffdf58b0a7c in /usr/local/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
    frame #6: <unknown function> + 0x8a006c (0xfffdf58ad06c in /usr/local/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so)
    frame #7: <unknown function> + 0xce4dc (0xffff0b0544dc in /usr/lib64/libstdc++.so.6)
    frame #8: <unknown function> + 0x821e8 (0xffffb0e761e8 in /usr/lib64/libc.so.6)
    frame #9: <unknown function> + 0xe95dc (0xffffb0edd5dc in /usr/lib64/libc.so.6)
    
    Traceback (most recent call last):
      File "/usr/local/lib/python3.11/site-packages/torch/testing/_comparison.py", line 1232, in not_close_error_metas
        pair.compare()
      File "/usr/local/lib/python3.11/site-packages/torch/testing/_comparison.py", line 711, in compare
        self._compare_values(actual, expected)
      File "/usr/local/lib/python3.11/site-packages/torch/testing/_comparison.py", line 841, in _compare_values
        compare_fn(
      File "/usr/local/lib/python3.11/site-packages/torch/testing/_comparison.py", line 1023, in _compare_regular_values_close
        if torch.all(matches):
    RuntimeError: operator():build/CMakeFiles/torch_npu.dir/compiler_depend.ts:47 NPU function error: c10_npu::acl::AclrtSynchronizeStreamWithTimeout(copy_stream), error code is 507015
    [ERROR] 2025-12-22-16:07:41 (PID:297856, Device:0, RankID:-1) ERR00100 PTA call acl api failed
    [Error]: The aicore execution is abnormal. 
            Rectify the fault based on the error information in the ascend log.
    EZ9999: Inner Error!
    EZ9999[PID: 297856] 2025-12-22-16:07:41.442.168 (EZ9999):  The error from device(chipId:0, dieId:0), serial number is 604, there is an exception of fftsplus aivector error, core id is 0, error code = 0x4000000000000000, dump info: pc start: 0x12400000083c, current: 0x124000000d88, vec error info: 0x300000056, mte error info: 0x6000089, ifu error info: 0x112c140900000, ccu error info: 0x40e0000001800056, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd00028c, para base: 0x12c100340080.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_core_proc.cc][LINE:333]
            TraceBack (most recent call last):
           The extend info: errcode:(0x4000000000000000, 0x4000, 0) errorStr: VEC instruction error: the ub address out of bounds.CCU instruction address check error. fixp_error0 info: 0x6000089, fixp_error1 info: 0, fsmId:0, tslot:2, thread:0, ctxid:0, blk:4, sublk:1, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_core_proc.cc][LINE:353]
           Kernel task happen error, retCode=0x26, [aicore exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1555]
           AICORE Kernel task happen error, retCode=0x26.[FUNC:GetError][FILE:stream.cc][LINE:1191]
           rtStreamSynchronizeWithTimeout execute failed, reason=[aicore exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
           synchronize stream with timeout failed, runtime result = 507015[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:162]
    
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/home/h00909914/workspace/api/tilelang-ascend/examples/elementwise/elementwise_pow.py", line 66, in <module>
        torch.testing.assert_close(c, ref_c, rtol=1e-2, atol=1e-2)
      File "/usr/local/lib/python3.11/site-packages/torch/testing/_comparison.py", line 1508, in assert_close
        error_metas = not_close_error_metas(
                      ^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.11/site-packages/torch/testing/_comparison.py", line 1239, in not_close_error_metas
        f"Comparing\n\n"
      File "/usr/local/lib/python3.11/site-packages/torch/testing/_comparison.py", line 378, in __repr__
        body = [
               ^
      File "/usr/local/lib/python3.11/site-packages/torch/testing/_comparison.py", line 379, in <listcomp>
        f"    {name}={value!s},"
      File "/usr/local/lib/python3.11/site-packages/torch/_tensor.py", line 523, in __repr__
        return torch._tensor_str._str(self, tensor_contents=tensor_contents)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.11/site-packages/torch/_tensor_str.py", line 708, in _str
        return _str_intern(self, tensor_contents=tensor_contents)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.11/site-packages/torch/_tensor_str.py", line 625, in _str_intern
        tensor_str = _tensor_str(self, indent)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.11/site-packages/torch/_tensor_str.py", line 357, in _tensor_str
        formatter = _Formatter(get_summarized_data(self) if summarize else self)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/lib/python3.11/site-packages/torch/_tensor_str.py", line 393, in get_summarized_data
        return torch.stack([get_summarized_data(x) for x in (start + end)])
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    RuntimeError: The Inner error is reported as above. The process exits for this inner error, and the current working operator name is aclnnCat.
    Since the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, please set the environment variable ASCEND_LAUNCH_BLOCKING=1.
    Note: ASCEND_LAUNCH_BLOCKING=1 will force ops to run in synchronous mode, resulting in performance degradation. Please unset ASCEND_LAUNCH_BLOCKING in time after debugging.
    [ERROR] 2025-12-22-16:07:41 (PID:297856, Device:0, RankID:-1) ERR00100 PTA call acl api failed.
    
    [W1222 16:07:41.446160004 compiler_depend.ts:536] Warning: NPU warning, error code is 507015[Error]: 
    [Error]: The aicore execution is abnormal. 
            Rectify the fault based on the error information in the ascend log.
    EZ9999: Inner Error!
    EZ9999[PID: 297856] 2025-12-22-16:07:41.506.831 (EZ9999):  [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1191]
            TraceBack (most recent call last):
           [AIC_INFO] after execute:mixCtx print end[FUNC:GetError][FILE:stream.cc][LINE:1191]
           Aicore kernel execute failed, device_id=0, stream_id=2, report_stream_id=2, task_id=2, flip_num=0, fault kernel_name=main_kernel, fault kernel info ext=main_kernel, program id=1, hash=2295926399778295069.[FUNC:GetError][FILE:stream.cc][LINE:1191]
           rtDeviceSynchronizeWithTimeout execute failed, reason=[aicore exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
           wait for compute device to finish failed, runtime result = 507015.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:162]
     (function npuSynchronizeUsedDevices)
    [W1222 16:07:41.447917957 compiler_depend.ts:518] Warning: NPU warning, error code is 507015[Error]: 
    [Error]: The aicore execution is abnormal. 
            Rectify the fault based on the error information in the ascend log.
    EH9999: Inner Error!
            rtDeviceSynchronizeWithTimeout execute failed, reason=[aicore exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
    EH9999[PID: 297856] 2025-12-22-16:07:41.509.552 (EH9999):  wait for compute device to finish failed, runtime result = 507015.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:162]
            TraceBack (most recent call last):
     (function npuSynchronizeDevice)
    

Expected Behavior

  • Implement a static check during the Codegen phase (a rough sketch is given after this list).
  • Calculate the total size of all buffers allocated in UB (e.g., A_ub, B_ub, C_ub, and intermediate buffers).
  • If the total exceeds the hardware capacity (e.g., as reported by a get_hardware_ub_limit() helper), raise a compilation error immediately with a clear message.
  • Example error message: Compilation Error: Total requested UB size (250KB) exceeds the hardware limit (192KB). Please reduce the tile size.
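
A minimal sketch of such a check, assuming TileLang lowers UB buffers to TVM TIR Allocate nodes carrying a UB storage scope. The scope string, the pass placement, and the limit constant are assumptions; get_hardware_ub_limit() does not exist yet and is represented here by a hard-coded value.

    # Hedged sketch only: scope names and the UB limit query are assumptions, not existing APIs.
    import tvm
    from tvm import tir

    UB_SCOPES = {"ub"}               # assumed storage scope(s) used for UB allocations
    UB_LIMIT_BYTES = 192 * 1024      # stand-in for a future get_hardware_ub_limit()

    def check_ub_usage(prim_func: tir.PrimFunc) -> None:
        """Sum all statically sized UB allocations and fail fast if they exceed the limit."""
        total_bytes = 0

        def visit(node):
            nonlocal total_bytes
            if isinstance(node, tir.Allocate):
                # Recent TVM keeps the scope on the pointer type of the buffer var.
                scope = node.buffer_var.type_annotation.storage_scope
                if scope in UB_SCOPES:
                    elems = 1
                    for extent in node.extents:
                        elems *= int(extent)   # assumes compile-time-constant extents
                    total_bytes += elems * tvm.DataType(node.dtype).bits // 8

        tir.stmt_functor.post_order_visit(prim_func.body, visit)

        if total_bytes > UB_LIMIT_BYTES:
            raise ValueError(
                f"Compilation Error: Total requested UB size ({total_bytes // 1024}KB) "
                f"exceeds the hardware limit ({UB_LIMIT_BYTES // 1024}KB). "
                "Please reduce the tile size.")

Running a pass like this right before codegen would report the offending kernel at compile time instead of surfacing the opaque 507015 error at runtime.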

Context

  • Backend: Ascend / NPU
  • Error type: Runtime Address Out of Bounds
