-
Notifications
You must be signed in to change notification settings - Fork 52
Open
Description
Description
Currently, when generating kernels for Ascend devices (e.g., using TileLang/Codegen), there is no static check for the total size of allocated Unified Buffer (UB).
If a user defines a tiling strategy that requests more UB memory than the hardware limit (e.g., exceeding 192KB/256KB depending on the chip), the compilation passes successfully. The error is only caught during runtime execution on the NPU, typically manifesting as an obscure hardware exception (e.g., error code 507015 or VEC instruction error: the ub address out of bounds).
This behavior makes debugging extremely difficult, as the runtime error stack trace does not directly point to the "oversized allocation" in the source code.
Current Behavior
- The compiler allows allocating UB size beyond hardware physical limits.
- The program crashes at runtime with generic errors like:
[E1222 16:07:41.419460944 compiler_depend.ts:442] operator():build/CMakeFiles/torch_npu.dir/compiler_depend.ts:265 NPU function error: call aclnnCat failed, error code is 507015 [ERROR] 2025-12-22-16:07:41 (PID:297856, Device:0, RankID:-1) ERR00100 PTA call acl api failed [Error]: The aicore execution is abnormal. Rectify the fault based on the error information in the ascend log. [PID: 297856] 2025-12-22-16:07:41.478.942 AclNN_Runtime_Error(EZ9903): rtKernelLaunchWithHandleV2 failed: 507015 Solution: In this scenario, collect the plog when the fault occurs and locate the fault based on the plog. TraceBack (most recent call last): Failed to submit kernel task, retCode=0x7150026.[FUNC:LaunchKernelSubmit][FILE:context.cc][LINE:1164] kernel launch submit failed.[FUNC:LaunchKernelWithHandle][FILE:context.cc][LINE:1428] rtKernelLaunchWithHandleV2 execute failed, reason=[aicore exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53] rtKernelLaunchWithHandleV2 failed: 507015 #### KernelLaunch failed: /home/h00909914/cann/day251203/env/ascend-toolkit/8.3.RC1/opp/built-in/op_impl/ai_core/tbe//kernel/ascend910_93/concat_d/ConcatD_233851a3505389e43928a8bba133a74d_high_performance.o Kernel Run failed. opType: 8, ConcatD launch failed for ConcatD, errno:361001. Exception raised from operator() at build/CMakeFiles/torch_npu.dir/compiler_depend.ts:265 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0xb8 (0xfffefec03908 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10.so) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x6c (0xfffefebb2404 in /usr/local/lib/python3.11/site-packages/torch/lib/libc10.so) frame #2: <unknown function> + 0xd97fa0 (0xfffdf5da4fa0 in /usr/local/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so) frame #3: <unknown function> + 0x24c80ec (0xfffdf74d50ec in /usr/local/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so) frame #4: <unknown function> + 0x8a14c4 (0xfffdf58ae4c4 in /usr/local/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so) frame #5: <unknown function> + 0x8a3a7c (0xfffdf58b0a7c in /usr/local/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so) frame #6: <unknown function> + 0x8a006c (0xfffdf58ad06c in /usr/local/lib/python3.11/site-packages/torch_npu/lib/libtorch_npu.so) frame #7: <unknown function> + 0xce4dc (0xffff0b0544dc in /usr/lib64/libstdc++.so.6) frame #8: <unknown function> + 0x821e8 (0xffffb0e761e8 in /usr/lib64/libc.so.6) frame #9: <unknown function> + 0xe95dc (0xffffb0edd5dc in /usr/lib64/libc.so.6) Traceback (most recent call last): File "/usr/local/lib/python3.11/site-packages/torch/testing/_comparison.py", line 1232, in not_close_error_metas pair.compare() File "/usr/local/lib/python3.11/site-packages/torch/testing/_comparison.py", line 711, in compare self._compare_values(actual, expected) File "/usr/local/lib/python3.11/site-packages/torch/testing/_comparison.py", line 841, in _compare_values compare_fn( File "/usr/local/lib/python3.11/site-packages/torch/testing/_comparison.py", line 1023, in _compare_regular_values_close if torch.all(matches): RuntimeError: operator():build/CMakeFiles/torch_npu.dir/compiler_depend.ts:47 NPU function error: c10_npu::acl::AclrtSynchronizeStreamWithTimeout(copy_stream), error code is 507015 [ERROR] 2025-12-22-16:07:41 (PID:297856, Device:0, RankID:-1) ERR00100 PTA call acl api failed [Error]: The aicore execution is abnormal. Rectify the fault based on the error information in the ascend log. EZ9999: Inner Error! EZ9999[PID: 297856] 2025-12-22-16:07:41.442.168 (EZ9999): The error from device(chipId:0, dieId:0), serial number is 604, there is an exception of fftsplus aivector error, core id is 0, error code = 0x4000000000000000, dump info: pc start: 0x12400000083c, current: 0x124000000d88, vec error info: 0x300000056, mte error info: 0x6000089, ifu error info: 0x112c140900000, ccu error info: 0x40e0000001800056, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd00028c, para base: 0x12c100340080.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_core_proc.cc][LINE:333] TraceBack (most recent call last): The extend info: errcode:(0x4000000000000000, 0x4000, 0) errorStr: VEC instruction error: the ub address out of bounds.CCU instruction address check error. fixp_error0 info: 0x6000089, fixp_error1 info: 0, fsmId:0, tslot:2, thread:0, ctxid:0, blk:4, sublk:1, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_core_proc.cc][LINE:353] Kernel task happen error, retCode=0x26, [aicore exception].[FUNC:PreCheckTaskErr][FILE:davinci_kernel_task.cc][LINE:1555] AICORE Kernel task happen error, retCode=0x26.[FUNC:GetError][FILE:stream.cc][LINE:1191] rtStreamSynchronizeWithTimeout execute failed, reason=[aicore exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53] synchronize stream with timeout failed, runtime result = 507015[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:162] During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/home/h00909914/workspace/api/tilelang-ascend/examples/elementwise/elementwise_pow.py", line 66, in <module> torch.testing.assert_close(c, ref_c, rtol=1e-2, atol=1e-2) File "/usr/local/lib/python3.11/site-packages/torch/testing/_comparison.py", line 1508, in assert_close error_metas = not_close_error_metas( ^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/torch/testing/_comparison.py", line 1239, in not_close_error_metas f"Comparing\n\n" File "/usr/local/lib/python3.11/site-packages/torch/testing/_comparison.py", line 378, in __repr__ body = [ ^ File "/usr/local/lib/python3.11/site-packages/torch/testing/_comparison.py", line 379, in <listcomp> f" {name}={value!s}," File "/usr/local/lib/python3.11/site-packages/torch/_tensor.py", line 523, in __repr__ return torch._tensor_str._str(self, tensor_contents=tensor_contents) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/torch/_tensor_str.py", line 708, in _str return _str_intern(self, tensor_contents=tensor_contents) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/torch/_tensor_str.py", line 625, in _str_intern tensor_str = _tensor_str(self, indent) ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/torch/_tensor_str.py", line 357, in _tensor_str formatter = _Formatter(get_summarized_data(self) if summarize else self) ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/site-packages/torch/_tensor_str.py", line 393, in get_summarized_data return torch.stack([get_summarized_data(x) for x in (start + end)]) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ RuntimeError: The Inner error is reported as above. The process exits for this inner error, and the current working operator name is aclnnCat. Since the operator is called asynchronously, the stacktrace may be inaccurate. If you want to get the accurate stacktrace, please set the environment variable ASCEND_LAUNCH_BLOCKING=1. Note: ASCEND_LAUNCH_BLOCKING=1 will force ops to run in synchronous mode, resulting in performance degradation. Please unset ASCEND_LAUNCH_BLOCKING in time after debugging. [ERROR] 2025-12-22-16:07:41 (PID:297856, Device:0, RankID:-1) ERR00100 PTA call acl api failed. [W1222 16:07:41.446160004 compiler_depend.ts:536] Warning: NPU warning, error code is 507015[Error]: [Error]: The aicore execution is abnormal. Rectify the fault based on the error information in the ascend log. EZ9999: Inner Error! EZ9999[PID: 297856] 2025-12-22-16:07:41.506.831 (EZ9999): [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1191] TraceBack (most recent call last): [AIC_INFO] after execute:mixCtx print end[FUNC:GetError][FILE:stream.cc][LINE:1191] Aicore kernel execute failed, device_id=0, stream_id=2, report_stream_id=2, task_id=2, flip_num=0, fault kernel_name=main_kernel, fault kernel info ext=main_kernel, program id=1, hash=2295926399778295069.[FUNC:GetError][FILE:stream.cc][LINE:1191] rtDeviceSynchronizeWithTimeout execute failed, reason=[aicore exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53] wait for compute device to finish failed, runtime result = 507015.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:162] (function npuSynchronizeUsedDevices) [W1222 16:07:41.447917957 compiler_depend.ts:518] Warning: NPU warning, error code is 507015[Error]: [Error]: The aicore execution is abnormal. Rectify the fault based on the error information in the ascend log. EH9999: Inner Error! rtDeviceSynchronizeWithTimeout execute failed, reason=[aicore exception][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53] EH9999[PID: 297856] 2025-12-22-16:07:41.509.552 (EH9999): wait for compute device to finish failed, runtime result = 507015.[FUNC:ReportCallError][FILE:log_inner.cpp][LINE:162] TraceBack (most recent call last): (function npuSynchronizeDevice)
Expected Behavior
- Implement a static check during the Codegen phase.
- Calculate the total size of all buffers allocated in UB (e.g.,
A_ub,B_ub,C_ub, intermediate buffers). - If the total size exceeds the hardware capacity (e.g.,
get_hardware_ub_limit()), throw a compilation error immediately with a clear message. - Example Error Message:
Compilation Error: Total requested UB size (250KB) exceeds the hardware limit (192KB). Please reduce the tile size.
Context**
- Backend: Ascend / NPU
- Error type: Runtime Address Out of Bounds
Metadata
Metadata
Assignees
Labels
No labels