Conversation

@vesajaaskelainen (Contributor):

With the Xilinx Zynq MPSoC XCZU15EG we noticed that HUK generation failed, leaving secure storage non-operational.

The verification path did not actually complete: the destination DDR memory had not yet received the result of the AES decryption operation.

Since we know what the output should be, retry for up to 1 second, giving the CPU internals time to deliver the result to DDR memory.

This is a somewhat annoying issue, and it only happens on a secured device, which makes it harder to debug.

So far this has not happened on any device other than the XCZU15EG. We have used several Xilinx MPSoC variants and they have been fine so far.

In practice a single retry with a 1 ms delay was enough for the result to become available in DDR memory for the sanity comparison.

In theory the same issue could occur in the forward (encryption) direction too -- but enough time is probably spent there already.

If someone figures out a better fix I would be very happy, as this only covers this particular case.

The existing code inadvertently used the correct length for the iv
buffer, but expressed it the wrong way.

Use sizeof(iv), as is done in the encrypt operation.

Signed-off-by: Vesa Jääskeläinen <[email protected]>
Cache operations should never go out of bounds of the allocated data, as
that can have inadvertent effects.

Also switch to the more generic cache_operation() function, which handles
both the L1 and L2 caches.

Signed-off-by: Vesa Jääskeläinen <[email protected]>
With the Xilinx Zynq MPSoC XCZU15EG it was noticed that HUK generation
failed, causing secure storage to not be operational.

The verification path did not actually complete: the destination DDR
memory had not yet received the result of the AES decryption operation.

Since we know what the output should be, retry for up to 1 second,
giving the CPU internals time to deliver the result to DDR memory.

Signed-off-by: Vesa Jääskeläinen <[email protected]>
}

/*
* Give CPU's internal fifos some time to transfer data to the
Contributor:

Isn't this the CSUDMA transfer to RAM that is incorrectly signaled as completed?

Contributor Author:

Since this is more or less a black box for us, we are somewhat guessing here :|

The current theory is that there are several internal FIFOs between the domains, and those "split" the bus ack so that the sender (CSUDMA) thinks the transfer is complete. Some internal timing on the larger chip (or similar) has changed, so the delay before the result arrives in DDR is slightly longer.

In our testing a single round, i.e. 1 ms, was enough, but we went for a larger retry count.

That being said -- other CSUDMA operations are subject to the same issue. I would really like to hear from the AMD/Xilinx folks what the right thing to do here is.

We had a similar issue with the FSBL, where we decided to disable CSUDMA usage altogether for the "memcpy" operation optimization. That happened on the smaller chips as well.

Contributor Author:

Btw. we have also observed a similar issue when transferring data from the FPGA to DDR: some data may take a moment to arrive after we get the IRQ, so we had to implement waiting logic with magic and counter checks to know that the data has fully arrived before starting to work with it. We concluded that this is normal operation. It could be the same FIFO issue in both cases.

Contributor Author:

I prepared a different kind of change that still needs testing on secured hardware -- but more or less I decided that 1 ms should be plenty for the internal FIFOs to work their way through, and I relocated the fixup into the CSU DMA driver:

diff --git a/core/drivers/zynqmp_csudma.c b/core/drivers/zynqmp_csudma.c
index 4e3b59c9c..e20a2798e 100644
--- a/core/drivers/zynqmp_csudma.c
+++ b/core/drivers/zynqmp_csudma.c
@@ -74,6 +74,28 @@ TEE_Result zynqmp_csudma_sync(enum zynqmp_csudma_channel channel)
                status = io_read32(dma + CSUDMA_I_STS_OFFSET);
                if ((status & CSUDMA_IXR_DONE_MASK) == CSUDMA_IXR_DONE_MASK) {
                        csudma_clear_intr(channel, CSUDMA_IXR_DONE_MASK);
+
+                       if (channel == ZYNQMP_CSUDMA_DST_CHANNEL) {
+                               /*
+                               * This extra delay is a workaround for a hardware
+                               * issue where the CSU-DMA transfer has not yet been
+                               * delivered to DDR memory.
+                               *
+                               * Even though there is now a signal that the transfer
+                               * has left the DMA engine, there are still other
+                               * CPU-internal FIFOs that need some time to move the
+                               * data to its final destination.
+                               *
+                               * A 1 millisecond delay should be more than enough to
+                               * transfer the remaining data.
+                               *
+                               * Technically this is only needed for a DDR destination,
+                               * but as operations in OP-TEE are few, the extra delay
+                               * for the crypto destination should not cause much of an
+                               * issue. The driver thereby "guarantees" that the data
+                               * is in the destination before returning to the caller.
+                               */
+                               mdelay(1);
+                       }
                        return TEE_SUCCESS;
                }
        }

What do you think -- should we actually panic() if HUK generation fails?

Contributor:

I think that it makes sense. As long as the CSUDMA is not used excessively on the OP-TEE side this should not impact performance. I agree that getting some feedback from Xilinx/AMD would be great here as well.

Contributor Author:

Unfortunately that alternative change did not work in the hardware tests.

Contributor:

I based the csudma driver on https://github.com/Xilinx/embeddedsw/tree/master/XilinxProcessorIPLib/drivers/csudma. Could you check if there is any fix there for the issue you are reporting?

Contributor Author:

I can check -- their CSU-DMA memcpy() has been broken for some releases; we have a patch to work around that.

Contributor Author:

Unfortunately that alternative change did not work in the hardware tests.

I will retry it together with the cache line fix, where the size had an insufficient number of bytes.

* If the transfer is not completed within 1 second then there
* is a real failure.
*/
if (retry_counter >= 1000) {
Contributor:

Suggestion: move this into the while() condition and check for errors by inspecting the retry counter afterwards. Does not need to be applied.

Contributor Author:

I am OK to change the construct.

Contributor:

Please could you extract the workaround section into its own function? It makes the code easier to understand.

@ldts (Contributor) left a comment:

Thanks. Yes, the assert at the start of the function takes care of the actual restriction (re 87ddd09).

if (channel == ZYNQMP_CSUDMA_DST_CHANNEL) {
dma = dma + CSUDMA_OFFSET_DIFF;
dcache_inv_range(addr, SHIFT_U64(len, CSUDMA_SIZE_SHIFT));
res = cache_operation(TEE_CACHEINVALIDATE, addr, len);
@ldts (Contributor), Dec 9, 2025:

are you sure this handles words and not bytes?

Contributor:

Also, please add some details -- this is not fixing anything, just introducing a regression.

Contributor Author:

With cache functions you should never go beyond the range of the expected bytes, or you can get random problems: you might inadvertently touch memory used by some other code and cause unexpected behavior there.

cache_operation() is used in many places in OP-TEE for these operations. The unit of the length is not specified as such, but all code references tend to assume a number of bytes.

dcache_inv_range() is no longer used directly in OP-TEE.

There is a helper allocator function, alloc_cache_aligned(), that would normally allocate cache-aligned memory, but here we have a variable size and alignment magic on the stack, so the cache line alignment happens internally.

Contributor:

> dcache_inv_range() is no longer being used directly in OP-TEE

Yes, it is. And I'd say that I tend to prefer it over cache_operation().

Contributor:

@vesajaaskelainen I don't see what magic you are referring to, and I don't think anything needs fixing here. The alignment is checked in the function for these page sizes.

@vesajaaskelainen (Contributor Author), Dec 10, 2025:

dcache_inv_range() takes its size in bytes.

Now the size is given as SHIFT_U64(len, CSUDMA_SIZE_SHIFT), where:

#define CSUDMA_SIZE_SHIFT          2

which is the same as multiplying by 4.

But what I did not spot on the first round was:

	len = len / sizeof(uint32_t);

So sorry about that.

To give the full picture, the original code:

	/* convert to 32 bit word transfers */
	len = len / sizeof(uint32_t);

	if (channel == ZYNQMP_CSUDMA_DST_CHANNEL) {
		dma = dma + CSUDMA_OFFSET_DIFF;
		dcache_inv_range(addr, SHIFT_U64(len, CSUDMA_SIZE_SHIFT));
	} else {
		dcache_clean_range(addr, SHIFT_U64(len, CSUDMA_SIZE_SHIFT));
	}

where sizeof(uint32_t) == 4, so the two conversions cancel each other out.

Let me test the msleep(1) workaround again with the point above in mind.

Contributor Author:

And the magic that I referred to is:

	uint8_t dst[ZYNQMP_CSU_AES_DST_LEN(sizeof(src))]
		__aligned_csuaes = { 0 };

// where:
#define __aligned_csuaes	__aligned_csudma
#define ZYNQMP_CSUDMA_ALIGN			64
#define __aligned_csudma			__aligned(ZYNQMP_CSUDMA_ALIGN)

With __aligned() we align the memory to start on a boundary matching the CSUDMA alignment requirement.

Contributor Author:

> dcache_inv_range() is no longer being used directly in OP-TEE

> Yes, it is. And I'd say that I tend to prefer it over cache_operation().

I was looking with git grep:

$ git grep dcache_inv_range
core/arch/arm/kernel/cache_helpers_a32.S:FUNC dcache_inv_range , :
core/arch/arm/kernel/cache_helpers_a32.S:END_FUNC dcache_inv_range
core/arch/arm/kernel/cache_helpers_a64.S:FUNC dcache_inv_range , :
core/arch/arm/kernel/cache_helpers_a64.S:END_FUNC dcache_inv_range
core/arch/arm/kernel/entry_a32.S:               bl      dcache_inv_range
core/arch/arm/mm/core_mmu.c:            dcache_inv_range(va, len);
core/arch/riscv/kernel/cache_helpers_rv.S:/* void dcache_inv_range(void *addr, size_t size); */
core/arch/riscv/kernel/cache_helpers_rv.S:FUNC dcache_inv_range , :
core/arch/riscv/kernel/cache_helpers_rv.S:END_FUNC dcache_inv_range
core/arch/riscv/mm/core_mmu_arch.c:             dcache_inv_range(va, len);
core/drivers/zynqmp_csudma.c:           dcache_inv_range(addr, SHIFT_U64(len, CSUDMA_SIZE_SHIFT));
core/include/kernel/cache_helpers.h:void dcache_inv_range(void *addr, size_t size);

And the comparison:

$ git grep cache_operation | wc -l
150

It is used a lot in drivers.

This made me believe that cache_operation() is the "public API" and the way to go.

Contributor Author:

Please correct me if I am wrong, but doesn't dcache_inv_range() operate on the CPU's L1 cache only?

If we look at the Zynq UltraScale+ Device TRM:

From the CSU processor to DDR:

  • IOP Outbound
  • LPD Main Switch
  • CCI
  • DDR Memory Controller

From A53 core X to DDR:

  • 32K L1 cache (for A53 core X)
  • SCU
  • 1MB L2 cache (shared by all A53 cores)
  • CCI
  • DDR Memory Controller

The shared parts here are the CCI and DDR. If the CCI knows that everything is correct, that should be good enough -- I don't see the need for any CCI operations.

The SCU should help for requests that it sees, but the CSU processor flow does not go through it.

But what we do have is the L2 cache, which is handled by cache_operation().

Contributor:

> Please correct me if I am wrong, but doesn't dcache_inv_range() operate on the CPU's L1 cache only?

It operates on all internal caches up to the Point of Coherency.

But I was a bit quick to dismiss cache_operation(); it is the right thing to use from a driver. cache_operation() also handles an eventual external cache.

The dcache_*() functions are sometimes more precise, so there are situations where they are needed.

@vesajaaskelainen (Contributor Author):

I will lose access to the secure hardware during the holiday season, so I'll continue working on this PR after the holidays. Sorry for the delay.
