Conversation

@vesajaaskelainen (Contributor):

With the Xilinx Zynq MPSoC XCZU15EG we noticed that HUK generation failed, leaving secure storage non-operational.

The verification path did not actually complete: the destination DDR memory had not yet received the result of the AES decryption operation.

Since we know what the output should be, retry for up to 1 second, giving the CPU internals time to deliver the result to DDR memory.

This is a somewhat annoying issue, and it only happens on a secured device, which makes it harder to debug.

So far this has not happened on any device other than the XCZU15EG. We have used several Xilinx MPSoC variants and they have been fine so far.

In practice a single retry with a 1 ms delay was enough for the result to become available in DDR memory for the sanity comparison.

In theory the same issue could occur in the forward (encryption) direction too -- but enough time is probably spent there already.

If someone figures out a better fix I would be very happy, as this only covers this particular case.

The existing code inadvertently used the correct length for the iv
buffer, but expressed it the wrong way.

Use sizeof(iv), as is done in the encrypt operation.

Signed-off-by: Vesa Jääskeläinen <[email protected]>
Cache operations should never go out of bounds of the allocated data, as
that can have inadvertent effects.

Also switch to the more generic cache_operation() function, which handles
both the L1 and L2 caches.

Signed-off-by: Vesa Jääskeläinen <[email protected]>
With the Xilinx Zynq MPSoC XCZU15EG it was noticed that HUK generation
failed, causing secure storage to not be operational.

The verification path did not actually complete: the destination DDR
memory had not yet received the result of the AES decryption operation.

Since we know what the output should be, retry for up to 1 second,
giving the CPU internals time to deliver the result to DDR memory.

Signed-off-by: Vesa Jääskeläinen <[email protected]>
}

/*
* Give CPU's internal fifos some time to transfer data to the
Contributor:

Isn't this the CSUDMA transfer to RAM that is incorrectly signaled as completed?

Contributor Author:

Since this is more or less a black box for us, we are somewhat guessing here :|

The current theory is that there are several internal FIFOs between the domains, and those "split" the bus ack so that the sender (CSUDMA) thinks the transfer is complete. Some internal timing on the larger chip (or similar) has changed, so the delay before the result arrives in DDR is slightly longer.

In our testing a single round, i.e. 1 ms, was enough, but we went for a larger retry count.

That being said -- other CSUDMA operations are subject to the same issue. I would really like to hear from the AMD/Xilinx folks what the right thing to do here is.

We had a similar issue with the FSBL, where we decided to disable CSUDMA usage altogether for the "memcpy" operation optimization. That happened on the smaller chips as well.

Contributor Author:

Btw. we have also observed a similar issue when transferring data from the FPGA to DDR: some data may take a moment to arrive after we get the IRQ, so we had to implement waiting logic with magic and counter checks to know that the data has fully arrived before starting to work with it. We concluded that this is normal operation. It could be the same FIFO issue in both cases.

Contributor Author:

I prepared a different kind of change that still needs testing on secured hardware -- but more or less I decided that 1 ms should be plenty for the internal FIFOs to work their way through, and I relocated the fixup into the CSU DMA driver:

diff --git a/core/drivers/zynqmp_csudma.c b/core/drivers/zynqmp_csudma.c
index 4e3b59c9c..e20a2798e 100644
--- a/core/drivers/zynqmp_csudma.c
+++ b/core/drivers/zynqmp_csudma.c
@@ -74,6 +74,28 @@ TEE_Result zynqmp_csudma_sync(enum zynqmp_csudma_channel channel)
                status = io_read32(dma + CSUDMA_I_STS_OFFSET);
                if ((status & CSUDMA_IXR_DONE_MASK) == CSUDMA_IXR_DONE_MASK) {
                        csudma_clear_intr(channel, CSUDMA_IXR_DONE_MASK);
+
+                       if (channel == ZYNQMP_CSUDMA_DST_CHANNEL) {
+                               /*
+                               * This extra delay is a workaround for a hardware
+                               * issue where the CSU-DMA transfer has not yet been
+                               * delivered to DDR memory.
+                               *
+                               * Even though there is now a signal that the transfer
+                               * has left the DMA engine, there are still other
+                               * CPU-internal FIFOs that need some time to move the
+                               * data to its final destination.
+                               *
+                               * A 1 millisecond delay should be more than enough to
+                               * transfer the remaining data.
+                               *
+                               * Technically this is only needed for a DDR destination,
+                               * but as operations in OP-TEE are few, the extra delay
+                               * for the crypto destination should not cause much of an
+                               * issue. The driver thereby "guarantees" that the data
+                               * is in the destination before returning to the caller.
+                               */
+                               mdelay(1);
+                       }
                        return TEE_SUCCESS;
                }
        }

What do you think -- should we actually panic() if HUK generation fails?

Contributor:

I think that it makes sense. As long as the CSUDMA is not used excessively on the OP-TEE side this should not impact performance. I agree that getting some feedback from Xilinx/AMD would be great here as well.

Contributor Author:

Unfortunately that alternative change did not work in the hardware tests.

Contributor:

I based the csudma driver on https://github.com/Xilinx/embeddedsw/tree/master/XilinxProcessorIPLib/drivers/csudma. Could you check if there is any fix there for the issue you are reporting?

Contributor Author:

I can check -- their CSU-DMA memcpy() has been broken for some releases; we have a patch to work around that.

Contributor Author:

Unfortunately that alternative change did not work in the hardware tests.

I will retry it together with the cache line fix, where the size had an insufficient number of bytes.

* If the transfer is not completed within 1 second then there
* is a real failure.
*/
if (retry_counter >= 1000) {
Contributor:

Suggestion: move this into the while() condition and check for errors by inspecting the retry counter afterwards. Does not need to be applied.

Contributor Author:

I am OK to change the construct.

Contributor:

Please could you extract the workaround section into its own function? It makes the code easier to understand.

@ldts (Contributor) left a comment:

Thanks. Yes, the assert at the start of the function takes care of the actual restriction (re 87ddd09).

if (channel == ZYNQMP_CSUDMA_DST_CHANNEL) {
dma = dma + CSUDMA_OFFSET_DIFF;
dcache_inv_range(addr, SHIFT_U64(len, CSUDMA_SIZE_SHIFT));
res = cache_operation(TEE_CACHEINVALIDATE, addr, len);
@ldts (Contributor), Dec 9, 2025:

are you sure this handles words and not bytes?

Contributor:

Also, please add some details -- this is not fixing anything, just introducing a regression.

Contributor Author:

With cache functions you should never go beyond the range of the expected bytes, or you can get random problems: you might inadvertently touch memory used by some other code and cause unexpected behavior there.

cache_operation() is used in many places in OP-TEE for these operations. The unit of the length is not specified as such, but all code references tend to assume a number of bytes.

dcache_inv_range() is no longer used directly in OP-TEE.

There is a helper allocator function, alloc_cache_aligned(), that would normally allocate cache-aligned memory, but here we have a variable size and alignment magic on the stack, so the cache line alignment happens internally.

Contributor:

> dcache_inv_range() is no longer being used directly in OP-TEE

Yes, it is. And I'd say that I tend to prefer it over cache_operation().

Contributor:

@vesajaaskelainen I don't see what magic you are referring to, and I don't think anything needs fixing here. The alignment is checked in the function for these page sizes.

@vesajaaskelainen (Contributor Author), Dec 10, 2025:

dcache_inv_range() takes its size in bytes.

Now the size is given as SHIFT_U64(len, CSUDMA_SIZE_SHIFT), where:

#define CSUDMA_SIZE_SHIFT          2

which is the same as multiplying by 4.

But what I did not spot on the first round was:

	len = len / sizeof(uint32_t);

So sorry about that.

To give the full picture, the original code:

	/* convert to 32 bit word transfers */
	len = len / sizeof(uint32_t);

	if (channel == ZYNQMP_CSUDMA_DST_CHANNEL) {
		dma = dma + CSUDMA_OFFSET_DIFF;
		dcache_inv_range(addr, SHIFT_U64(len, CSUDMA_SIZE_SHIFT));
	} else {
		dcache_clean_range(addr, SHIFT_U64(len, CSUDMA_SIZE_SHIFT));
	}

where sizeof(uint32_t) == 4, so the two conversions cancel each other out.

Let me test the msleep(1) workaround again with the point above in mind.

Contributor Author:

And the magic that I referred to is:

	uint8_t dst[ZYNQMP_CSU_AES_DST_LEN(sizeof(src))]
		__aligned_csuaes = { 0 };

// where:
#define __aligned_csuaes	__aligned_csudma
#define ZYNQMP_CSUDMA_ALIGN			64
#define __aligned_csudma			__aligned(ZYNQMP_CSUDMA_ALIGN)

With __aligned() we align the memory to start on a boundary matching the CSUDMA alignment requirement.

Contributor Author:

> dcache_inv_range() is no longer being used directly in OP-TEE

> Yes, it is. And I'd say that I tend to prefer it over cache_operation().

I was looking with git grep:

$ git grep dcache_inv_range
core/arch/arm/kernel/cache_helpers_a32.S:FUNC dcache_inv_range , :
core/arch/arm/kernel/cache_helpers_a32.S:END_FUNC dcache_inv_range
core/arch/arm/kernel/cache_helpers_a64.S:FUNC dcache_inv_range , :
core/arch/arm/kernel/cache_helpers_a64.S:END_FUNC dcache_inv_range
core/arch/arm/kernel/entry_a32.S:               bl      dcache_inv_range
core/arch/arm/mm/core_mmu.c:            dcache_inv_range(va, len);
core/arch/riscv/kernel/cache_helpers_rv.S:/* void dcache_inv_range(void *addr, size_t size); */
core/arch/riscv/kernel/cache_helpers_rv.S:FUNC dcache_inv_range , :
core/arch/riscv/kernel/cache_helpers_rv.S:END_FUNC dcache_inv_range
core/arch/riscv/mm/core_mmu_arch.c:             dcache_inv_range(va, len);
core/drivers/zynqmp_csudma.c:           dcache_inv_range(addr, SHIFT_U64(len, CSUDMA_SIZE_SHIFT));
core/include/kernel/cache_helpers.h:void dcache_inv_range(void *addr, size_t size);

And the comparison:

$ git grep cache_operation | wc -l
150

It is used a lot in drivers.

This made me believe that cache_operation() is the "public API" and the way to go.

Contributor Author:

Please correct me if I am wrong, but doesn't dcache_inv_range() operate on the CPU's L1 cache only?

If we look at the Zynq UltraScale+ Device TRM:

From the CSU processor to DDR:

  • IOP Outbound
  • LPD Main Switch
  • CCI
  • DDR Memory Controller

From A53 core X to DDR:

  • 32K L1 cache (for A53 core X)
  • SCU
  • 1MB L2 cache (shared by all A53 cores)
  • CCI
  • DDR Memory Controller

The shared parts here are the CCI and DDR. If the CCI knows that everything is correct, that should be good enough -- I don't see the need for any CCI operations.

The SCU should help for requests that it sees, but the CSU processor flow does not go through it.

But what we do have is the L2 cache, which is handled by cache_operation().

Contributor:

> Please correct me if I am wrong, but doesn't dcache_inv_range() operate on the CPU's L1 cache only?

It operates on all internal caches up to the Point of Coherency.

But I was a bit quick to dismiss cache_operation(); it is the right thing to use from a driver. cache_operation() also handles an eventual external cache.

The dcache_*() functions are sometimes more precise, so there are situations where they are needed.

@vesajaaskelainen (Contributor Author):

I will lose access to the secure hardware during the holiday season, so I'll continue working on this PR after the holidays. Sorry for the delay.
