Xilinx ZynqMP CSUDMA problem workaround #7603
base: master
Conversation
The existing code inadvertently used the same length for the iv buffer, but it was computed by the wrong means. Use sizeof(iv), as was done in the encrypt operation. Signed-off-by: Vesa Jääskeläinen <[email protected]>
Cache operations should never go out of bounds of the allocated data, as that might have inadvertent effects. Also change to the more generic cache_operation() function, which handles both L1 and L2 caches. Signed-off-by: Vesa Jääskeläinen <[email protected]>
With the Xilinx Zynq MPSoC XCZU15EG it was noticed that HUK generation failed, causing secure storage to not be operational. The data on the verification path was not complete, in the sense that the destination DDR memory had not yet received the result of the AES decryption operation. Since we know what the output should be, retry in a loop for up to 1 second to give the CPU internals time to deliver the result to DDR memory. Signed-off-by: Vesa Jääskeläinen <[email protected]>
	}

	/*
	 * Give CPU's internal fifos some time to transfer data to the
Isn't this the CSUDMA transfer to RAM that is incorrectly signaled as completed?
Since this is more or less a black box for us, we are somewhat guessing here :|
The current theory is that there are several internal FIFOs between the domains, and those "split" the bus ack so that the sender (CSUDMA) thinks the transfer is complete. Some internal timing on the larger chip (or so) has changed, so the delay is now slightly longer before the result arrives in DDR.
In our testing one retry round, i.e. 1 ms, was enough, but we went for a larger retry count.
That being said -- other operations on the CSUDMA are subject to the same issue. Would really like to hear from AMD/Xilinx folks what the right thing to do here is.
We had a similar issue with FSBL, where we decided to disable CSUDMA usage altogether for the "memcpy" operation optimization. That happened on smaller chips as well.
Btw. we have also observed a similar issue when transferring data from FPGA to DDR: some data may take a bit longer to arrive after we get the IRQ, so we had to implement some waiting logic with magic and counter checks to know that the data has finally arrived in full before starting to work with it. We concluded this was normal operation. It could be the same FIFO issue in both cases.
I prepared a different kind of change that still needs testing on secured hardware -- but more or less I decided that 1 ms should be plenty for the internal FIFOs to work their way through -- and I relocated the fixup into the CSU DMA driver:
diff --git a/core/drivers/zynqmp_csudma.c b/core/drivers/zynqmp_csudma.c
index 4e3b59c9c..e20a2798e 100644
--- a/core/drivers/zynqmp_csudma.c
+++ b/core/drivers/zynqmp_csudma.c
@@ -74,6 +74,28 @@ TEE_Result zynqmp_csudma_sync(enum zynqmp_csudma_channel channel)
status = io_read32(dma + CSUDMA_I_STS_OFFSET);
if ((status & CSUDMA_IXR_DONE_MASK) == CSUDMA_IXR_DONE_MASK) {
csudma_clear_intr(channel, CSUDMA_IXR_DONE_MASK);
+
+ if (channel == ZYNQMP_CSUDMA_DST_CHANNEL) {
+ /*
+ * This retry delay is a workaround for a hardware
+ * issue where the CSU-DMA transfer is not yet delivered to DDR
+ * memory.
+ *
+ * Even though there is now a signal that the transfer has left the
+ * DMA engine, there are still other CPU-internal FIFOs that
+ * need some time to transfer data to the final destination.
+ *
+ * A 1 millisecond delay should be more than enough to transfer
+ * the remaining data.
+ *
+ * Technically this should only be needed for a DDR destination,
+ * but as operations in OP-TEE are few, the extra delay for a
+ * crypto destination should not cause much of an issue. It is
+ * the driver here that "guarantees" that the data is in the
+ * destination before returning to the caller.
+ */
+ mdelay(1);
+ }
return TEE_SUCCESS;
}
}

What do you think -- should we actually panic() if HUK generation fails?
I think that it makes sense. As long as the CSUDMA is not used excessively on the OP-TEE side this should not impact performance. I agree that getting some feedback from Xilinx/AMD would be great here as well.
Unfortunately that alternative change did not work in the hardware tests.
I based the csudma driver on https://github.com/Xilinx/embeddedsw/tree/master/XilinxProcessorIPLib/drivers/csudma . Could you check if there is any fix for the issue you are reporting there?
I can check -- their CSU-DMA memcpy() has been broken for some releases -- we have a patch to work around that.
Unfortunately that alternative change did not work in the hardware tests.
Will retry it together with the cache line fix, which previously passed an insufficient number of bytes as the size.
 * If the transfer is not completed within 1 second then there
 * is a real failure.
 */
if (retry_counter >= 1000) {
Suggestion: move this to while() condition and check for errors via checking the retry counter afterwards. Does not need to be applied.
I am OK to change the construct.
Please could you extract the workaround section into a workaround function? It makes it easier to understand.
Thanks. Yes, the assert at the start of the function takes care of the actual restriction (re 87ddd09).
 if (channel == ZYNQMP_CSUDMA_DST_CHANNEL) {
 	dma = dma + CSUDMA_OFFSET_DIFF;
-	dcache_inv_range(addr, SHIFT_U64(len, CSUDMA_SIZE_SHIFT));
+	res = cache_operation(TEE_CACHEINVALIDATE, addr, len);
are you sure this handles words and not bytes?
also please add some details - this is not fixing anything, just introducing a regression.
With cache functions you should never go beyond the range of the expected bytes, or you can get random problems: you might inadvertently reference memory that is used by some other code and cause unexpected behavior for it.
cache_operation() is used in many places in OP-TEE to do these operations. The unit of length is not specified as such, but all code references tend to assume a number of bytes.
dcache_inv_range() is no longer being used directly in OP-TEE.
There is a helper allocator function alloc_cache_aligned() that would normally allocate cache-aligned memory, but here we have variable size and alignment magic on the stack, so cache line alignment happens internally.
dcache_inv_range() is no longer being used directly in OP-TEE
Yes, it is. And I'd say that I tend to prefer it over cache_operation().
@vesajaaskelainen I don't see what magic you are referring to and I don't think anything needs fixing here. The alignment is checked in the function for these page sizes.
dcache_inv_range() takes size in bytes.
Now the size is given as = SHIFT_U64(len, CSUDMA_SIZE_SHIFT).
Where:
#define CSUDMA_SIZE_SHIFT 2
Which is the same as multiplying by 4.
But what I did not spot on the first round was:
len = len / sizeof(uint32_t);
So sorry about that.
To get full picture original code:
/* convert to 32 bit word transfers */
len = len / sizeof(uint32_t);
if (channel == ZYNQMP_CSUDMA_DST_CHANNEL) {
dma = dma + CSUDMA_OFFSET_DIFF;
dcache_inv_range(addr, SHIFT_U64(len, CSUDMA_SIZE_SHIFT));
} else {
dcache_clean_range(addr, SHIFT_U64(len, CSUDMA_SIZE_SHIFT));
}
Where sizeof(uint32_t) == 4
So we have arithmetic operations that cancel each other out.
Let me test the msleep(1) workaround again, taking into account what was pointed out.
And the magic that I referred to is:
uint8_t dst[ZYNQMP_CSU_AES_DST_LEN(sizeof(src))]
__aligned_csuaes = { 0 };
// where:
#define __aligned_csuaes __aligned_csudma
#define ZYNQMP_CSUDMA_ALIGN 64
#define __aligned_csudma __aligned(ZYNQMP_CSUDMA_ALIGN)
With __aligned() we align the memory to start on a boundary that meets the CSUDMA alignment requirement.
dcache_inv_range() is no longer being used directly in OP-TEE
Yes, it is. And I'd say that I tend to prefer it over
cache_operation().
I was looking with git grep:
$ git grep dcache_inv_range
core/arch/arm/kernel/cache_helpers_a32.S:FUNC dcache_inv_range , :
core/arch/arm/kernel/cache_helpers_a32.S:END_FUNC dcache_inv_range
core/arch/arm/kernel/cache_helpers_a64.S:FUNC dcache_inv_range , :
core/arch/arm/kernel/cache_helpers_a64.S:END_FUNC dcache_inv_range
core/arch/arm/kernel/entry_a32.S: bl dcache_inv_range
core/arch/arm/mm/core_mmu.c: dcache_inv_range(va, len);
core/arch/riscv/kernel/cache_helpers_rv.S:/* void dcache_inv_range(void *addr, size_t size); */
core/arch/riscv/kernel/cache_helpers_rv.S:FUNC dcache_inv_range , :
core/arch/riscv/kernel/cache_helpers_rv.S:END_FUNC dcache_inv_range
core/arch/riscv/mm/core_mmu_arch.c: dcache_inv_range(va, len);
core/drivers/zynqmp_csudma.c: dcache_inv_range(addr, SHIFT_U64(len, CSUDMA_SIZE_SHIFT));
core/include/kernel/cache_helpers.h:void dcache_inv_range(void *addr, size_t size);
And the comparison:
$ git grep cache_operation | wc -l
150
It is used a lot in drivers.
This made me believe that cache_operation() is the "public API" and the way to go.
Please correct me if I am wrong but isn't dcache_inv_range() working in CPU's L1 cache only?
If we look at Zynq UltraScale+ Device TRM:
From CSU Processor -> DDR:
- IOP Outbound
- LPD Main Switch
- CCI
- DDR Memory Controller
From A53 Core X -> DDR:
- 32K L1 cache (for A53 Core X)
- SCU
- 1MB L2 cache (for all A53 cores)
- CCI
- DDR Memory Controller
The shared parts here are the CCI and DDR. If the CCI just knows that everything is correct, it should be good enough -- I don't see the need to do any CCI operations.
The SCU should help for requests that it sees, but the CSU Processor flow does not go through it.
But what we do have is the L2 cache, which is handled by cache_operation().
Please correct me if I am wrong but isn't dcache_inv_range() working in CPU's L1 cache only?
It operates on all internal caches to the Point of Coherency.
But I was a bit quick to dismiss cache_operation(); it's the right thing to use from a driver. cache_operation() also handles any external cache, if present.
The dcache_*() functions are sometimes more precise, so there are situations where those are needed.
I will lose access to the secure hardware during the holiday season, so I'll continue working on this PR after the holidays. Sorry for the delay.
With the Xilinx Zynq MPSoC XCZU15EG it was noticed that HUK generation failed, causing secure storage to not be operational.
The data on the verification path was not complete, in the sense that the destination DDR memory had not yet received the result of the AES decryption operation.
Since we know what the output should be, retry in a loop for up to 1 second to give the CPU internals time to deliver the result to DDR memory.
This is a somewhat annoying issue, and it only happens on a secure device, which makes it a bit harder to debug.
So far this has not happened on any device other than the XCZU15EG. We have used several Xilinx MPSoC variants and they seem to be happy so far.
In practice, one retry iteration with a 1 ms delay was enough for the result to be available in DDR memory for the sanity comparison.
In theory the same issue could exist in the forward direction, on the encryption operation, too -- but probably enough time is spent there.
If someone figures out a better fix I would be very happy, as this one only covers this particular case.