Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimization Issue in lv_memcpy Due to Macro Usage #7406

Open
1saeed opened this issue Dec 4, 2024 · 8 comments
Open

Optimization Issue in lv_memcpy Due to Macro Usage #7406

1saeed opened this issue Dec 4, 2024 · 8 comments

Comments

@1saeed
Copy link

1saeed commented Dec 4, 2024

LVGL version

v9.2,2

Platform

STM32H755 Platform
GCC (-O3)
-ffunction-sections
-fdata-sections

What happened?

The current implementation of lv_memcpy uses macros like _COPY and _REPEAT8, which can lead to undefined or unintended behavior under high compiler optimization levels (-O3). This results in hard faults in some cases.

Here’s the refactored version of the function using inline functions:

#include <stdint.h>
#include <stddef.h>

#define _ALIGN_MASK (sizeof(uint32_t) - 1)

/* Inline function to copy a single byte */
static inline void copy_byte(volatile uint8_t **dst, const uint8_t **src) {
    **dst = **src;
    (*dst)++;
    (*src)++;
}

/* Inline function to copy a single 32-bit word */
static inline void copy_word(volatile uint32_t **dst, const uint32_t **src) {
    **dst = **src;
    (*dst)++;
    (*src)++;
}

/* Inline function for repeated byte copying */
static inline void repeat_byte_copy(volatile uint8_t **dst, const uint8_t **src, size_t times) {
    for (size_t i = 0; i < times; i++) {
        copy_byte(dst, src);
    }
}

/* Inline function for repeated word copying */
static inline void repeat_word_copy(volatile uint32_t **dst, const uint32_t **src, size_t times) {
    for (size_t i = 0; i < times; i++) {
        copy_word(dst, src);
    }
}

void lv_memcpy(void *dst, const void *src, size_t len) {
    volatile uint8_t *d8 = (volatile uint8_t *)dst;
    const uint8_t *s8 = (const uint8_t *)src;

    /* Simplify for small memories */
    if (len < 16) {
        while (len) {
            copy_byte(&d8, &s8);
            len--;
        }
        return;
    }

    uintptr_t d_align = (uintptr_t)d8 & _ALIGN_MASK;
    uintptr_t s_align = (uintptr_t)s8 & _ALIGN_MASK;

    /* Byte copy for unaligned memories */
    if (s_align != d_align) {
        while (len > 32) {
            repeat_byte_copy(&d8, &s8, 32);
            len -= 32;
        }
        while (len) {
            copy_byte(&d8, &s8);
            len--;
        }
        return;
    }

    /* Align the memories */
    if (d_align) {
        d_align = _ALIGN_MASK + 1 - d_align;
        while (d_align && len) {
            copy_byte(&d8, &s8);
            d_align--;
            len--;
        }
    }

    volatile uint32_t *d32 = (volatile uint32_t *)d8;
    const uint32_t *s32 = (const uint32_t *)s8;

    while (len > 32) {
        repeat_word_copy(&d32, &s32, 8);
        len -= 32;
    }

    d8 = (volatile uint8_t *)d32;
    s8 = (const uint8_t *)s32;
    while (len) {
        copy_byte(&d8, &s8);
        len--;
    }
}

How to reproduce?

No response

@liamHowatt
Copy link
Collaborator

Are you able to explain which part of the current implementation is undefined behavior?

@uLipe
Copy link
Collaborator

uLipe commented Dec 4, 2024

The idea of using inline code above is good and it is even more expressive than the current builtin option, the problem here is the behavior change across different compilers, using inline keyword will serve as a hint, but will not enforces the function inline behavior from the compiler, some platforms have methods to force the function to be inline, but it is not C standard.

On the same direction from @liamHowatt could you point which kind of issue did you find with the current approach? Something related to the address alignment?

@1saeed
Copy link
Author

1saeed commented Dec 5, 2024

@liamHowatt @uLipe . Thanks a lot for your replies. So, the code switches between uint8_t* and uint32_t* pointers for optimization, but O3 might make aggressive assumptions about pointer aliasing that could break the alignment handling logic. The compiler might incorrectly assume these pointers can't alias to the same memory. Also, The manual loop unrolling (using that _REPEAT8 macro) could conflict with O3's auto-vectorization. The compiler might try to further optimize these already-unrolled loops, potentially causing cache-unfriendly access patterns or excessive register pressure. And last one is that O3 might reorder memory operations since there aren't explicit barriers. The _COPY macro (which does *d = *s; d++; s++;) could have its loads/stores reordered in ways we don't want.

So, I'm guessing that what I'm having is that O3's aggressive optimizations might not properly respect my platform's alignment requirements. Given the strict aliasing, when we cast between uint8_t* and uint32_t*, we're technically violating this rule.

The compiler sees you writing through a uint32_t*. Later, it sees you reading through a uint8_t*. Under strict aliasing, the compiler knows these pointers can't point to the same memory. So it might cache the uint32_t value in a register and not reload it when you read through the uint8_t pointer.

About the solution, that's a fair assumption you mentioned. One thing we can do is adding volatile qualifiers or memory barriers at critical points.

volatile uint8_t* d8 = dst;
volatile const uint8_t* s8 = src;

Another alternative would be to use proper type punning:

union ptr_cast {
    uint8_t* p8;
    uint32_t* p32;
};

// Then use it like:
union ptr_cast d;
d.p8 = d8;
uint32_t* d32 = d.p32;

@onecoolx
Copy link
Contributor

onecoolx commented Dec 5, 2024

@liamHowatt @uLipe . Thanks a lot for your replies. So, the code switches between uint8_t* and uint32_t* pointers for optimization, but O3 might make aggressive assumptions about pointer aliasing that could break the alignment handling logic. The compiler might incorrectly assume these pointers can't alias to the same memory. Also, The manual loop unrolling (using that _REPEAT8 macro) could conflict with O3's auto-vectorization. The compiler might try to further optimize these already-unrolled loops, potentially causing cache-unfriendly access patterns or excessive register pressure. And last one is that O3 might reorder memory operations since there aren't explicit barriers. The _COPY macro (which does *d = *s; d++; s++;) could have its loads/stores reordered in ways we don't want.

So, I'm guessing that what I'm having is that O3's aggressive optimizations might not properly respect my platform's alignment requirements.

About the solution, that's a fair assumption you mentioned. One thing we can do is adding volatile qualifiers or memory barriers at critical points. It also fixed the issue on my end.

volatile uint8_t* d8 = dst;
volatile const uint8_t* s8 = src;

Your problem can be solved by adding the GCC compiler option "-fno-strict-aliasing"?

@1saeed
Copy link
Author

1saeed commented Dec 5, 2024

@liamHowatt @uLipe . Thanks a lot for your replies. So, the code switches between uint8_t* and uint32_t* pointers for optimization, but O3 might make aggressive assumptions about pointer aliasing that could break the alignment handling logic. The compiler might incorrectly assume these pointers can't alias to the same memory. Also, The manual loop unrolling (using that _REPEAT8 macro) could conflict with O3's auto-vectorization. The compiler might try to further optimize these already-unrolled loops, potentially causing cache-unfriendly access patterns or excessive register pressure. And last one is that O3 might reorder memory operations since there aren't explicit barriers. The _COPY macro (which does *d = *s; d++; s++;) could have its loads/stores reordered in ways we don't want.

So, I'm guessing that what I'm having is that O3's aggressive optimizations might not properly respect my platform's alignment requirements.

About the solution, that's a fair assumption you mentioned. One thing we can do is adding volatile qualifiers or memory barriers at critical points. It also fixed the issue on my end.

volatile uint8_t* d8 = dst;

volatile const uint8_t* s8 = src;

Your problem can be solved by adding the GCC compiler option "-fno-strict-aliasing"?

The -fstrict-aliasing option is enabled at levels -O2, -O3, -Os.

However, code that only works with GCC with -fno-strict-aliasing is not correct C.

@liamHowatt
Copy link
Collaborator

Regarding the pointer aliasing assumption issue, it only dereferences a uint32_t * if it's aligned. Are you saying the aggressive optimization will generate machine code that eagerly accesses unaligned memory as though it's aligned, before it confirms it's aligned?

@kisvegabor
Copy link
Member

@1saeed Thank you for the detailed reply.

Have you really experienced a crash because of it or is it more like a theoretical issue?

These functions might heavily affect the performance so I wonder if you have checked if there is no regression.

@1saeed
Copy link
Author

1saeed commented Dec 20, 2024

Hello Gabor @kisvegabor. Sorry for the late reply! Had a super busy week.

I can confirm that I’ve replicated the issue with unaligned pointers, external SDRAM memory , cache being enable and O3 optimization on my platform (STM23H7)

uint8_t* src = (uint8_t*)0xD0000001;
uint8_t* dst = (uint8_t*)0xD0000063;
uint32_t N = 99;

lv_memcpy(dst, src, N);

Using volatile here would save it:

volatile uint8_t * d8 = dst;

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants