Optimization Issue in lv_memcpy Due to Macro Usage #7406

1saeed · 2024-12-04T00:06:37Z

LVGL version

v9.2,2

Platform

STM32H755 Platform
GCC (-O3)
-ffunction-sections
-fdata-sections

What happened?

The current implementation of lv_memcpy uses macros like _COPY and _REPEAT8, which can lead to undefined or unintended behavior under high compiler optimization levels (-O3). This results in hard faults in some cases.

Here’s the refactored version of the function using inline functions:

#include <stdint.h>
#include <stddef.h>

#define _ALIGN_MASK (sizeof(uint32_t) - 1)

/* Inline function to copy a single byte */
static inline void copy_byte(volatile uint8_t **dst, const uint8_t **src) {
    **dst = **src;
    (*dst)++;
    (*src)++;
}

/* Inline function to copy a single 32-bit word */
static inline void copy_word(volatile uint32_t **dst, const uint32_t **src) {
    **dst = **src;
    (*dst)++;
    (*src)++;
}

/* Inline function for repeated byte copying */
static inline void repeat_byte_copy(volatile uint8_t **dst, const uint8_t **src, size_t times) {
    for (size_t i = 0; i < times; i++) {
        copy_byte(dst, src);
    }
}

/* Inline function for repeated word copying */
static inline void repeat_word_copy(volatile uint32_t **dst, const uint32_t **src, size_t times) {
    for (size_t i = 0; i < times; i++) {
        copy_word(dst, src);
    }
}

void lv_memcpy(void *dst, const void *src, size_t len) {
    volatile uint8_t *d8 = (volatile uint8_t *)dst;
    const uint8_t *s8 = (const uint8_t *)src;

    /* Simplify for small memories */
    if (len < 16) {
        while (len) {
            copy_byte(&d8, &s8);
            len--;
        }
        return;
    }

    uintptr_t d_align = (uintptr_t)d8 & _ALIGN_MASK;
    uintptr_t s_align = (uintptr_t)s8 & _ALIGN_MASK;

    /* Byte copy for unaligned memories */
    if (s_align != d_align) {
        while (len > 32) {
            repeat_byte_copy(&d8, &s8, 32);
            len -= 32;
        }
        while (len) {
            copy_byte(&d8, &s8);
            len--;
        }
        return;
    }

    /* Align the memories */
    if (d_align) {
        d_align = _ALIGN_MASK + 1 - d_align;
        while (d_align && len) {
            copy_byte(&d8, &s8);
            d_align--;
            len--;
        }
    }

    volatile uint32_t *d32 = (volatile uint32_t *)d8;
    const uint32_t *s32 = (const uint32_t *)s8;

    while (len > 32) {
        repeat_word_copy(&d32, &s32, 8);
        len -= 32;
    }

    d8 = (volatile uint8_t *)d32;
    s8 = (const uint8_t *)s32;
    while (len) {
        copy_byte(&d8, &s8);
        len--;
    }
}

How to reproduce?

No response

The text was updated successfully, but these errors were encountered:

liamHowatt · 2024-12-04T08:57:07Z

Are you able to explain which part of the current implementation is undefined behavior?

uLipe · 2024-12-04T20:36:42Z

The idea of using inline code above is good and it is even more expressive than the current builtin option, the problem here is the behavior change across different compilers, using inline keyword will serve as a hint, but will not enforces the function inline behavior from the compiler, some platforms have methods to force the function to be inline, but it is not C standard.

On the same direction from @liamHowatt could you point which kind of issue did you find with the current approach? Something related to the address alignment?

1saeed · 2024-12-05T02:06:11Z

@liamHowatt @uLipe . Thanks a lot for your replies. So, the code switches between uint8_t* and uint32_t* pointers for optimization, but O3 might make aggressive assumptions about pointer aliasing that could break the alignment handling logic. The compiler might incorrectly assume these pointers can't alias to the same memory. Also, The manual loop unrolling (using that _REPEAT8 macro) could conflict with O3's auto-vectorization. The compiler might try to further optimize these already-unrolled loops, potentially causing cache-unfriendly access patterns or excessive register pressure. And last one is that O3 might reorder memory operations since there aren't explicit barriers. The _COPY macro (which does *d = *s; d++; s++;) could have its loads/stores reordered in ways we don't want.

So, I'm guessing that what I'm having is that O3's aggressive optimizations might not properly respect my platform's alignment requirements. Given the strict aliasing, when we cast between uint8_t* and uint32_t*, we're technically violating this rule.

The compiler sees you writing through a uint32_t*. Later, it sees you reading through a uint8_t*. Under strict aliasing, the compiler knows these pointers can't point to the same memory. So it might cache the uint32_t value in a register and not reload it when you read through the uint8_t pointer.

About the solution, that's a fair assumption you mentioned. One thing we can do is adding volatile qualifiers or memory barriers at critical points.

volatile uint8_t* d8 = dst;
volatile const uint8_t* s8 = src;

Another alternative would be to use proper type punning:

union ptr_cast {
    uint8_t* p8;
    uint32_t* p32;
};

// Then use it like:
union ptr_cast d;
d.p8 = d8;
uint32_t* d32 = d.p32;

onecoolx · 2024-12-05T02:23:24Z

@liamHowatt @uLipe . Thanks a lot for your replies. So, the code switches between uint8_t* and uint32_t* pointers for optimization, but O3 might make aggressive assumptions about pointer aliasing that could break the alignment handling logic. The compiler might incorrectly assume these pointers can't alias to the same memory. Also, The manual loop unrolling (using that _REPEAT8 macro) could conflict with O3's auto-vectorization. The compiler might try to further optimize these already-unrolled loops, potentially causing cache-unfriendly access patterns or excessive register pressure. And last one is that O3 might reorder memory operations since there aren't explicit barriers. The _COPY macro (which does *d = *s; d++; s++;) could have its loads/stores reordered in ways we don't want.

So, I'm guessing that what I'm having is that O3's aggressive optimizations might not properly respect my platform's alignment requirements.

About the solution, that's a fair assumption you mentioned. One thing we can do is adding volatile qualifiers or memory barriers at critical points. It also fixed the issue on my end.
volatile uint8_t* d8 = dst;
volatile const uint8_t* s8 = src;

Your problem can be solved by adding the GCC compiler option "-fno-strict-aliasing"?

1saeed · 2024-12-05T04:32:44Z

@liamHowatt @uLipe . Thanks a lot for your replies. So, the code switches between uint8_t* and uint32_t* pointers for optimization, but O3 might make aggressive assumptions about pointer aliasing that could break the alignment handling logic. The compiler might incorrectly assume these pointers can't alias to the same memory. Also, The manual loop unrolling (using that _REPEAT8 macro) could conflict with O3's auto-vectorization. The compiler might try to further optimize these already-unrolled loops, potentially causing cache-unfriendly access patterns or excessive register pressure. And last one is that O3 might reorder memory operations since there aren't explicit barriers. The _COPY macro (which does *d = *s; d++; s++;) could have its loads/stores reordered in ways we don't want.

So, I'm guessing that what I'm having is that O3's aggressive optimizations might not properly respect my platform's alignment requirements.

About the solution, that's a fair assumption you mentioned. One thing we can do is adding volatile qualifiers or memory barriers at critical points. It also fixed the issue on my end.
volatile uint8_t* d8 = dst;

volatile const uint8_t* s8 = src;
Your problem can be solved by adding the GCC compiler option "-fno-strict-aliasing"?

The -fstrict-aliasing option is enabled at levels -O2, -O3, -Os.

However, code that only works with GCC with -fno-strict-aliasing is not correct C.

liamHowatt · 2024-12-10T10:55:30Z

Regarding the pointer aliasing assumption issue, it only dereferences a uint32_t * if it's aligned. Are you saying the aggressive optimization will generate machine code that eagerly accesses unaligned memory as though it's aligned, before it confirms it's aligned?

kisvegabor · 2024-12-18T09:48:10Z

@1saeed Thank you for the detailed reply.

Have you really experienced a crash because of it or is it more like a theoretical issue?

These functions might heavily affect the performance so I wonder if you have checked if there is no regression.

1saeed · 2024-12-20T22:09:44Z

Hello Gabor @kisvegabor. Sorry for the late reply! Had a super busy week.

I can confirm that I’ve replicated the issue with unaligned pointers, external SDRAM memory , cache being enable and O3 optimization on my platform (STM23H7)

uint8_t* src = (uint8_t*)0xD0000001;
uint8_t* dst = (uint8_t*)0xD0000063;
uint32_t N = 99;

lv_memcpy(dst, src, N);

Using volatile here would save it:

volatile uint8_t * d8 = dst;

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimization Issue in lv_memcpy Due to Macro Usage #7406

Optimization Issue in lv_memcpy Due to Macro Usage #7406

1saeed commented Dec 4, 2024

liamHowatt commented Dec 4, 2024

uLipe commented Dec 4, 2024

1saeed commented Dec 5, 2024 •

edited

Loading

onecoolx commented Dec 5, 2024 •

edited

Loading

1saeed commented Dec 5, 2024

liamHowatt commented Dec 10, 2024

kisvegabor commented Dec 18, 2024

1saeed commented Dec 20, 2024 •

edited

Loading

Optimization Issue in lv_memcpy Due to Macro Usage #7406

Optimization Issue in lv_memcpy Due to Macro Usage #7406

Comments

1saeed commented Dec 4, 2024

LVGL version

Platform

What happened?

How to reproduce?

liamHowatt commented Dec 4, 2024

uLipe commented Dec 4, 2024

1saeed commented Dec 5, 2024 • edited Loading

onecoolx commented Dec 5, 2024 • edited Loading

1saeed commented Dec 5, 2024

liamHowatt commented Dec 10, 2024

kisvegabor commented Dec 18, 2024

1saeed commented Dec 20, 2024 • edited Loading

1saeed commented Dec 5, 2024 •

edited

Loading

onecoolx commented Dec 5, 2024 •

edited

Loading

1saeed commented Dec 20, 2024 •

edited

Loading