Fix issue with only max 20 XML sitemap files generated on cron jobs #161

dkarvounaris · 2023-03-15T14:26:36Z

Questions	Answers
Description?	Full description below
Type?	bug fix
BC breaks?	no
Deprecations?	no
Fixed ticket?	n/a
How to test?	To thoroughly test the sitemap generation process, it is necessary to have a sufficient number of products, images, and other pages or content. This will allow the generation of at least 20 sitemaps, which can then be used for testing purposes. The number of products required is not fixed and will depend on factors such as the number of images per product. For instance, for some shops it may be 5,000 products, for others 20,000 products. During the sitemap generation process, if a memory limit is exceeded (a specific check exists in gsitemap.php), the process may also move already to the next file with even less links processed. Currently, our shop has 200,000 products and 2.1 million images, resulting in the generation of 80 sitemap files. The size of these files varies between 2MB and 30MB. Prior to the patch, the generation process would always stop at the 20th file. However, after the patch, the sitemap generation process will continue until all available links are included in sitemap files, with the entire list of sitemap files being contained in the main `1_index_sitemap.xml` file. To run the test, execute the sitemap generation with the link for cron jobs given by the module on the CLI, as example: `/usr/bin/curl -L --max-redirs 99 "http://shop.com/modules/gsitemap/gsitemap-cron.php?token=1234..."` and check that sitemap generation adds files past `1_en_20_sitemap.xml`.

Why this pull request?

When generating a sitemap using cron, a CLI tool like curl is typically employed to send an HTTP request to gsitemap-cron.php. For large sites with tens or hundreds of thousands of products, the module divides the sitemap files by size. However, the generation process halts after creating the 20th sitemap XML file, preventing the inclusion of all products in the sitemap. This pull request resolves this issue, enabling sitemap generation to proceed beyond the initial 20 files and ensuring that all products are added to the sitemap.

What is the root of the issue?

The original code evaluates every 20 files generated and, rather than simply redirecting back to the module for the generation of the subsequent sitemap XML file, which would work, it sends an HTTP header "Refresh: 5" instead of an HTTP redirect.

This header lacks documentation (note that this refers to an HTTP header, not the meta refresh tag, which however is deprecated too) and is absent from the HTTP standard. The Refresh HTTP header is a non-standard implementation from the Netscape era, which has only been maintained by browsers. Most, if not all, non-browser HTTP toolkits, such as curl, do not implement it. As a result, it is not guaranteed that using the Refresh HTTP header with tools other than browsers, which is almost always the case with cron jobs, will successfully complete the sitemap generation for very large shops. Rather the opposite, generation will stop with an unrecoverable error.

Sources:
curl/curl#3657
https://daniel.haxx.se/blog/2019/03/12/looking-for-the-refresh-header/
request/request#92
https://chromium-review.googlesource.com/c/chromium/src/+/1520146 (a possible intention to remove support even in chrome, because non-standard)

Moreover, there is no need or advantage to using this HTTP header for every 20th cron job redirect, as a simple redirect functions as intended, and the refresh – even if it were to work – serves no purpose.

Changes introduced by the pull request

There are no deprecations or backward compatibility breaks, and no new or otherwise modified code that could potentially disrupt existing functionality.

The only additional change involves simplifying the else statements and removing superfluous exit() statements for improved clarity and reduced complexity, as seen in this commit: dkarvounaris@a504413?diff=split&w=1

Furthermore, I have updated the documentation to include information on the possible need of adjusting the number of redirects followed by the tool used in conjunction with cron jobs. This is particularly useful for very large shops which may require more redirects than the tool's default redirect limit. Something which can go otherwise unnoticed and may be difficult to debug.

…t tools (as curl)

…ty. All paths either return or exit() through class Tools

…tools for large shops in cron jobs

dkarvounaris · 2023-03-15T20:08:33Z

Just to clarify, the bugfix has been in production for several weeks now, although with the latest module release. I have now deployed the development branch containing the pull request to the server for testing, and it is still functioning as intended. 74 required sitemaps have been generated successfully with gsitemap-cron.php.

kpodemski

Thanks @dkarvounaris! LGTM!

I updated your PR to include the table required for our processes. Could you please write a short "How to test"? Something like "Have X products in your store, run CRON from CLI, you should have X sitemaps, but before PR you have 20 sitemaps"? You also forgot about translation domains.

views/templates/admin/configuration.tpl

dkarvounaris · 2023-03-30T18:01:04Z

@kpodemski Thank you!

I added the requested info. If you require something else from me for the pull request to be complete to be merged, please let me know.

Co-authored-by: Krystian Podemski <[email protected]>

kpodemski · 2023-03-31T14:51:56Z

@PrestaShop/qa-team it's quite a technical PR, I'd like to do the tests myself

HanaRebaiQA · 2023-03-31T14:59:50Z

Hello @kpodemski

Yes! you can do it. Good luck

nicosomb · 2023-09-01T12:54:08Z

@dkarvounaris the tests are failing, could you please fix it?

dkarvounaris added 3 commits March 15, 2023 14:06

Fix issue not creating more than 20 sitemaps with cron and CLI reques…

e0aa533

…t tools (as curl)

Both else and exit() not needed. Clearer intention and lower complexi…

a504413

…ty. All paths either return or exit() through class Tools

Troubleshooting documentation about increasing the redirect limit of …

30d4223

…tools for large shops in cron jobs

kpodemski reviewed Mar 29, 2023

View reviewed changes

views/templates/admin/configuration.tpl Outdated Show resolved Hide resolved

Update views/templates/admin/configuration.tpl

cacf3b9

Co-authored-by: Krystian Podemski <[email protected]>

kpodemski approved these changes Mar 31, 2023

View reviewed changes

kpodemski added the waiting for QA label Mar 31, 2023

HanaRebaiQA added the waiting for dev label Mar 31, 2023

nicosomb closed this Sep 1, 2023

nicosomb reopened this Sep 1, 2023

nicosomb added waiting for author and removed waiting for QA waiting for dev labels Sep 1, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix issue with only max 20 XML sitemap files generated on cron jobs #161

Fix issue with only max 20 XML sitemap files generated on cron jobs #161

dkarvounaris commented Mar 15, 2023 •

edited

dkarvounaris commented Mar 15, 2023

kpodemski left a comment •

edited

dkarvounaris commented Mar 30, 2023

kpodemski commented Mar 31, 2023

HanaRebaiQA commented Mar 31, 2023

nicosomb commented Sep 1, 2023

Fix issue with only max 20 XML sitemap files generated on cron jobs #161

Are you sure you want to change the base?

Fix issue with only max 20 XML sitemap files generated on cron jobs #161

Conversation

dkarvounaris commented Mar 15, 2023 • edited

Why this pull request?

What is the root of the issue?

Changes introduced by the pull request

dkarvounaris commented Mar 15, 2023

kpodemski left a comment • edited

Choose a reason for hiding this comment

dkarvounaris commented Mar 30, 2023

kpodemski commented Mar 31, 2023

HanaRebaiQA commented Mar 31, 2023

nicosomb commented Sep 1, 2023

dkarvounaris commented Mar 15, 2023 •

edited

kpodemski left a comment •

edited