Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issue with only max 20 XML sitemap files generated on cron jobs #161

Open
wants to merge 4 commits into
base: dev
Choose a base branch
from

Conversation

dkarvounaris
Copy link

@dkarvounaris dkarvounaris commented Mar 15, 2023

Questions Answers
Description? Full description below
Type? bug fix
BC breaks? no
Deprecations? no
Fixed ticket? n/a
How to test? To thoroughly test the sitemap generation process, it is necessary to have a sufficient number of products, images, and other pages or content. This will allow the generation of at least 20 sitemaps, which can then be used for testing purposes. The number of products required is not fixed and will depend on factors such as the number of images per product. For instance, for some shops it may be 5,000 products, for others 20,000 products. During the sitemap generation process, if a memory limit is exceeded (a specific check exists in gsitemap.php), the process may also move already to the next file with even less links processed. Currently, our shop has 200,000 products and 2.1 million images, resulting in the generation of 80 sitemap files. The size of these files varies between 2MB and 30MB. Prior to the patch, the generation process would always stop at the 20th file. However, after the patch, the sitemap generation process will continue until all available links are included in sitemap files, with the entire list of sitemap files being contained in the main 1_index_sitemap.xml file. To run the test, execute the sitemap generation with the link for cron jobs given by the module on the CLI, as example: /usr/bin/curl -L --max-redirs 99 "http://shop.com/modules/gsitemap/gsitemap-cron.php?token=1234..." and check that sitemap generation adds files past 1_en_20_sitemap.xml.

Why this pull request?

When generating a sitemap using cron, a CLI tool like curl is typically employed to send an HTTP request to gsitemap-cron.php. For large sites with tens or hundreds of thousands of products, the module divides the sitemap files by size. However, the generation process halts after creating the 20th sitemap XML file, preventing the inclusion of all products in the sitemap. This pull request resolves this issue, enabling sitemap generation to proceed beyond the initial 20 files and ensuring that all products are added to the sitemap.

What is the root of the issue?

The original code evaluates every 20 files generated and, rather than simply redirecting back to the module for the generation of the subsequent sitemap XML file, which would work, it sends an HTTP header "Refresh: 5" instead of an HTTP redirect.

This header lacks documentation (note that this refers to an HTTP header, not the meta refresh tag, which however is deprecated too) and is absent from the HTTP standard. The Refresh HTTP header is a non-standard implementation from the Netscape era, which has only been maintained by browsers. Most, if not all, non-browser HTTP toolkits, such as curl, do not implement it. As a result, it is not guaranteed that using the Refresh HTTP header with tools other than browsers, which is almost always the case with cron jobs, will successfully complete the sitemap generation for very large shops. Rather the opposite, generation will stop with an unrecoverable error.

Sources:
curl/curl#3657
https://daniel.haxx.se/blog/2019/03/12/looking-for-the-refresh-header/
request/request#92
https://chromium-review.googlesource.com/c/chromium/src/+/1520146 (a possible intention to remove support even in chrome, because non-standard)

Moreover, there is no need or advantage to using this HTTP header for every 20th cron job redirect, as a simple redirect functions as intended, and the refresh – even if it were to work – serves no purpose.

Changes introduced by the pull request

There are no deprecations or backward compatibility breaks, and no new or otherwise modified code that could potentially disrupt existing functionality.

The only additional change involves simplifying the else statements and removing superfluous exit() statements for improved clarity and reduced complexity, as seen in this commit: dkarvounaris@a504413?diff=split&w=1

Furthermore, I have updated the documentation to include information on the possible need of adjusting the number of redirects followed by the tool used in conjunction with cron jobs. This is particularly useful for very large shops which may require more redirects than the tool's default redirect limit. Something which can go otherwise unnoticed and may be difficult to debug.

@dkarvounaris
Copy link
Author

Just to clarify, the bugfix has been in production for several weeks now, although with the latest module release. I have now deployed the development branch containing the pull request to the server for testing, and it is still functioning as intended. 74 required sitemaps have been generated successfully with gsitemap-cron.php.

image

Copy link
Contributor

@kpodemski kpodemski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dkarvounaris! LGTM!

I updated your PR to include the table required for our processes. Could you please write a short "How to test"? Something like "Have X products in your store, run CRON from CLI, you should have X sitemaps, but before PR you have 20 sitemaps"? You also forgot about translation domains.

views/templates/admin/configuration.tpl Outdated Show resolved Hide resolved
@dkarvounaris
Copy link
Author

@kpodemski Thank you!

I added the requested info. If you require something else from me for the pull request to be complete to be merged, please let me know.

@kpodemski
Copy link
Contributor

@PrestaShop/qa-team it's quite a technical PR, I'd like to do the tests myself

@HanaRebaiQA
Copy link

Hello @kpodemski

Yes! you can do it. Good luck

@nicosomb
Copy link
Contributor

nicosomb commented Sep 1, 2023

@dkarvounaris the tests are failing, could you please fix it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Waiting for author
4 participants