Fix issue with only max 20 XML sitemap files generated on cron jobs #161
1_index_sitemap.xml
file. To run the test, execute the sitemap generation with the cron job link given by the module on the CLI, for example:

/usr/bin/curl -L --max-redirs 99 "http://shop.com/modules/gsitemap/gsitemap-cron.php?token=1234..."

and check that sitemap generation adds files past 1_en_20_sitemap.xml.

Why this pull request?
When generating a sitemap using cron, a CLI tool like curl is typically employed to send an HTTP request to gsitemap-cron.php. For large sites with tens or hundreds of thousands of products, the module divides the sitemap files by size. However, the generation process halts after creating the 20th sitemap XML file, preventing the inclusion of all products in the sitemap. This pull request resolves this issue, enabling sitemap generation to proceed beyond the initial 20 files and ensuring that all products are added to the sitemap.
What is the root of the issue?
After every 20 generated files, the original code, rather than simply redirecting back to the module to generate the next sitemap XML file (which would work), sends an HTTP "Refresh: 5" header instead of an HTTP redirect.
This header is undocumented (note that this refers to an HTTP header, not the meta refresh tag, which is deprecated as well) and absent from the HTTP standard. The Refresh header is a non-standard leftover from the Netscape era that has only ever been maintained by browsers. Most, if not all, non-browser HTTP toolkits, such as curl, do not implement it. As a result, there is no guarantee that using the Refresh header with tools other than browsers, which is almost always the case with cron jobs, will successfully complete the sitemap generation for very large shops. Quite the opposite: generation stops with an unrecoverable error.
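The difference in client behavior can be reproduced locally. The following is a minimal sketch (not the module's code, just an assumed stand-in server) showing that a non-browser HTTP client follows a standard Location redirect but silently ignores the non-standard Refresh header:

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/refresh":
            # Non-standard header: a browser would re-request after 5 s,
            # curl/urllib simply return this response and stop.
            self.send_response(200)
            self.send_header("Refresh", "5; url=/done")
            self.end_headers()
            self.wfile.write(b"still here")
        elif self.path == "/redirect":
            # Standard redirect: every HTTP client follows this.
            self.send_response(302)
            self.send_header("Location", "/done")
            self.end_headers()
        else:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"done")

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = f"http://127.0.0.1:{server.server_port}"

refresh_body = urllib.request.urlopen(f"{base}/refresh").read()
redirect_body = urllib.request.urlopen(f"{base}/redirect").read()
server.shutdown()

print(refresh_body)   # b'still here' -- the Refresh header was ignored
print(redirect_body)  # b'done' -- the Location redirect was followed
```

This mirrors what happens to the cron job: the chain of generation steps silently ends at the first Refresh response instead of continuing to the next batch.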
Sources:
curl/curl#3657
https://daniel.haxx.se/blog/2019/03/12/looking-for-the-refresh-header/
request/request#92
https://chromium-review.googlesource.com/c/chromium/src/+/1520146 (a possible intention to remove support even in chrome, because non-standard)
Moreover, there is no need for, or advantage in, using this HTTP header for every 20th cron job step: a simple redirect works as intended, and the refresh delay – even if it worked – would serve no purpose.
Changes introduced by the pull request
There are no deprecations or backward compatibility breaks, and no new or otherwise modified code that could potentially disrupt existing functionality.
The only additional change involves simplifying the else statements and removing superfluous exit() statements for improved clarity and reduced complexity, as seen in this commit: dkarvounaris@a504413?diff=split&w=1
Furthermore, I have updated the documentation to mention that the redirect limit of the tool used for cron jobs may need to be raised. This is particularly relevant for very large shops, which may require more redirects than the tool's default limit – something that can otherwise go unnoticed and be difficult to debug.
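As a rough back-of-the-envelope check (a hypothetical helper, not part of the module), the number of redirects the cron HTTP client must be allowed to follow grows with shop size, since the module redirects back to itself after every 20 generated sitemap files:

```python
FILES_PER_PASS = 20  # batch size after which the module redirects to itself

def redirects_needed(total_sitemap_files: int,
                     files_per_pass: int = FILES_PER_PASS) -> int:
    # One redirect per completed pass except the last; round passes up.
    passes = -(-total_sitemap_files // files_per_pass)  # ceiling division
    return max(0, passes - 1)

print(redirects_needed(20))    # 0  -- small shop, no redirect needed
print(redirects_needed(200))   # 9
print(redirects_needed(2000))  # 99 -- matches the --max-redirs 99 example above
```

So a shop producing a couple of thousand sitemap files already needs a redirect limit in the range of the --max-redirs 99 value shown in the test command, which may exceed a tool's default.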