Indexing: Home pages which exceed the 1Mb size limit #150

Open
m-i-l opened this issue May 16, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

Contributor

m-i-l commented May 16, 2024

It is important that every domain has a page in the index with is_home=true, so that the domain appears on the Browse page. See also #102.

However, some sites have home pages which exceed the maximum page size (1MB) and so aren't indexed. Examples include https://www.gleech.org/ (5.1MB), https://www.allendowney.com/blog/ (1.5MB) and https://www.swyx.io/ (1.4MB).

The workaround is to identify domains without home pages via consistencycheck.py, check whether the home page's size is >1MB, and if so set a different home page (e.g. the /about page, which isn't ideal) and reindex. That's a bit of a faff though, so I want to see if there's a better way of handling this.
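
The size check in that workaround could be scripted. The following is only a rough sketch, not code from this project: `MAX_PAGE_BYTES`, `read_capped` and `home_page_too_large` are hypothetical names, and treating the limit as 1024 * 1024 bytes is an assumption. It stops reading as soon as the limit is passed, so a 5MB page is never downloaded in full.

```python
import io
from urllib.request import Request, urlopen

# Assumption: the size limit is 1024 * 1024 bytes; the real value may differ.
MAX_PAGE_BYTES = 1024 * 1024


def read_capped(stream, limit=MAX_PAGE_BYTES, chunk_size=65536):
    """Read at most limit + 1 bytes; return (bytes_read, exceeded)."""
    total = 0
    while total <= limit:
        # Never ask for more than one byte beyond the limit.
        chunk = stream.read(min(chunk_size, limit + 1 - total))
        if not chunk:
            break  # end of body reached before passing the limit
        total += len(chunk)
    return total, total > limit


def home_page_too_large(url, limit=MAX_PAGE_BYTES, timeout=10):
    """Fetch url and report whether its body is larger than limit bytes."""
    # The User-Agent value is a placeholder for this sketch.
    request = Request(url, headers={"User-Agent": "size-check-sketch"})
    with urlopen(request, timeout=timeout) as response:
        _, exceeded = read_capped(response, limit)
    return exceeded


# Offline demonstration: an in-memory stream stands in for an HTTP response.
demo_size, demo_exceeded = read_capped(io.BytesIO(b"x" * 2048), limit=1024)
print(demo_size, demo_exceeded)  # prints: 1025 True
```

Running `home_page_too_large` over the domains flagged by consistencycheck.py would show which of them are missing a home page because of the size limit rather than for some other reason.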

Ideas include logging and alerting when this is detected during indexing, and potentially even seeing if there is a way of still adding the home page URL to the search index without indexing the content.
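
As a minimal sketch of how those two ideas might fit together (not the project's actual indexing code): an oversized home page triggers a warning and is indexed as a URL-only record, while other oversized pages are skipped as before. Only `is_home` comes from this issue; `build_index_document`, the `url`, `domain` and `content` field names, and the byte value of the limit are assumptions made for illustration.

```python
import logging
from urllib.parse import urlsplit

logger = logging.getLogger("indexing_sketch")  # hypothetical logger name

# Assumption: the size limit is 1024 * 1024 bytes; the real value may differ.
MAX_PAGE_BYTES = 1024 * 1024


def build_index_document(url, body, is_home, limit=MAX_PAGE_BYTES):
    """Return a dict to add to the index, or None if the page is skipped."""
    domain = urlsplit(url).netloc
    if len(body) <= limit:
        # Normal case: index the page together with its content.
        return {
            "url": url,
            "domain": domain,
            "is_home": is_home,
            "content": body.decode("utf-8", errors="replace"),
        }
    if is_home:
        # Oversized home page: alert, then index the URL without content
        # so the domain still has an is_home=true entry.
        logger.warning(
            "Home page %s is %d bytes (limit %d); indexing URL only",
            url, len(body), limit,
        )
        return {"url": url, "domain": domain, "is_home": True, "content": None}
    # Oversized non-home page: skip it.
    logger.info("Skipping %s: %d bytes exceeds limit %d", url, len(body), limit)
    return None
```

The warning gives something to alert on, and the URL-only record would keep the domain on the Browse page, at the cost of the home page not being searchable by its content.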

m-i-l added the enhancement label May 16, 2024
m-i-l changed the title from "Indexing: Better handle home pages which are too large" to "Indexing: Home pages which exceed the 1Mb size limit" May 16, 2024