Improve SEO and maintenance of documentation versions #3741
And on the topic of the flyout, here's my thinking: I think I have become numb to the whole Now, look at what happens when I change the default version to be
By having the number in the URL and also in the flyout by default, I think it's more obvious how the user should go and switch to their version of choice. @stichbury in your opinion, do you think this would make our docs journey more palatable? |
I think this is good, but doesn't it mean that you have to remember to increment the version number for |
It does... but sadly RTD doesn't allow lots of customization about the versioning rules for now. It's a small price to pay though, would happen only a handful of times per year. |
TIL: |
To note, RTD has automation rules https://docs.readthedocs.io/en/stable/automation-rules.html#actions-for-versions although the |
I think the Here's the 📣 proposal
The only thing we need to understand is what would be the impact on indexing and SEO cc @noklam @ankatiyar Thoughts @stichbury ? |
I've somewhat lost track of what your I would personally consider if it's sufficient to just keep |
In principle this is related to our indexing strategy. Let's chat next week |
Renamed this issue to better reflect what we should do here. In readthedocs/readthedocs.org#10648 (comment), RTD staff gave an option to inject meta It's clear that we have to shift our strategy by:
|
Today I had to manually index https://docs.kedro.org/projects/kedro-datasets/en/kedro-datasets-3.0.0 on Google (maybe there are no inbound links?) and I couldn't index 3.0.1 (it's currently blocked by our |
Summary of things to do here:
Refs: https://www.stevenhicks.me/blog/2023/11/how-to-deindex-your-docs-from-google/, https://developers.google.com/search/docs/crawling-indexing/consolidate-duplicate-urls |
Today I've been researching this again (yeah, I have weird hobbies...). I noticed that projects hosted on https://docs.rs don't seem to exhibit these SEO problems, and also that they seemingly take a basic, but effective, approach. Compare https://docs.rs/clap/latest/clap/ with https://docs.rs/clap/2.34.0/clap/. There is no trace of What they do have, though, is very lean sitemaps. If you look at https://docs.rs/-/sitemap/c/sitemap.xml, there are only 2 entries for
<url>
<loc>https://docs.rs/clap/latest/clap/</loc>
<lastmod>2024-08-10T00:24:50.344647+00:00</lastmod>
<priority>1.0</priority>
</url>
<url>
<loc>https://docs.rs/clap/latest/clap/all.html</loc>
<lastmod>2024-08-10T00:24:50.344647+00:00</lastmod>
<priority>0.8</priority>
</url>
Compare it with https://docs.kedro.org/sitemap.xml, which is, in comparison... less than ideal:
<url>
<loc>https://docs.kedro.org/en/stable/</loc>
<lastmod>2024-08-01T18:53:11.571849+00:00</lastmod>
<changefreq>weekly</changefreq>
<priority>1</priority>
</url>
<url>
<loc>https://docs.kedro.org/en/latest/</loc>
<lastmod>2024-08-09T09:39:27.628501+00:00</lastmod>
<changefreq>daily</changefreq>
<priority>0.9</priority>
</url>
<url>
<loc>https://docs.kedro.org/en/0.19.7/</loc>
<lastmod>2024-08-01T18:53:11.647322+00:00</lastmod>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>https://docs.kedro.org/en/0.19.6/</loc>
<lastmod>2024-05-27T16:32:42.584307+00:00</lastmod>
<changefreq>monthly</changefreq>
<priority>0.7</priority>
</url>
<url>
<loc>https://docs.kedro.org/en/0.19.5/</loc>
<lastmod>2024-04-22T11:56:55.928132+00:00</lastmod>
<changefreq>monthly</changefreq>
<priority>0.6</priority>
</url>
<url>
<loc>https://docs.kedro.org/en/0.19.4.post1/</loc>
<lastmod>2024-05-17T12:25:27.050615+00:00</lastmod>
<changefreq>monthly</changefreq>
<priority>0.5</priority>
</url>
...
The way I read this is that RTD is treating tags as long-lived branches, and as a result telling search engines that docs of old versions will be updated monthly, which in our current scheme is incorrect. I am not sure if this is something worth reporting to RTD, but maybe we should look at uploading a custom |
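To make the comparison above easier to repeat, here is a minimal sketch that splits a sitemap's entries into moving aliases and pinned versions. The function name `split_sitemap_urls`, the `ALIASES` set, and the assumption that URLs end in `/en/<version>/` are all illustrative and not part of RTD or Kedro tooling.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: these are the only version slugs we want search engines to see.
ALIASES = {"stable", "latest"}


def split_sitemap_urls(sitemap_xml: str) -> tuple[list[str], list[str]]:
    """Return (alias_urls, pinned_urls) found in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    alias_urls, pinned_urls = [], []
    for element in root.iter():
        # Ignore the XML namespace, if any, and keep only <loc> elements.
        if element.tag.rsplit("}", 1)[-1] != "loc":
            continue
        url = (element.text or "").strip()
        # Assumed URL shape: https://host/.../en/<version>/
        match = re.search(r"/en/([^/]+)/?$", url)
        version = match.group(1) if match else ""
        if version in ALIASES:
            alias_urls.append(url)
        else:
            pinned_urls.append(url)
    return alias_urls, pinned_urls
```

Run against a downloaded copy of the sitemap, a long `pinned_urls` list would point to the same situation described above: many tagged versions being advertised to crawlers alongside the aliases.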
Reopening until we solve the issue (whether improving the sitemaps, retroactively changing the tags, or painting a pentagon with a turkey's head...) |
Added @DimedS to our Google Search Console, hope this will help! |
For reference, I tried the redirection trick described in #3741 (comment) for kedro-datasets #4145 (comment) and it seems to be working. I don't want to boil the ocean right now because we're in the middle of a delicate SEO experimentation phase, but when the dust settles, I will propose this for all our projects. |
The sitemap hasn't changed 😬 https://docs.kedro.org/sitemap.xml |
Newsflash: RTD now excludes hidden versions from the automatically generated sitemap readthedocs/readthedocs.org#11675 |
After a discussion with @astrojuanlu and an unsuccessful attempt to apply a custom sitemap.xml to the Kedro documentation in issue #4261, we changed all Kedro documentation versions, except for "stable" and "latest," to hidden in the Read the Docs (RTD) web dashboard. This immediately updated our However, there is still an issue with subfolders, "viz" and "datasets." Hiding versions for these subfolders does not affect |
I received an answer from the RTD team:
If I understand correctly, this means that to implement a manual I think we should give this a try. What do you think, @astrojuanlu? After we hid all versions of the main Kedro project, the search results improved for Kedro, but for datasets and Viz, it still seems to be referencing old versions. For example, if I search "kedro matplotlib dataset" on Google, I see everything except the correct link:
|
From my understanding, removing all old versions from the sitemap didn't hide them from the search results: In fact, none of these URLs are referenced in any of our current sitemaps. Not even
Long story short, the hypothesis I proposed in #3741 (comment) has been disproven. Just limiting the Now, if we use The method suggested by Google has 2 flavors:
|
@astrojuanlu, I agree that to achieve more reliable blocking of old documentation versions from being indexed, we should use content="noindex". I can work on implementing this approach in our Sphinx build. Additionally, if we continue with the current autogenerated setup, it’s likely that recent versions of the DataFrame and Viz documentation will remain unindexed, as we've observed. Therefore, I think we should consider reverting to our previous custom-generated |
Indeed, I'd say let's split this problem in two?
|
@astrojuanlu, I explored a few approaches in PR #4261, and one of them works: commit ff07526. This solution adds
For the current release, I propose moving forward only with a manual update to |
Yes let's move forward with this for now 👍🏼 |
I know I'm a pain in the neck 😬 but I'll leave this ticket open until we're happy with the solution... |
The new |
We are currently working on the final step: adding a However, our initial attempt didn’t work as expected. We’re now waiting for a patch from RTD to be implemented soon. |
This is now possible on RTD readthedocs/readthedocs.org#10648 (comment)
(from readthedocs/addons#431 (comment)) Example:
In theory this then is present in all old versions 🤞🏼 unclear what the |
Currently, in collaboration with the Read the Docs (RTD) team, we have prepared a temporary JavaScript script located at: 📍 https://storage.googleapis.com/rtd_meta_tag_file/deindex-old-docs.js This script executes during every docs run and injects a noindex meta tag into hidden versions of our docs to prevent them from being indexed by search engines. We need to test and confirm that this approach effectively prevents indexing. Once validated, we will migrate the script to the Kedro docs repository after the next release, @astrojuanlu |
It has been about two weeks since we started using the script. We've seen an effect, as the number of indexed pages has dropped by approximately 5%. However, we can still see some old versions of the documentation being indexed by Google. The likely reason for this is that these pages are blocked from crawling by robots.txt, preventing search engine crawlers from reading the latest version of the page. As a result, they cannot detect that the page now includes the injected noindex meta tag. Despite this, it seems that these pages no longer appear in search results, likely because they have become irrelevant for ranking. Overall, it looks like we have achieved our goal: search is now working as expected. To finalise this solution, we need to take a few last steps:
Are we comfortable with this approach? The downside is that we will need to manually hide versions after each release. Another drawback is that users won’t see available versions in the console. I can try modifying the script so that it no longer relies on the hidden state. What do you think, @astrojuanlu? |
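The interaction between robots.txt and noindex described above can be checked offline. The sketch below uses the standard library's `urllib.robotparser` to test whether a crawler is allowed to fetch a given URL. If it is not, the crawler never downloads the page and so cannot see a noindex tag on it. The `ROBOTS_TXT` content and the example paths are hypothetical and do not reproduce the real Kedro file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /en/0.18.0/
Allow: /en/stable/
"""


def is_crawlable(robots_txt: str, url: str, agent: str = "Googlebot") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

Any old-version URL for which `is_crawlable` returns False is a candidate for the behaviour described above: still listed by the search engine, but never re-read after the noindex tag was added.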
@DimedS Thanks for the analysis! I think the ideal thing would indeed be to see some versions in the flyout menu, while tweaking the script so that only the last stable one is indexed. About the pages blocked by |
Yeah it was part of this effort of de-indexing versions from Google... let's see if we can decouple the version list from the index 🙏🏼 latest is still there and also an alias: https://docs.kedro.org/en/nightly/ |
In #2980 we discussed the fact that too many Kedro versions appear in search results.
We fixed that in #3030 by manually controlling which versions we wanted to be indexed.
This caused a number of issues though, most importantly #3710: we had accidentally excluded our subprojects from our search results.
We fixed that in #3729 in a somewhat unsatisfactory fashion. In particular, there are concerns about consistency and maintainability #3729 (comment) (see also #2600 (comment) about the problem of projects under kedro-org/kedro-plugins not having a stable version).
In addition, my mind has evolved a bit and I think we should only index 1 version in search engines: stable. There were concerns about users not understanding the flyout menu #2980 (comment) and honestly the latest part is also quite confusing (#2823, readthedocs/readthedocs.org#10674) but that's a whole separate discussion.
For now, the problems we want to solve are
robots.txt, ideally by not having to ever touch it again.