Fix indexing issues for pages accessible at multiple URLs #582

Open
excentrickristy opened this issue Dec 6, 2022 · 22 comments

Comments

@excentrickristy

excentrickristy commented Dec 6, 2022

Google Search Console is having trouble indexing specification pages because they are accessible with and without the .html extension with no canonical URL set.

Example:

https://jakarta.ee/specifications/faces/3.0/jakarta-faces-3.0
https://jakarta.ee/specifications/faces/3.0/jakarta-faces-3.0.html

The jakarta.ee site links to the document using the .html extension, so it makes sense to use the version with the extension as the canonical version. However, since the rest of the site uses a trailing-slash scheme for its URLs, it might actually be easier to rewrite the .html URLs to that scheme. We could then run a search and replace on the links containing the .html URLs in the specification pages to strengthen the internal linking structure.

Open to discussion on how to best deal with this one. At the very least we should be setting canonical URLs for the spec pages.

@excentrickristy
Author

@chrisguindon @ivargrimstad

Hi guys,
I was just reviewing these old issues and I know in the other issue, Ivar mentioned that I would have to make requests in each individual spec. Is that true for this as well? I would think this could be addressed on the server level?

Would love to get a better understanding.

@chrisguindon
Member

@excentrickristy I added to my queue an action item to build the site and investigate what is causing this duplication.

I might not have time to get to this this week.

@chrisguindon
Member

I was not able to find time to look at this yet. @oliviergoulet5 can you take some time this week and investigate the reason why we have a duplicate page in our Hugo site?

@chrisguindon
Member

chrisguindon commented Oct 5, 2023

Update: There is a setting in Netlify that is enabled by default called Pretty URLs:
https://docs.netlify.com/site-deploys/post-processing/

Pretty URLs. You can rewrite link URLs to pretty URLs. For example, with Pretty URLs enabled in Site configuration, Netlify rewrites /about to /about/ or /about.html to /about/.

I've disabled it on our test site for the specification repo to see if this will fix the duplication issue. The testing website is under https://jakartaee-specifications.netlify.app/

Based on the testing that @oliviergoulet5 did, this issue only seems to occur on the Netlify server. He was unable to reproduce the error while running the site in a local dev environment.

@chrisguindon
Member

The change didn't seem to have fixed the issue. At least, from my end...

My next step is to download all the files that are deployed on production to confirm that the server does not have any duplicated files that could explain this duplication.

However, the download link on the Netlify site is failing. I will be contacting support about this issue.

@excentrickristy
Author

@chrisguindon that's strange, the post processing option being on in the spec repo should have fixed the issue.

I was reading this but it looks like asset optimization is being deprecated according to the article you linked, but not until the 17th. Maybe there's a conflict there?

After reading that mess, I think you might be best to just reach out to Netlify support to resolve the URL issues as well!

@chrisguindon
Member

I was reading this but it looks like asset optimization is being deprecated according to the article you linked, but not until the 17th. Maybe there's a conflict there?

It's possible but I just took a look and asset optimization is currently disabled for jakartaee-specifications.

I sent a support email to Netlify and shared a link to this issue to provide more context on what we are trying to solve.

Hopefully, they will be able to point us in the right direction. If anything, it will help once I can download the files that we deployed to confirm if the issue is caused by the server or our custom deployment script.

@chrisguindon
Member

chrisguindon commented Oct 10, 2023

Screenshot 2023-10-10 141958

I got access to the deployed files and there are no duplicates. This tells me that the server is creating these duplicate URLs. I have pretty URLs disabled for both the spec repo and jakarta.ee.

I suspect there might be a bug where the pretty URLs setting is always on. I will ask them. If this doesn't work, we might be required to migrate the website to the EF preview framework, where we would have more control over the server.

However, this might not be something we have cycles to do before Q1 2024.

@chrisguindon
Member

I heard back from Netlify. Their response is that their CDN normalizes those URLs: the versions with and without .html are both served by them. It has always worked this way and there are no plans to change this.

However, they did mention that we could look at using Edge Functions (see "Edge Functions overview" in the Netlify Docs) to add the canonical header, or even simply to redirect to a particular version.

@chrisguindon
Member

Disabling pretty URLs on jakarta.ee triggered a regression on our spec pages:
https://gitlab.eclipse.org/eclipsefdn/helpdesk/-/issues/3838

@excentrickristy
Author

@chrisguindon - dredging up an old issue here. Since pretty URLs do not seem to work, can we set a canonical tag on these pages?

@chrisguindon
Member

The challenge here is that these are webpages provided by the Jakarta spec projects. All my team does is download these specs and deploy them as-is on the webserver.

I don't believe these projects are interested in modifying these files since they have been published. 

However, I believe we could set a canonical URL within an HTTP header which we could probably do on the server:
https://developers.google.com/search/docs/crawling-indexing/consolidate-duplicate-urls#rel-canonical-header-method

Netlify allows us to set headers using a _headers file. I can do a test using

If it works, I will share the format with you on how this can be configured in case you want to add more of these.

@excentrickristy
Author

That sounds good. Since the main website links to the version of the page without .html, that's probably the URL we should use as the canonical URL.

Additionally, it would be nice to do a templated title tag to include the version number as well as the word specification.

@chrisguindon
Member

Additionally, it would be nice to do a templated title tag to include the version number as well as the word specification.

@ivargrimstad can we ask our spec projects to follow a template for how to set the title tags for their docs? What would be the best way for @excentrickristy to ask them that?

@ivargrimstad
Member

That should be possible if the asciidoctor plugin supports it. The documents are all generated from AsciiDoc. Would only be for future documents.

@chrisguindon
Member

Thanks @ivargrimstad

@excentrickristy I don't think I need to be involved in that conversation but I would suggest that you start with a new issue asking spec projects to follow a title convention for their future documents!

Regarding the content duplication issue, I just deployed this change:
jakartaee/jakarta.ee@db32dce

This adds a new HTTP header for both https://jakarta.ee/specifications/faces/3.0/jakarta-faces-3.0.html and https://jakarta.ee/specifications/faces/3.0/jakarta-faces-3.0 that says:

❯ curl -I https://jakarta.ee/specifications/faces/3.0/jakarta-faces-3.0.html
HTTP/2 200
accept-ranges: bytes
age: 1
cache-control: public,max-age=0,must-revalidate
cache-status: "Netlify Edge"; fwd=miss
...
link: <https://www.jakarta.ee/specifications/faces/3.0/jakarta-faces-3.0>; rel="canonical"
...

Given the current constraints, I believe this is the best we can do.

Should you require additional configurations, please submit a request by including the details in the same format as my commit within an issue against the Jakarta EE website.

For further details, refer to the official documentation on custom headers from Netlify:
https://docs.netlify.com/routing/headers/

Let me know if you have any questions, otherwise I am thinking we can now close this!

@excentrickristy
Author

@chrisguindon hmmm I'm not seeing those changes on live. the faces spec pages don't appear to have canonicals set?

@excentrickristy
Author

@chrisguindon ah ok I see the canonical in the headers but the issue is that it's using www.jakarta.ee as a root domain instead of jakarta.ee in the canonical link so both faces urls are coming up as non-indexable.
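
A host mismatch like this one can be checked mechanically. As a minimal sketch (the helper names `canonical_from_link_header` and `canonical_matches_host` are hypothetical, not part of any tool used in this thread), the following parses a `Link` header value such as the one in the curl output and compares the canonical host against the host of the page that was requested:

```python
import re
from typing import Optional
from urllib.parse import urlsplit

# Matches one "<target>; params" entry of a Link header value.
# Simplification: assumes parameter values contain no commas.
_LINK_ENTRY = re.compile(r'<([^>]*)>([^,]*)')


def canonical_from_link_header(link_header: str) -> Optional[str]:
    """Return the target of the first rel="canonical" entry, or None."""
    for target, params in _LINK_ENTRY.findall(link_header):
        if re.search(r'rel\s*=\s*"?canonical"?', params, re.IGNORECASE):
            return target
    return None


def canonical_matches_host(page_url: str, link_header: str) -> bool:
    """True if the canonical URL is on the same host as the requested page."""
    canonical = canonical_from_link_header(link_header)
    if canonical is None:
        return False
    return urlsplit(canonical).hostname == urlsplit(page_url).hostname
```

Given the header value from the curl output, `canonical_matches_host` returns False for a page requested on jakarta.ee, because the canonical target is on www.jakarta.ee.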

@chrisguindon
Member

Good catch @excentrickristy - I made the fix with jakartaee/jakarta.ee@6fd6676

I would expect it to go live in the next 15 minutes.

@excentrickristy
Author

@chrisguindon Thank you - it's coming up properly now!

So, if I understand correctly, I will have to file an issue for each spec doc with the code from your commit, updated to reflect the proper URLs.

I will tag you in my first one to make sure I am doing it right! Thank you for your help.

@chrisguindon
Member

@excentrickristy - We can do a bunch of them at the same time. It would definitely simplify things if you can provide the changes in the same format as this file so that we can simply copy and paste what you need: https://github.com/jakartaee/jakarta.ee/blob/src/static/_headers
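
For anyone preparing several of these at once, the entries could be generated rather than typed by hand. The sketch below is an assumption-laden illustration: the exact layout used in the linked _headers file is not reproduced in this thread, so it follows Netlify's documented syntax (a path line followed by indented `Name: value` lines), and the function name `headers_entries` is hypothetical. It emits one rule for the extensionless URL and one for the .html variant, both pointing at the extensionless canonical URL:

```python
from urllib.parse import urlsplit


def headers_entries(canonical_url: str) -> str:
    """Build _headers rules for a page served both with and without .html.

    Assumes canonical_url is the extensionless form of the page URL.
    """
    path = urlsplit(canonical_url).path
    link = f'  Link: <{canonical_url}>; rel="canonical"'
    # One rule per URL variant, each pointing at the same canonical URL.
    return "\n".join([path, link, f"{path}.html", link]) + "\n"
```

Calling `headers_entries("https://jakarta.ee/specifications/faces/3.0/jakarta-faces-3.0")` returns four lines of text that can be compared against the existing file before being pasted into an issue.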

@chrisguindon
Member

However, we can start with updating one to make sure you have the correct format.
