[SCRAPER] - Scraping recipe from theloopywhisk.com fails #4575

Open
3 tasks done
janchrillesen opened this issue Nov 18, 2024 · 0 comments
Labels
bug Something isn't working scraper triage


First Check

  • I used the GitHub search to find a similar issue and didn't find it.

  • I have verified that this issue is not related to the underlying library
    hhursev/recipe-scrapers by 1) checking the debugger and confirming that data is returned,
    2) verifying that there are errors in the log related to application-level code, or
    3) verifying that the site provides recipe data, or is otherwise supported by
    hhursev/recipe-scrapers

  • This issue can be replicated on the demo site (https://demo.mealie.io/)

Please provide 1-5 example URLs that are having errors

https://theloopywhisk.com/2024/06/29/gluten-free-burger-buns/

Please provide your logs for the Mealie container (docker logs <container-id> > mealie.logs)

theloopywhisk.com was added as a supported site in hhursev/recipe-scrapers#1220

INFO 2024-11-18T15:47:33 - HTTP Request: GET https://theloopywhisk.com/2024/06/29/gluten-free-burger-buns/ "HTTP/1.1 403 Forbidden"
INFO 2024-11-18T15:47:34 - HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
INFO 2024-11-18T15:47:34 - [xx.xx.xx.xx:0] 400 Bad Request "POST /api/recipes/create/url HTTP/1.1"
INFO 2024-11-18T15:47:58 - [127.0.0.1:43332] 200 OK "GET /api/app/about HTTP/1.1"

The site appears to check the User-Agent header and block requests based on it.

Running

curl https://theloopywhisk.com/2024/06/29/gluten-free-burger-buns/

returns "Sorry, you have been blocked", but the same request with a browser user agent returns the complete recipe:

curl --user-agent "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15" https://theloopywhisk.com/2024/06/29/gluten-free-burger-buns/

The site seems to be behind Cloudflare, so it most likely rejects requests from specific user agents (the 403 Forbidden in the log above is consistent with that).

Should support for using a non-standard user-agent for scraping be requested here, or in the upstream scraping library?
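
To illustrate what I mean, here is a minimal sketch of fetching the page with a browser-like user agent, using only the Python standard library. It assumes nothing about how Mealie or recipe-scrapers actually make their requests; `build_request` and `fetch_html` are hypothetical helper names, and the user agent string is simply the one from the curl test above.

```python
import urllib.request

# Assumption: any browser-like User-Agent would do; this is the string
# from the curl command above, not a value Mealie is known to use.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/605.1.15 (KHTML, like Gecko) "
    "Version/14.1.1 Safari/605.1.15"
)


def build_request(url: str, user_agent: str = BROWSER_USER_AGENT) -> urllib.request.Request:
    """Hypothetical helper: build a GET request that sends a custom User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Hypothetical helper: fetch a page with the custom User-Agent.

    Raises urllib.error.HTTPError if the server still answers with an
    error status such as 403.
    """
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch_html("https://theloopywhisk.com/2024/06/29/gluten-free-burger-buns/")` would be the Python equivalent of the second curl command. Whether the site keeps accepting this user agent is up to its Cloudflare configuration, so this is only a way to reproduce the difference, not a guaranteed fix.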

Deployment

Docker (Linux)
