-
It may also be worth looking into what other web scraping services do, as there are commercial offerings which provide capabilities similar to JobFunnel's. Other stopgaps are falling back to Selenium on scrape failure, or more configurability for VPNs (e.g. switch VPNs after N scrapes or after a scrape failure). We can fairly easily detect the "I am human" page. In the short term I think we should provide a better error for Indeed specifically around detecting this page. As an aside, I just tested it now and got to ~66 scrapes before the CAPTCHA, oh well.
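A minimal sketch of the kind of detection and error described above. The marker strings, `CaptchaDetectedError`, and both function names are assumptions made for illustration, not part of JobFunnel; the markers would need to be checked against the page Indeed actually serves.

```python
class CaptchaDetectedError(RuntimeError):
    """Raised when a response looks like a human-verification page."""


# Hypothetical markers. A bare "captcha" match can also hit ordinary pages
# that merely embed a CAPTCHA script, so tune these against real responses.
CAPTCHA_MARKERS = ("captcha", "verify you are human", "i am human")


def looks_like_captcha(html: str) -> bool:
    """Return True if any known marker appears in the page source."""
    lowered = html.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)


def check_for_captcha(html: str, provider: str) -> None:
    """Raise a descriptive error instead of silently yielding zero jobs."""
    if looks_like_captcha(html):
        raise CaptchaDetectedError(
            f"{provider}: got a CAPTCHA page instead of search results; "
            "try again later or from a different IP address."
        )
```

Calling `check_for_captcha(page_source, "IndeedScraperUSAEng")` right after each page fetch would turn a generic scrape failure into a message that names the cause.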
-
Right. I noticed this too a couple of weeks back. And this is exactly why I thought the factory pattern for Selenium might be a good fit. If a scrape fails (and like you said, we should have better mechanisms for detecting when a CAPTCHA shows up), then we just resend the request via a random proxy.
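A rough sketch of that retry idea, kept independent of Selenium so it runs on its own. `ScrapeFailed`, `fetch_with_proxy_retry`, and the `fetch(url, proxy)` callable are all hypothetical names; the real fetch would be whatever the scraper uses to load a page.

```python
import random
from typing import Callable, Optional, Sequence


class ScrapeFailed(Exception):
    """Raised by a fetch function when a request fails or hits a CAPTCHA."""


def fetch_with_proxy_retry(
    url: str,
    fetch: Callable[[str, Optional[str]], str],
    proxies: Sequence[str],
    max_retries: int = 3,
    rng: Optional[random.Random] = None,
) -> str:
    """Try a direct request first, then up to max_retries random proxies."""
    rng = rng or random.Random()
    try:
        return fetch(url, None)  # first attempt: no proxy
    except ScrapeFailed:
        pass
    untried = list(proxies)
    rng.shuffle(untried)  # random order, and no proxy is tried twice
    for proxy in untried[:max_retries]:
        try:
            return fetch(url, proxy)
        except ScrapeFailed:
            continue
    raise ScrapeFailed(f"all attempts failed for {url}")
```

Passing the fetch function in as an argument keeps the retry policy separate from the transport, so the same loop would work with plain HTTP requests or a browser-backed engine.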
-
How can I tell whether I'm affected by the CAPTCHA issue? I'm getting 0 results even though 2 pages of jobs are found, along with a plain scraping error (so not even one result, and no previous scrapes from this machine).
-
I believe seeing the CAPTCHA on the first ever (!) try means Indeed is able to detect the headless browser. One option would be to try out whether https://github.com/diprajpatra/selenium-stealth helps. Another might be to cut down the JS loaded in the browser as far as possible.
-
I have been trying the Apify Indeed scraper with success. It seems to have a way around errors and CAPTCHAs.
-
Stupid idea, just running it up the flagpole, because I thought this was a good project. Since this project does use Selenium, and presumably one of the various drivers as well, why not simply load "Buster, the captcha busting browser extension"? The developer has invested heavily in its development, and it worked on a project where I encountered the same issue. Admittedly, I was also using a randomly selected proxy, a randomized user agent, and something to mitigate Cloudflare. It did work, though, and took care of the CAPTCHA for me. Sorry if this is a stupid suggestion. Thanks for your time.
-
Can we try using a CAPTCHA solving service (for web scraping)? For example, 2captcha has an API which can be integrated for solving CAPTCHAs (https://2captcha.com/2captcha-api). The downside is that it does require spending money, but it's fairly cheap.
-
It isn't a bad idea, almost practical, and I am solely speaking for myself here, but I personally am always turned off by projects which require pay-for-use APIs. Their use creates barriers that prevent others from accessing technology and thus limits innovation. It also seems counterintuitive to expect an individual who needs a job to pay to acquire one, because that individual is usually already experiencing financial hardship.
-
So, after getting everything set up and ready to go, which was more involved than originally assumed, here is the output from log.log:
[2023-02-18 05:50:31,156] [DEBUG] JobFunnel: No master-CSV present, did not update block-list: job_search_results/block_list.json
[2023-02-18 05:50:31,157] [INFO] JobFunnel: Scraping local providers with: ['IndeedScraperUSAEng', 'MonsterScraperUSAEng']
[2023-02-18 05:50:31,303] [DEBUG] IndeedScraperUSAEng: Got Base search results page: https://www.indeed.com/jobs?q=Linux&l=boston%2C+MA&radius=50&limit=50&filter=0
[2023-02-18 05:50:31,307] [ERROR] JobFunnel: Failed to scrape jobs for IndeedScraperUSAEng
[2023-02-18 05:50:31,308] [DEBUG] JobFunnel: Scraped 0 jobs from IndeedScraperUSAEng, took 0.151s
[2023-02-18 05:50:31,312] [INFO] MonsterScraperUSAEng: No get() or set() will be done for Job attrs: ['REMOTENESS']
[2023-02-18 05:50:31,690] [ERROR] JobFunnel: Failed to scrape jobs for MonsterScraperUSAEng
[2023-02-18 05:50:31,690] [DEBUG] JobFunnel: Scraped 0 jobs from MonsterScraperUSAEng, took 0.382s
[2023-02-18 05:50:31,690] [INFO] JobFunnel: Completed all scraping, found 0 new jobs.
[2023-02-18 05:50:31,699] [WARNING] JobFunnel: No new jobs were added to CSV.
On a brighter note, while manually checking the URL provided by Indeed, I noticed the connection was routed So, below are three mitigation tools, which might prove to be useful doing thus.
-
@PaulMcInnis With much embarrassment, I would like to say the issue has been resolved. It resulted from conflicting libraries on my local development server, caused by the system updating from python-3.10 to python-3.11. After performing some cleanup, I reran the program and it works fine. I hope to get a chance to update the Indeed element selectors soon. This will give us the opportunity to test whether Buster does the trick to circumvent the CAPTCHA. There is lightning now, gotta run. Cheers.
-
@caeochoa I have been working on it very slowly, until I got totally distracted by one of life's humdingers. I have forked it, and most of my revisions have been pushed to the fork. You are more than welcome to take a look. I created a Discussion category on my GitHub profile. We can go there or move to Matrix. Either place works, I just try to avoid discordia. Cheers
-
Your proposed solution seems solid, incorporating HTTP_Request_Randomizer and transitioning to Selenium for dynamic site support. Have you considered integrating Crawlbase for enhanced functionality?
-
Not sure if this project is dead, but has anyone tried or thought about hitting the mobile app endpoints for Indeed as an alternative? It's definitely a WebView, but there is no CAPTCHA (yet, at least) that I can see. It returns fewer results (20 per page vs. 50 per page), but significantly reduces the complexity. I've found the endpoints through Charles Proxy, and am able to scrape in my minimal tests without any CAPTCHA issues.
Additionally, I have been working with SeleniumBase UC mode to get the Indeed scraper to bypass the CAPTCHA on desktop. It works in headed mode, and also works (sometimes) in headless mode. I have at least managed to get it to a place where it can scrape and save to the CSV. We could just start a headed browser for requests when necessary, but it would be nice if we could do everything headless, especially for large scrapes.
-
Have opened #166 as a fix to the Indeed scraping issue. Haven't had a ton of time to work on this outside of the day job, so would appreciate any help on review/testing/updating, as well as implementation of the French and German versions of the Indeed scraper if this approach seems reasonable. Fixes should not be time consuming, just need to check that the sites are consistent.
For now, I have just taken the mobile endpoints I found analyzing the mobile version of the application, and am using randomized mobile agents. The scraper is working, and seems to be working consistently with no CAPTCHA issues.
I have also updated the parsing so it now accounts for tags and remoteness. For remoteness in particular, I think we can probably make this even better by searching for the term in the title and description, as the tags aren't always up to date. See screenshot below for the CSV output. If you want to test it out, I am using the settings_USA.yaml file which you can find updated in my pull request.
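The title-and-description search for remoteness could be sketched roughly as follows. The keyword patterns, the `guess_remoteness` name, and the plain-string return values are assumptions for illustration, not what #166 implements.

```python
import re

# Hypothetical keyword patterns; extend them per locale and job board.
_REMOTE = re.compile(r"\b(remote|work from home|wfh)\b", re.IGNORECASE)
_HYBRID = re.compile(r"\bhybrid\b", re.IGNORECASE)


def guess_remoteness(title: str, description: str, tag: str = "") -> str:
    """Return 'hybrid', 'remote', or 'unknown' from the tag, title and description.

    'hybrid' is checked first because hybrid postings usually also
    mention the word 'remote'.
    """
    text = " ".join((tag, title, description))
    if _HYBRID.search(text):
        return "hybrid"
    if _REMOTE.search(text):
        return "remote"
    return "unknown"
```

Plain keyword matching will produce some false positives (for example "remote sensing" in a description), so it is probably best treated as a fallback for postings whose tags are missing or stale.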
-
Good approach! Decoupling the web engine and rotating proxies should help with CAPTCHA. Quick suggestion: consider ScraperAPI for reliable proxy management.
-
Hi there
Hope you are all doing well!
Is your feature request related to a problem? Please describe.
This is related to many problems that have appeared recently (CAPTCHA), but also to issues we have had in the past (dynamically loaded websites such as Glassdoor). Look at issues #144 and #142.
Describe the solution you'd like
I think CAPTCHA-related problems could be solved by taking the approach suggested in #142 of using https://github.com/pgaref/HTTP_Request_Randomizer. However, I'm thinking that the best way to approach this would be to make the web engine (using Selenium) a factory. Instead of having the web engine be part of the Job class, it could be decoupled altogether and have a function that looks something like:
This way, if we get a CAPTCHA in any step of scraping (whether it is while getting the description, the number of job pages, etc.), we can just request a new web engine from the function above that has a new proxy.
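A minimal sketch of what such a factory function could look like, assuming a caller-supplied proxy list. `WebEngineConfig` and `get_web_engine` are hypothetical names, and the sketch returns browser settings rather than a live Selenium driver so that it runs without a browser installed.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class WebEngineConfig:
    """Settings for one browser session (stand-in for a real driver)."""

    proxy: Optional[str]
    headless: bool = True

    def chrome_arguments(self) -> List[str]:
        """Command-line flags a Chrome-based driver could be started with."""
        args = []
        if self.headless:
            args.append("--headless=new")
        if self.proxy:
            args.append(f"--proxy-server={self.proxy}")
        return args


def get_web_engine(
    proxies: Sequence[str],
    avoid: Optional[str] = None,
    headless: bool = True,
) -> WebEngineConfig:
    """Return a fresh engine config, skipping the proxy that just failed."""
    candidates = [p for p in proxies if p != avoid]
    proxy = random.choice(candidates) if candidates else None
    return WebEngineConfig(proxy=proxy, headless=headless)
```

On a CAPTCHA, a scraper would call `get_web_engine(proxies, avoid=current.proxy)` and carry on with the new engine; a real version would hand each entry of `chrome_arguments()` to Selenium's Chrome options via `add_argument` when building the driver.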
As you can see, this also implies switching to Selenium, which I guess I'm proposing here as well. The reason for this is that if we switch to Selenium, we support both static and dynamic sites. And it looks like the web drivers do have headless support; the lack of it was one of the main reasons we didn't use Selenium in the past.
Describe alternatives you've considered
So far this is the only way I can think about tackling this at the moment. If anyone else has any other ideas, please don't hesitate to provide feedback!
Additional context
Hope these ideas make sense.
Cheers
Lorenzo