How can we re-build the scraping backend to account for recent CAPTCHA restrictions? #148

thebigG · 2021-06-01T17:58:15Z

thebigG
Jun 1, 2021
Collaborator

Hi there

Hope you are all doing well!

Is your feature request related to a problem? Please describe.
This is related to many problems that have appeared recently (CAPTCHA), but also to issues we have had in the past (dynamically loaded websites such as Glassdoor). See issues #144 and #142.

Describe the solution you'd like
I think CAPTCHA-related problems could be solved by taking the approach suggested in #142, using https://github.com/pgaref/HTTP_Request_Randomizer. However, I'm thinking that the best way to approach this would be to make the web engine (using Selenium) a factory. Instead of having the web engine be part of the Job class, it could be decoupled altogether behind a function that looks something like:

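def get_web_engine(headless: bool, arg1, arg2, etc):
    # Pick a new proxy on every call so a blocked one is not reused
    proxy = get_random_proxy()
    # init_web_engine stands for whatever creates the Selenium driver
    engine = init_web_engine(headless, proxy)
    ...
    return engine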

This way, if we get a CAPTCHA at any step of scraping (whether while getting the description, the number of job pages, etc.), we can just request a new web engine from the function above, which comes with a new proxy.
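
A minimal sketch of such a factory, assuming Selenium's Chrome driver. The `PROXIES` list, `get_random_proxy`, and `build_chrome_args` are hypothetical names for illustration and are not part of JobFunnel; the proxy pool could instead come from something like HTTP_Request_Randomizer.

```python
import random
from typing import List, Optional

# Hypothetical placeholder proxies (documentation-only addresses); replace with a real source.
PROXIES = ["203.0.113.10:8080", "203.0.113.11:3128"]


def get_random_proxy(rng: Optional[random.Random] = None) -> str:
    """Pick one proxy at random from the (hypothetical) pool."""
    return (rng or random).choice(PROXIES)


def build_chrome_args(headless: bool, proxy: str) -> List[str]:
    """Build the Chrome command-line switches for one engine instance."""
    args = [f"--proxy-server=http://{proxy}"]
    if headless:
        args.append("--headless")
    return args


def get_web_engine(headless: bool = True):
    """Return a new Selenium Chrome driver routed through a freshly chosen proxy."""
    # Imported lazily so the helpers above can be used without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in build_chrome_args(headless, get_random_proxy()):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

Because every call picks a proxy again, a caller that hits a CAPTCHA can discard its driver and simply ask for another one.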

As you can see, this also implies switching to Selenium, which I guess I'm proposing here as well. The reason is that with Selenium we support both static and dynamic sites. It also looks like the web drivers do have headless support; the lack of it was one of the main reasons we didn't use Selenium in the past.

Describe alternatives you've considered
So far this is the only way I can think about tackling this at the moment. If anyone else has any other ideas, please don't hesitate to provide feedback!

Additional context

Hope these ideas make sense.
Cheers
Lorenzo

PaulMcInnis · 2021-06-20T17:42:20Z

PaulMcInnis
Jun 20, 2021
Maintainer

It may also be worth looking into what other web scraping services do, as there are commercial offerings that provide capabilities similar to JobFunnel's.

Other stopgaps are falling back to Selenium on scrape failure, or more configurability for VPNs (i.e. switch VPNs after N scrapes or on scrape failure).
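
The "switch after N scrapes or on failure" policy could be sketched as below. This only tracks which endpoint to use next; how a VPN or proxy is actually switched is left out, and `RotatingEndpoint` and its parameters are hypothetical names, not existing JobFunnel settings.

```python
from itertools import cycle
from typing import Iterable


class RotatingEndpoint:
    """Cycle through endpoints, switching after max_uses scrapes or on failure."""

    def __init__(self, endpoints: Iterable[str], max_uses: int = 20) -> None:
        # Assumes at least one endpoint is supplied.
        self._pool = cycle(list(endpoints))
        self._max_uses = max_uses
        self._uses = 0
        self.current = next(self._pool)

    def rotate(self) -> str:
        """Force a switch, e.g. right after a failed scrape."""
        self.current = next(self._pool)
        self._uses = 0
        return self.current

    def acquire(self) -> str:
        """Return the endpoint for the next scrape, rotating once the budget is spent."""
        if self._uses >= self._max_uses:
            self.rotate()
        self._uses += 1
        return self.current
```

A scraper would call `acquire()` before each request and `rotate()` whenever a scrape fails.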

We can fairly easily detect the "I am human" page. In the short term I think we should provide a better error for Indeed specifically around detecting this page.
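
A rough sketch of that detection, assuming the challenge page can be recognised by a few substrings in the returned HTML. The marker strings and the `CaptchaDetectedError` name are assumptions and should be checked against a saved copy of the real page.

```python
class CaptchaDetectedError(RuntimeError):
    """Raised when a response looks like a CAPTCHA challenge instead of results."""


# Assumed markers; a plain substring match can produce false positives.
CAPTCHA_MARKERS = ("captcha", "i am human")


def raise_if_captcha(html: str, provider: str = "Indeed") -> None:
    """Raise CaptchaDetectedError if the page contains a known CAPTCHA marker."""
    lowered = html.lower()
    for marker in CAPTCHA_MARKERS:
        if marker in lowered:
            raise CaptchaDetectedError(
                f"{provider} returned a CAPTCHA page (matched {marker!r}); "
                "wait or change IP before retrying."
            )
```

Calling this right after each page fetch would turn the generic "Failed to scrape jobs" message into one that names the actual cause.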

As an aside I just tested it now and got to ~66 scrapes before the CAPTCHA, oh well.

0 replies

thebigG · 2021-06-20T17:49:01Z

thebigG
Jun 20, 2021
Collaborator Author

As an aside I just tested it now and got to ~66 scrapes before the CAPTCHA, oh well.

Right. I noticed this too a couple of weeks back. And this is exactly why I thought the factory pattern for Selenium might be a good fit. If a scrape fails (and like you said, we should have better mechanisms for detecting when a CAPTCHA shows up), then we just resend the request via a random proxy.
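
The retry idea could look roughly like this. `CaptchaError`, `make_engine`, and `scrape` are hypothetical stand-ins: `make_engine` is whatever factory returns a new engine (ideally on a new proxy), and `scrape` is any scraping step that raises `CaptchaError` when it is blocked.

```python
from typing import Callable, TypeVar

T = TypeVar("T")
E = TypeVar("E")


class CaptchaError(Exception):
    """Hypothetical error raised by a scrape step that hits a CAPTCHA."""


def scrape_with_retries(
    make_engine: Callable[[], E],
    scrape: Callable[[E], T],
    max_attempts: int = 3,
) -> T:
    """Run scrape(engine); on CaptchaError, retry with a brand-new engine."""
    last_error = None
    for _ in range(max_attempts):
        engine = make_engine()  # each call is expected to pick a new proxy
        try:
            return scrape(engine)
        except CaptchaError as err:
            last_error = err
        finally:
            # Selenium drivers expose quit(); other engine types are left alone.
            quit_engine = getattr(engine, "quit", None)
            if callable(quit_engine):
                quit_engine()
    raise CaptchaError(f"gave up after {max_attempts} attempts") from last_error
```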

0 replies

benb0jangles · 2021-11-01T05:33:49Z

benb0jangles
Nov 1, 2021

https://github.com/niespodd/browser-fingerprinting

0 replies

chris-aeviator · 2021-12-01T08:57:11Z

chris-aeviator
Dec 1, 2021

how can I see that I'm affected by a CAPTCHA issue?

I'm getting 0 results though 2 pages of jobs are found and a plain scraping error (so not even one result, no previous scrapes from this machine)

[2021-12-01 08:50:39,070] [INFO] JobFunnel: Scraping local providers with: ['IndeedScraperDEGer']
[2021-12-01 08:50:39,939] [INFO] IndeedScraperDEGer: Found 2 pages of search results for query={redacted}
[2021-12-01 08:50:40,812] [INFO] IndeedScraperDEGer: Scraped 0 job listings from search results pages
[2021-12-01 08:50:40,819] [ERROR] JobFunnel: Failed to scrape jobs for IndeedScraperDEGer
[2021-12-01 08:50:40,821] [INFO] JobFunnel: Completed all scraping, found 0 new jobs.

1 reply

PaulMcInnis Dec 1, 2021
Maintainer

You will see per-job scrapes fail past a certain number, maybe 20 or 30 jobs.

chris-aeviator · 2021-12-01T09:21:31Z

chris-aeviator
Dec 1, 2021

I believe seeing the captcha on the first ever (!) try means indeed is able to detect the headless browser.

One option would be to try whether https://github.com/diprajpatra/selenium-stealth helps; another might be cutting down the JS loaded in the browser as far as possible.

0 replies

benb0jangles · 2021-12-01T10:51:43Z

benb0jangles
Dec 1, 2021

I have been trying apify indeed scraper with success. It seems to have a way around errors and captchas.

0 replies

benb0jangles · 2021-12-01T11:00:24Z

benb0jangles Dec 1, 2021

I meant perhaps have a look at the log on the apify indeed scraper as it runs, to see how it works?

chris-aeviator · 2021-12-01T16:46:39Z

chris-aeviator Dec 1, 2021

ok - I understand - solving hcaptcha with YoloV3 basically works within <10 sec

PaulMcInnis · 2021-12-01T17:01:22Z

PaulMcInnis Dec 1, 2021
Maintainer

hey @chris-aeviator and @benb0jangles please use reply if you are responding to something, to avoid starting a new thread every time.

thx for the discussion all!

chris-aeviator · 2021-12-01T16:44:07Z

chris-aeviator
Dec 1, 2021

It seems like CAPTCHA is not the issue (with Indeed), but rather the HTML class names / the scraper.

Indeed renders the jobs as .result instead of data-tn-component, which is what _get_job_soups_from_search_page looks for.

EDIT: the screenshot is the content of self.session.get(url).text saved as HTML
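
One quick way to check which markup a saved page uses is to count both kinds of element. The sketch below uses the standard-library `html.parser` to stay dependency-free (the scraper itself works with BeautifulSoup); `JobCardCounter` and `count_job_cards` are hypothetical helper names.

```python
from html.parser import HTMLParser


class JobCardCounter(HTMLParser):
    """Count elements matching either the old or the new job-card markup."""

    def __init__(self) -> None:
        super().__init__()
        self.old_style = 0  # elements carrying a data-tn-component attribute
        self.new_style = 0  # elements whose class list includes "result"

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if "data-tn-component" in attr_map:
            self.old_style += 1
        if "result" in (attr_map.get("class") or "").split():
            self.new_style += 1


def count_job_cards(html: str) -> dict:
    """Return how many elements match each selector style in the given HTML."""
    parser = JobCardCounter()
    parser.feed(html)
    return {"data-tn-component": parser.old_style, ".result": parser.new_style}
```

If the `.result` count is non-zero while the `data-tn-component` count is zero, the page loaded fine and the selectors are what need updating.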

4 replies

PaulMcInnis Dec 1, 2021
Maintainer

hello, I encourage you to open an issue and a PR to update the scraper 👍

Even if the scraping of jobs does not itself work, the captcha will still be an issue, which this thread is concerned with.

Nllii Apr 3, 2022

Hello, If you haven't closed this project, are you open to reverse engineering some stuff?
I went ahead and started:
https://github.com/Nllii/jobfilter/tree/main/job_apis

Let me know so I can go ahead and break stuff implementing it in Jobfunnel, Hopefully I get help fixing it.

Lastly, I was reading the code, https://github.com/PaulMcInnis/JobFunnel/blob/master/jobfunnel/backend/tools/filters.py
You are familiar with nltk and numpy; do you want to implement computer vision in your project using cv2 and training a model using https://www.fast.ai?

PaulMcInnis Apr 4, 2022
Maintainer

Hello @Nllii if you are open to improving this repo with some commits that would be most appreciated.

However, I am not interested in deploying cv2-based models to defeat captcha.

Nllii Apr 5, 2022

However, I am not interested in deploying cv2-based models to defeat captcha.

It's not to defeat captcha. I have used opencv bots in the past to play online games, chess games and connect four, and I was thinking this idea could be used to apply to jobs faster.

It scans webpages, uses OCR to find the required pages, and fills in the information.
This is the concept: https://github.com/g-arnav/DinoML

anoduck · 2022-09-17T19:18:43Z

anoduck
Sep 17, 2022

Stupid idea, just running it up the flag pole, because I thought this was a good project.

Since this project does employ selenium, and presumably one of the various drivers as well, why not simply load "Buster, the captcha busting browser extension"? The developer has invested heavily in its development, and it worked on a project where I encountered the same issue.

Admittedly, I was also using a randomly selected proxy, useragent, and something to mitigate cloudflare. It did work, though, and took care of the captcha for me.

Sorry, if this is a stupid suggestion. Thanks for your time.

13 replies

jeremyjs Dec 15, 2022

I am interested in contributing/collaborating.

Seems like apify uses Crawlee under the hood: https://crawlee.dev/docs/introduction.

anoduck Dec 16, 2022

@jeremyjs I am definitely interested as well. Since I am on mobile and walking, someone needs to tag the other guys and see if they are interested.

I am currently working on a scraping project of my own, and just happened to upgrade captcha buster to the newest release. It still works for that particular site, without need of a proxy or cloudflare mitigation. I will try to write a test script today and see if it works with some of the job sites. That is if I remember to.

PaulMcInnis Feb 14, 2023
Maintainer

thanks for taking a look into this @anoduck

anoduck Feb 16, 2023

@PaulMcInnis You must have read the "...if I remember to...", because I didn't remember. Although, I have written a few more crawlers since then. I will get on it. Cheers.

caeochoa May 6, 2023

@anoduck @jeremyjs @a-curious-coder I've been recently looking into this project and I'm really interested in getting it back to working. Not sure if any of you got very far with this but I would like to collaborate on this. I see there was some talk of maybe making a discord or matrix to chat about it, I think that would be a good idea to get on the same page on what progress has been made recently and what's needed now.

autodidactdev · 2023-01-03T22:23:21Z

autodidactdev
Jan 3, 2023

Can we try using a captcha solving service (for web scraping)? For example, 2captcha has an API which can be integrated for solving captchas (https://2captcha.com/2captcha-api). The downside is that it does require spending money, but it's fairly cheap.

1 reply

Nllii Feb 14, 2023

http://www.catb.org/~esr/faqs/hacker-howto.html#believe2

No problem should ever have to be solved twice.

it's not a permanent solution. We don't have the freedom. There is Friction between us and them.

anoduck · 2023-01-05T23:38:52Z

anoduck
Jan 5, 2023

It isn't a bad idea, almost practical, and I am solely speaking for myself here, but I personally am always turned off by projects which require pay-for-use APIs. Their use creates barriers that prevent others from accessing technology and thus limits innovation. It also seems counterintuitive to expect an individual who needs a job to have to pay to acquire one, because usually that individual is facing poverty and already experiencing financial hardships.

0 replies

anoduck · 2023-02-18T12:41:49Z

anoduck
Feb 18, 2023

So, after getting everything set up and ready to go, which was more involved than originally assumed, the tests were run, and the result was not a successful one. The scrape appears to have failed for some unknown reason. When the URL was opened in a browser to test whether the link was good, the page opened without any error or sign of a captcha. So it is uncertain whether the captcha prevented the scraping or something else interfered with the script. Further testing will be needed to find the cause of the failure.

Output from log.log:

[2023-02-18 05:50:31,156] [DEBUG] JobFunnel: No master-CSV present, did not update block-list: job_search_results/block_list.json
[2023-02-18 05:50:31,157] [INFO] JobFunnel: Scraping local providers with: ['IndeedScraperUSAEng', 'MonsterScraperUSAEng']
[2023-02-18 05:50:31,303] [DEBUG] IndeedScraperUSAEng: Got Base search results page: https://www.indeed.com/jobs?q=Linux&l=boston%2C+MA&radius=50&limit=50&filter=0
[2023-02-18 05:50:31,307] [ERROR] JobFunnel: Failed to scrape jobs for IndeedScraperUSAEng
[2023-02-18 05:50:31,308] [DEBUG] JobFunnel: Scraped 0 jobs from IndeedScraperUSAEng, took 0.151s
[2023-02-18 05:50:31,312] [INFO] MonsterScraperUSAEng: No get() or set() will be done for Job attrs: ['REMOTENESS']
[2023-02-18 05:50:31,690] [ERROR] JobFunnel: Failed to scrape jobs for MonsterScraperUSAEng
[2023-02-18 05:50:31,690] [DEBUG] JobFunnel: Scraped 0 jobs from MonsterScraperUSAEng, took 0.382s
[2023-02-18 05:50:31,690] [INFO] JobFunnel: Completed all scraping, found 0 new jobs.
[2023-02-18 05:50:31,699] [WARNING] JobFunnel: No new jobs were added to CSV.

On a brighter note, while manually checking the URL provided by Indeed, I noticed the connection was routed through Cloudflare. Although in my tests Cloudflare did not appear to be causing the experienced issues, it might be worth implementing a mitigation strategy for Cloudflare to prevent scraping failures in the future.

So, below are three mitigation tools which might prove useful for this.

cfscrape-http-proxy: Creates a proxy that implements the cloudscraper module. #Active
cloudscraper: Appears to be the only actively developed means to bypass the Cloudflare network as a Python module. Implementation method is unknown. Uses the requests library, so using selenium-requests might be an option. #Active #Requests
FlareSolver: Can only be run in a dockerized container, and is based on nodejs. #Honorable_Mention

6 replies

anoduck Mar 3, 2023

Ah, fiddlesticks... Getting a nasty error attempting to execute funnel load -s my_settings.yml in a virtual environment. Unfortunately, now that Python has implemented PEP 668, I can't simply tell pip to install the packages at the user level with the --user flag without also passing the --break-system-packages flag. Opening up an issue with the full output.

anoduck Mar 3, 2023

Might try to build a docker image for poetry later on. Since that seems like the "trendy" thing to do these days... This would allow bypassing recent issues encountered.

PaulMcInnis Mar 8, 2023
Maintainer

aye if you want to do some work on this issue you know it will be appreciated, there are a lot of 👀 on this project, I just haven't been able to give it the love it deserves.

anoduck Mar 12, 2023

@PaulMcInnis The issue was resolved.

anoduck Mar 12, 2023

It does appear CaptchaBuster has successfully mitigated the captcha, and the reason the page still cannot be scraped is outdated Beautiful Soup selectors, as the results page has changed significantly since the backend was last written.

anoduck · 2023-03-09T22:39:49Z

anoduck
Mar 9, 2023

@PaulMcInnis With much embarrassment, I would like to say the issue has been resolved. It resulted from conflicting libraries on my local development server, as a result of the system updating from python-3.10 to python-3.11. After performing some clean up, I reran the program, it works fine.

I hope to get a chance to update the Indeed element selectors soon. This will allow us the opportunity to test if buster does the trick to circumvent the captcha.

It is lightning now, gotta run.

Cheers.

3 replies

PaulMcInnis Mar 17, 2023
Maintainer

hey no worries, I'm just happy to see someone taking a stab at reviving this 👍

anoduck Mar 27, 2023

@PaulMcInnis the first thing that ran through my mind when I encountered this project was that it had real application to life, and it wasn't just another application to do something cool. It actually had the potential to greatly change and improve someone's life. Its creation was altruistic. Yes, I am sure creating it benefited you too, but think about how many people it also helped gain employment. That is my two cent spiel, at least...

PaulMcInnis Mar 27, 2023
Maintainer

I actually found a job by using the v 0.1 of this tool 😄

I never expected anything to come of it when I shared the source on HN, but it ended up blowing up a bit, and it feels like it had some impact too as evidenced by the usership in that first year or so.

Definitely things have changed a lot since I released it, and I've considered closing the repo a few times (as in open to forks, but closed for my maintenance), since I haven't had a good path forwards to making it really useful again.

If the captcha stuff hadn't clamped down so hard I think maybe I would have invested more time into the TFIDF features and so on as well.

anoduck · 2023-05-07T02:42:12Z

anoduck
May 7, 2023

@caeochoa I have been working on it very slowly, until I got totally distracted by one of life's humdingers. I have forked it, and most of my revisions have been pushed to the fork. You are more than welcome to take a look. https://github.com/anoduck/jobfunnel Just don't judge me, I am still learning programming, and not a pro like @PaulMcInnis.

I created a Discussion category on my github profile. We can go there or move to matrix. Either place, I just try to avoid discordia.

Cheers

3 replies

caeochoa May 8, 2023

@anoduck That's great! Thank you, I will have a look and if I have any thoughts I'll add them to the discussion on your profile. And don't worry, I'm also learning programming at the moment so you're likely to know more than I do haha

anoduck May 9, 2023

@caeochoa We will learn together. When you go and have a gander, make sure you change over to the “dev” branch, which is where all the changes have been made.

Primarily, the changes occurred in the indeed backend file.

The last time I was working on it, I was focusing on just getting the Indeed backend running, and I hit a roadblock. I was receiving the error "Failed to scrape jobs for IndeedScraperUSAEng". What I was not receiving was any indication of where things were going wrong. Funnel was retrieving the correct URL for the page listing the jobs, which could signal that captcha is no longer an issue, but I couldn't get funnel to generate enough output to troubleshoot further. With debugging enabled, all I received was:

[2023-05-09 03:29:27,068] [DEBUG] JobFunnel: No master-CSV present, did not update block-list: job_search_results/block_list.json
[2023-05-09 03:29:27,104] [INFO] JobFunnel: Scraping local providers with: ['IndeedScraperUSAEng']
[2023-05-09 03:29:27,387] [DEBUG] IndeedScraperUSAEng: Got Base search results page: https://www.indeed.com/jobs?q=Linux&l=sacremento%2C+CA&radius=50&limit=50&filter=0
[2023-05-09 03:29:27,424] [ERROR] JobFunnel: Failed to scrape jobs for IndeedScraperUSAEng
[2023-05-09 03:29:27,425] [DEBUG] JobFunnel: Scraped 0 jobs from IndeedScraperUSAEng, took 0.320s
[2023-05-09 03:29:27,425] [INFO] JobFunnel: Completed all scraping, found 0 new jobs.
[2023-05-09 03:29:27,443] [WARNING] JobFunnel: No new jobs were added to CSV.

So, if the issue is in the selectors, something is going to need to be implemented in order to provide that information.
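
As a sketch of what that extra output could be, the helper below logs a one-line summary and saves the raw HTML whenever a results page yields no jobs. The function name, logger name, and dump file name are all hypothetical; nothing like this exists in the backend as far as this thread shows.

```python
import logging
from pathlib import Path

logger = logging.getLogger("IndeedScraperDebug")  # hypothetical logger name


def report_empty_scrape(html: str, status_code: int, dump_dir: str = ".") -> str:
    """Log why a results page produced no jobs and save the HTML for inspection."""
    dump_path = Path(dump_dir) / "last_empty_scrape.html"
    dump_path.write_text(html, encoding="utf-8")
    # Crude check; a marker hit suggests a challenge page rather than stale selectors.
    looks_like_captcha = "captcha" in html.lower()
    summary = (
        f"0 jobs parsed: status={status_code}, chars={len(html)}, "
        f"captcha_marker={looks_like_captcha}, saved={dump_path}"
    )
    logger.error(summary)
    return summary
```

Opening the saved file in a browser then shows directly whether the page is a challenge, an empty result, or a layout the selectors no longer match.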

anoduck Jun 3, 2023

@caeochoa Don't know if you are still interested, since I never heard back from you. But I have created a Matrix/Element room for jobfunnel development.

This should be the invite here

I will try 🤞🏻 to keep a client going at all times.

Charlotte-br560 · 2024-03-28T09:43:45Z

Charlotte-br560
Mar 28, 2024

Your proposed solution seems solid, incorporating HTTP_Request_Randomizer and transitioning to Selenium for dynamic site support. Have you considered integrating Crawlbase for enhanced functionality?

1 reply

anoduck Mar 29, 2024

@Charlotte-br560 Not to sound conceited, but I am assuming you're referring to my suggestion to randomize the proxy and user request on a per-scrape/per-session basis. It should work, but the backend needs refactoring and a complete work over. A new testing framework would have to be implemented to streamline testing.
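
A minimal sketch of that per-session randomization, assuming small in-memory pools. `USER_AGENTS`, `PROXIES`, and `random_session_settings` are hypothetical names, and the proxy addresses are placeholders.

```python
import random
from typing import Dict, Optional, Tuple

# Hypothetical pools; real values would come from configuration or a proxy list.
USER_AGENTS = (
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0",
)
PROXIES = ("203.0.113.10:8080", "203.0.113.11:3128")


def random_session_settings(
    rng: Optional[random.Random] = None,
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Pick a user agent and a proxy for one scraping session."""
    rng = rng or random.Random()
    headers = {"User-Agent": rng.choice(USER_AGENTS)}
    proxy = rng.choice(PROXIES)
    # Same proxy for both schemes so the whole session exits from one address.
    proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    return headers, proxies
```

With the requests library, the two dicts could be applied via `session.headers.update(headers)` and `session.proxies.update(proxies)` at the start of each session.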

I started to rework the backend, but encountered complications in my personal life that had to be dealt with.

sammytheindi · 2024-09-09T13:28:05Z

sammytheindi
Sep 9, 2024
Collaborator

Not sure if this project is dead, but has anyone tried or thought about hitting the mobile app endpoints for Indeed as an alternative? It's definitely a WebView, but there is no CAPTCHA (yet, at least) that I can see. Fewer results (20 per page vs. 50 per page), but significantly reduces the complexity. I've found the endpoints through Charles Proxy, and am able to scrape in my minimal tests without any CAPTCHA issues.

Additionally, I have been working with SeleniumBase UC mode to get the indeed scraper to bypass CAPTCHA on desktop. It works in headed mode, and also works (sometimes) in headless mode. I have at least managed to get it to a place where it can scrape and save to the CSV.

We could just start a headed browser for requests when necessary, but it would be nice if we could do everything in headless, especially for large scrapes.

3 replies

PaulMcInnis Sep 10, 2024
Maintainer

👋 Hey there, I am totally open to any and all PRs to resurrect this one, if mobile works it works.

I am sort of close to just rewriting this to use typescript and using more modern tooling as well, as python in this way is just a bit antiquated rn (another option is also to pull in some kind of easy-to-use VPN proxy to spread requests across many IPs)

PaulMcInnis Sep 10, 2024
Maintainer

I have also gotten some emails w.r.t captcha solvers but yeah... feels a bit too unethical for me to endorse this tooling publicly, especially when they are not free options.

anoduck Sep 12, 2024

Whaaaaaaaat!!!! That's brilliant.

sammytheindi · 2024-09-11T21:21:17Z

sammytheindi
Sep 11, 2024
Collaborator

Have opened #166 as a fix to the indeed scraping issue. Haven't had a ton of time to work on this outside of the day job, so would appreciate any help on review/testing/updating, as well as implementation of the French and German versions of the Indeed scraper if this approach seems reasonable. Fixes should not be time consuming, just need to check that the sites are consistent.

For now, I have just taken the mobile endpoints I found analyzing the mobile version of the application, and am using randomized mobile agents. The scraper is working, and seems to be working consistently with no CAPTCHA issues. I have also updated the parsing so it now accounts for tags and remoteness. For remoteness in particular, I think we can probably make this even better by searching for the term in the title and description as the tags aren't always up to date. See screenshot below for the CSV output.
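
The title and description search for remoteness could be sketched like this. The keyword patterns and the returned labels are assumptions for illustration; they are not the values used in #166, and plain keyword matching will misread phrases such as "no remote work".

```python
import re
from typing import Iterable

# Assumed keyword patterns; tune them per language and provider.
_HYBRID = re.compile(r"\bhybrid\b", re.IGNORECASE)
_REMOTE = re.compile(r"\b(remote|work from home|wfh)\b", re.IGNORECASE)


def infer_remoteness(title: str, description: str, tags: Iterable[str] = ()) -> str:
    """Guess remoteness from the title, description, and tags, in that combined text."""
    text = " ".join([title, description, *tags])
    # Check hybrid first, since hybrid postings usually mention "remote" too.
    if _HYBRID.search(text):
        return "hybrid"
    if _REMOTE.search(text):
        return "remote"
    return "unknown"
```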

If you want to test it out, I am using the settings_USA.yaml file which you can find updated in my pull request.

2 replies

anoduck Oct 26, 2024

Now, you are just showing off. I will give it a run tonight.

anoduck Nov 2, 2024

I was able to scrape 423 job listings from indeed. So, I would say that worked rather well.

Charlotte-br560 · 2025-02-08T11:30:40Z

Charlotte-br560
Feb 8, 2025

Good approach! Decoupling the web engine and rotating proxies should help with CAPTCHA. Quick suggestions:

Consider ScraperAPI for reliable proxy management.
Use undetected-chromedriver to avoid Selenium detection.
Try CAPTCHA solvers like 2Captcha if needed.
Handle failures & retries in get_web_engine for resilience.
Let me know if you need refinements!

0 replies
