Should not visit pages that have already been visited #23

Open
abineetds opened this issue Oct 10, 2022 · 5 comments
Comments

@abineetds

How can I make it not visit the same page multiple times?
How can I make it so that it doesn't visit any pages outside of the domain?

@abineetds
Author

Also, when I ran it with a memo, I eventually got an error:

.../gems/ruby-3.1.2/gems/ferrum-0.11/lib/ferrum/browser/web_socket.rb:19:in `initialize': Too many open files - socket(2) for "127.0.0.1" port 65073 (Errno::EMFILE)

@route
Member

route commented Oct 11, 2022

> Also, when I ran it with a memo, I eventually got an error:
>
> .../gems/ruby-3.1.2/gems/ferrum-0.11/lib/ferrum/browser/web_socket.rb:19:in `initialize': Too many open files - socket(2) for "127.0.0.1" port 65073 (Errno::EMFILE)

I think you should tune your OS, for example by raising the open-file limit on Linux.
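To see which limit is in effect, the open-file limit can be inspected, and the soft limit raised for the current process, from Ruby itself. This is a minimal sketch using only the standard `Process.getrlimit` and `Process.setrlimit` calls; the helper name `raise_nofile_limit` and the value 4096 are assumptions for illustration, not something taken from this thread. Raising the limit only postpones `Errno::EMFILE` if descriptors are actually being leaked.

```ruby
# Hypothetical helper: raise the soft open-file limit towards `desired`,
# never above the hard limit and never below the current soft limit.
def raise_nofile_limit(desired)
  soft, hard = Process.getrlimit(Process::RLIMIT_NOFILE)
  target = [desired, hard].min
  return soft if target <= soft

  Process.setrlimit(Process::RLIMIT_NOFILE, target, hard)
  Process.getrlimit(Process::RLIMIT_NOFILE).first
rescue Errno::EINVAL, Errno::EPERM
  # Some platforms reject the request; report whatever limit is in effect.
  Process.getrlimit(Process::RLIMIT_NOFILE).first
end

soft, hard = Process.getrlimit(Process::RLIMIT_NOFILE)
puts "open-file limit: soft=#{soft} hard=#{hard}"

# 4096 is an arbitrary example value.
puts "soft limit now: #{raise_nofile_limit(4096)}"
```

The change applies only to the current process and the children it spawns afterwards, so it would have to run before the crawler starts the browser.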

@route
Member

route commented Oct 11, 2022

As for the issue itself, I plan to introduce an option for requests, but unfortunately it won't work for all websites, so it's going to be strictly opt-in.
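Until such an option exists, the two questions in the original post (skip pages already visited, stay inside the start domain) can be approximated in user code. The following is a minimal sketch using only Ruby's standard `Set` and `URI`; the `VisitFilter` class is hypothetical and not part of the crawler's or Ferrum's API. It assumes links have already been resolved to absolute URLs.

```ruby
require "set"
require "uri"

# Hypothetical helper, not part of the crawler's API.
class VisitFilter
  def initialize(start_url)
    @host = normalize(start_url)&.host
    @seen = Set.new
  end

  # True the first time an in-domain URL is offered, false otherwise.
  def visit?(url)
    uri = normalize(url)
    return false if uri.nil? || uri.host != @host

    # Set#add? returns nil when the element is already present.
    !@seen.add?(uri.to_s).nil?
  end

  private

  # Drop the fragment and normalize host case and the empty path so that
  # trivially different spellings of one page compare equal.
  def normalize(url)
    uri = URI.parse(url)
    return nil unless uri.is_a?(URI::HTTP) && uri.host

    uri.fragment = nil
    uri.host = uri.host.downcase
    uri.path = "/" if uri.path.empty?
    uri
  rescue URI::InvalidURIError
    nil
  end
end

filter = VisitFilter.new("https://example.com/")
puts filter.visit?("https://example.com/about")      # first visit
puts filter.visit?("https://example.com/about#team") # same page, fragment only
puts filter.visit?("https://other.org/about")        # different host
```

Comparing normalized URL strings is one reason this kind of check cannot work for every site: pages whose content depends on query parameters, cookies, or client-side routing may share a URL or use many URLs for the same content.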

@abineetds
Author

> > Also, when I ran it with a memo, I eventually got an error:
> >
> > .../gems/ruby-3.1.2/gems/ferrum-0.11/lib/ferrum/browser/web_socket.rb:19:in `initialize': Too many open files - socket(2) for "127.0.0.1" port 65073 (Errno::EMFILE)
>
> I think you should tune your OS, for example by raising the open-file limit on Linux.

What is the root cause of this? It seems to me that when opening a TCP socket connection, Ferrum opens a file descriptor but never closes it. That shouldn't happen, since the number of pages being processed at once is at most the number of processors (unless overridden).

@route
Member

route commented Oct 11, 2022

Ferrum opens only one connection per page and closes it once the page has been processed, releasing both the page and the connection. So most likely something is wrong with the crawler.
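One way to narrow down where descriptors go missing is to count open IO objects before and after a batch of work. The sketch below is not Ferrum or crawler code: it uses only the standard library, with a local `TCPServer` standing in for the browser and the hypothetical `with_connection` helper standing in for "process one page". It illustrates the open-then-always-close pattern described above, under which the count should not grow.

```ruby
require "socket"

# Count IO objects (files, sockets, pipes) still open in this process.
def open_io_count
  ObjectSpace.each_object(IO).count do |io|
    begin
      !io.closed?
    rescue IOError
      false # uninitialized stream, nothing is open
    end
  end
end

# Stand-in for "process one page": open a connection and always close it,
# even if the block raises.
def with_connection(port)
  socket = TCPSocket.new("127.0.0.1", port)
  yield socket
ensure
  socket&.close
end

server = TCPServer.new("127.0.0.1", 0) # port 0: let the OS pick a free port
port = server.addr[1]

before = open_io_count
50.times do
  with_connection(port) { |socket| socket.remote_address }
  server.accept.close # close the server side of each connection too
end
after = open_io_count
server.close

puts "open IO objects: before=#{before} after=#{after}"
```

If the same measurement around a real crawl shows the count climbing steadily, some code path is skipping its close.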


2 participants